To get started with analysis on your data, the first thing you need to do is bring the data into a proper shape and size, in a clean format. That is what lets you begin analysis or apply ML algorithms to the data set. If you are using Python for the cleaning, the two major libraries that will help you are NumPy and pandas.
I will demo this with code snippets and comments. I worked through these in a Kaggle kernel, and you can pick up a data set and follow along the same way.
#importing the numpy, pandas and matplotlib to get started.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
#reading the data set to df as data frame to get started
df = pd.read_csv('/kaggle/input/new-york-city-airbnb-open-data/AB_NYC_2019.csv')
#to get the details on the data set, such as column names and the counts of nulls and non-nulls
df.info()
RangeIndex: 48895 entries, 0 to 48894
Data columns (total 16 columns):
id 48895 non-null int64
name 48879 non-null object
host_id 48895 non-null int64
host_name 48874 non-null object
neighbourhood_group 48895 non-null object
neighbourhood 48895 non-null object
latitude 48895 non-null float64
longitude 48895 non-null float64
room_type 48895 non-null object
price 48895 non-null int64
minimum_nights 48895 non-null int64
number_of_reviews 48895 non-null int64
last_review 38843 non-null object
reviews_per_month 38843 non-null float64
calculated_host_listings_count 48895 non-null int64
availability_365 48895 non-null int64
dtypes: float64(3), int64(7), object(6)
#to get summary statistics for the numeric columns, like count, mean, std, min and the quartiles
df.describe()
#to get the shape of the data set which is the number of rows and columns, you can use the shape
df.shape
# to look at values of index
df.index
#to get the column names
df.columns
#to look at the values
df.values
#you can also rename the columns
#inplace will make the change permanent
df.rename(columns={'original name': 'new name'}, inplace=True)
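For example, on this data set you could shorten neighbourhood_group; the new name borough below is just an illustration, not a column the later examples rely on.
#rename neighbourhood_group to borough (illustrative only)
df.rename(columns={'neighbourhood_group': 'borough'}, inplace=True)
#rename it back so the rest of the examples still match the original column names
df.rename(columns={'borough': 'neighbourhood_group'}, inplace=True)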
#Slicing and Subsetting
#To subset a dataframe, the tools you will be using are [ ], loc and iloc.
#loc is label based, you need to pass the label names
#iloc is integer index based, you need to provide the index locations of the rows or columns
#to get the values of the column at position 2 (the third column); .values returns a NumPy array
df.iloc[:,2:3].values
# to get the third, fourth and fifth rows (positions 2 to 4) of the id column
df[['id']].values[2:5, 0:1]
#to get the rows 0, 1, 2
df[0:3]
# Select the first 5 rows (rows 0, 1, 2, 3, 4)
df[:5]
# Select the last row
# (the slice starts at the last row and runs to the end of the frame)
df[-1:]
#to get a particular column by column name
#the below will actually return a Series
df['host_name']
#to select multiple columns, you need to provide the list of column names
#this returns a data frame
df[['host_name', 'name', 'id']]
#to filter to a particular condition
df.loc[df['name']=='Clean & quiet apt home by the park']
# building on the same condition, you can also pull a range of columns
#this returns the columns from name through latitude for the matching rows
df.loc[df['name']=='Clean & quiet apt home by the park', 'name':'latitude']
#in this case the index values are numbers, so we pass numbers whether we use loc or iloc
#but there is still a subtle difference between loc and iloc
#to get just row 1
df.iloc[1:2]
#to get rows 1 and 2
df.loc[1:2]
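A quick way to see that difference is to compare how many rows each call returns; this small sketch assumes the default integer index from read_csv.
#iloc excludes the end position, loc includes the end label
print(len(df.iloc[1:2]))   #1 row
print(len(df.loc[1:2]))    #2 rows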
#now with this we can subset the columns as well
#you will get columns from 2 to 5
df.iloc[1:2,2:6]
# we cannot run the same slice with loc, because the columns are labelled by name rather than by position
#so with loc you spell out the column labels
df.loc[1:2, 'id':'name']
#row subsetting
#to get the 0,2,5 rows from data frame.
df.iloc[[0,2,5]]
#you can also set another index
#wherever you need the dataframe to be modified permanently, use the inplace = True
df.set_index('id', inplace=True)
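The inverse is reset_index, which moves id back into an ordinary column; it is shown here for reference only, since the drop example below relies on id being the index.
#returns a new frame with id as a regular column again; df itself is unchanged
df_reset = df.reset_index()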
#to drop rows or columns, you can use the drop syntax
#axis=0 means drop rows; here the labels are id values, since id is now the index
#axis=1 means drop columns
df.drop([2539,3647,5099],axis=0)
#this will drop a column
df.drop(['last_review'], axis=1)
#remember that neither drop is permanent, since inplace=True was not passed
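If you do want the drop to stick, either pass inplace=True or assign the result back to a variable, as in this small sketch.
#assigning the result back has the same effect as inplace=True
df_no_review = df.drop(['last_review'], axis=1)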
#to look at some specific rows
#for rows where price == 149
df[df['price']==149]
#to get a boolean array where the values are true or false based on a particular condition
df['price']==149
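These boolean arrays can be combined with & and | to filter on more than one condition at once; the sketch below uses columns from this same data set (the room_type value shown assumes its usual categories).
#rows where the price is 149 and the listing is a private room
df[(df['price'] == 149) & (df['room_type'] == 'Private room')]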
#if you want rows in descending order with a step size, you can try slicing with a negative step
df[8:1:-2]
#the above starts at row 8 and steps back by 2 until (but not including) row 1, giving rows 8, 6, 4 and 2
#if you need rows 1 to 5 but only the id column, you can try this (assuming id is still a column, i.e. before set_index)
df.loc[1:5, ['id']]
#Group by
#if you want to group by a certain column and look at the sum or average or any other stat, you can try this
df.groupby('host_name')['price'].sum()
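You can also compute several statistics in one go with agg; a minimal sketch on the same data set:
#average and maximum price per neighbourhood group
df.groupby('neighbourhood_group')['price'].agg(['mean', 'max'])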
#Now let's try some data wrangling concepts using NumPy arrays, which you can apply to your data frames as well
A = np.array([1,2,3,4])
cond = np.array([True, False, True, True])
A[cond]
#this returns only the elements of A where the condition is True
#use of list comprehension to create a list selectively
A = np.array([1,2,3,4])
B = np.array([100,200,300,400])
ans = [a_val if c else b_val for a_val, b_val, c in zip(A, B, cond)]
ans
#this will provide you the output of [1,200,3,4]
#np.where can also be used for conditional selection
a1 = [1,2,3,4]
b1 =[100,200,300,400]
c1=[True,False,True,False]
result = np.where(c1,a1,b1)
result
#output: array([1, 200, 3, 400])
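The same idea works directly on a data frame column, for example to flag expensive listings; the price_band column and the threshold of 200 below are just an illustration.
#hypothetical flag column: 'high' when price is above 200, otherwise 'low'
df['price_band'] = np.where(df['price'] > 200, 'high', 'low')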
#Some common uses
#column-wise minimums
df.min()
#summary statistics again
df.describe()
#boolean frame marking which cells are null
df.isnull()
#drop rows where every value is null
df.dropna(how='all')
#drop columns where every value is null
df.dropna(how='all', axis=1)
#keep only rows with at least 3 non-null values
df.dropna(thresh=3)
#fill every null with 1111
df.fillna(1111)
#fill nulls per column, where the dict keys are column labels (here a column labelled 0)
df.fillna({0: 1})
#fill nulls with the column means (on newer pandas you may need df.mean(numeric_only=True))
df.fillna(df.mean())
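In practice you will usually fill column by column, passing a dict keyed by column name; a sketch using this data set's columns:
#fill missing review rates with 0 and missing listing names with a placeholder
df.fillna({'reviews_per_month': 0, 'name': 'unknown'})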
#Let's look at how to join two data frames
dframe3 = pd.DataFrame({"key": ['X','X','X','Y','Z','Z'], "data_set_3": range(6)})
dframe4 = pd.DataFrame({"key": ['Y','Y','X','Y','Z'], "data_set_4": range(5)})
pd.merge(dframe3, dframe4, how='inner', on='key')
#you can use different combinations to get different types of joins
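For instance, how='left' keeps every key from the left frame and how='outer' keeps keys from both sides; a quick sketch with the same two frames:
#rows from dframe3 are kept even if the key has no match in dframe4
pd.merge(dframe3, dframe4, how='left', on='key')
#all keys from both frames, with NaN where there is no match
pd.merge(dframe3, dframe4, how='outer', on='key')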
#Using Concat
serx1 = pd.Series(np.arange(4), index=['A','B','C','D'])
serx2 = pd.Series(np.arange(4), index=['A','B','C','D'])
pd.concat([serx1,serx2],axis=1)
#if we pass axis=1, concat joins the two series side by side as two columns
#if we had used axis=0, we would have got a single longer series, with the second stacked below the first
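As a quick check, here is the axis=0 version, which simply stacks the two series end to end.
#a single series of eight values with a repeated A-D index
pd.concat([serx1, serx2], axis=0)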
#Pivoting the data table
#the arguments are index (the rows), columns and values
studentData = {
    'name': ['jack', 'Riti', 'Aadi', 'jack', 'Riti', 'Aadi'],
    'age': [34, 30, 16, 22, 45, 67],
    'city': ['Sydney', 'Delhi', 'New york', 'Bhopal', 'Indore', 'Mumbai']
}
pd.DataFrame(studentData).pivot(index='name', columns='city', values='age').fillna(0)
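If the same name and city pair occurred more than once, pivot would raise an error; pivot_table with an aggregation function handles duplicates, as in this sketch on the same data.
#aggregates duplicate name/city pairs with the mean instead of failing
pd.DataFrame(studentData).pivot_table(index='name', columns='city', values='age', aggfunc='mean').fillna(0)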