Data Pre-processing using Scikit-learn:
Data pre-processing is a data mining technique that you can use to convert raw data into an understandable format. In this practical, we will take one dataset and perform the following tasks:
(1) Data Encoding
(2) Normalization
(3) Standardization
(4) Imputing the Missing Values
(5) Discretization
Dataset Description:
I have used the ‘California Housing Prices’ dataset. It contains information such as longitude, latitude, ocean proximity, population, number of bedrooms, number of rooms, median house value, etc.
You can download the dataset by clicking the link below:
Dataset: California Housing Prices dataset
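As a quick start, here is a minimal sketch of loading the dataset with pandas. The file name housing.csv is an assumption, so adjust the path to wherever you saved the download.

```python
import pandas as pd

# Load the California Housing Prices dataset
# (assumes the downloaded CSV is saved as "housing.csv" in the working directory)
df = pd.read_csv("housing.csv")

print(df.shape)   # number of rows and columns
print(df.dtypes)  # column types; 'ocean_proximity' is the categorical column
print(df.head())  # first few rows
```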
Data Encoding:
Data encoding is the transformation of categorical variables into binary or numerical counterparts, where each category is assigned a unique value. An example is to encode male or female for gender as 1 or 0. There are two types of data encoding: (1) label encoding and (2) one-hot encoding.
(1) Label encoding:
If a column contains more than one category, we can use a LabelEncoder to convert those categories into numerical features. The LabelEncoder assigns a unique integer to each category.
As you can see, the ‘median_house_value’ column has 3842 categories, which are nothing but house value ranges. After using the LabelEncoder, the data is labeled: the 500001 house range is converted to 3841, the 137500 house range to 959, the 162500 house range to 1209, and so on.
The classes_ attribute helps us map each numerical label back to its original category (index 0: 14999 house range, index 1: 17500 house range, and so on).
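A minimal sketch of label encoding this column could look like the following (assuming the DataFrame is named df as above; the new column name is just illustrative):

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

# Fit on the column and replace its values with integer labels
df["median_house_value_label"] = le.fit_transform(df["median_house_value"])

# classes_ maps each integer label back to the original value:
# classes_[0] is the value encoded as 0, classes_[1] the value encoded as 1, ...
print(le.classes_[:5])
print(df[["median_house_value", "median_house_value_label"]].head())
```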
(2) One-hot encoder:
The one-hot encoder does the same job in a different way. While the label encoder assigns a particular number to each category, the one-hot encoder creates a whole new column for each category. So if a column has 3 categories, the one-hot encoder adds 3 new columns to your dataset.
Which one to use depends on the dataset and its behavior. The one-hot encoder increases the dimensionality, but it is usually the better choice: with label encoding, the model may treat the integer labels as ordered and compare them with each other, which leads to wrong assumptions. That is why one-hot encoding is used more in the real world. Still, I advise you to experiment with both.
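Here is a minimal sketch of one-hot encoding the ‘ocean_proximity’ column, which is the only truly categorical column in this dataset (pandas’ get_dummies would work as well; the sparse_output argument is named sparse on older scikit-learn versions):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder(sparse_output=False)  # use sparse=False on older scikit-learn versions

# Fit on the categorical column; the result has one binary column per category
encoded = ohe.fit_transform(df[["ocean_proximity"]])

# Wrap the result in a DataFrame with readable column names and join it back
encoded_df = pd.DataFrame(
    encoded,
    columns=ohe.get_feature_names_out(["ocean_proximity"]),
    index=df.index,
)
df = pd.concat([df.drop(columns="ocean_proximity"), encoded_df], axis=1)
print(df.head())
```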
Normalization:
Normalization is a scaling technique in which values are shifted and rescaled so that they end up ranging between 0 and 1. Real-world data is rarely on the same scale; different columns almost always have different ranges. To bring all the columns onto one scale, we can use normalization.
MinMaxScaler: For each value in a feature, MinMaxScaler subtracts the minimum value of the feature and then divides by the range, where the range is the difference between the original maximum and the original minimum.
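A minimal sketch of applying MinMaxScaler to a couple of numeric columns (the column choice here is just for illustration):

```python
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

# Scale the selected numeric columns to the [0, 1] range:
# x_scaled = (x - min) / (max - min)
cols = ["total_rooms", "population"]
df[cols] = scaler.fit_transform(df[cols])

print(df[cols].min())  # each column's minimum is now 0
print(df[cols].max())  # each column's maximum is now 1
```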
Standardization:
Standardization is another scaling technique, where the values are centered around the mean with a unit standard deviation. This means the mean of the attribute becomes zero and the resulting distribution has a unit standard deviation (i.e. standard deviation = 1).
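A minimal sketch using StandardScaler on the ‘median_income’ column (again, the column choice is just for illustration):

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# Standardize: z = (x - mean) / standard_deviation
df[["median_income"]] = scaler.fit_transform(df[["median_income"]])

print(df["median_income"].mean())  # approximately 0
print(df["median_income"].std())   # approximately 1
```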
Imputing Missing Values:
Missing data are values that are not recorded in a dataset. They can be a single value missing in a single cell or an entire missing observation (row). Missing data can occur in both continuous variables (e.g. height of students) and categorical variables (e.g. gender of a population).
We can handle missing values in two ways: (1) remove the rows that contain missing values, or (2) fill in the values using some strategy, e.g. with an imputer.
Simple Imputer:
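In this dataset, the ‘total_bedrooms’ column contains missing values, so a minimal sketch with SimpleImputer could look like this (the ‘mean’ strategy is just one choice; ‘median’ or ‘most_frequent’ work the same way):

```python
import numpy as np
from sklearn.impute import SimpleImputer

print(df["total_bedrooms"].isna().sum())  # number of missing values before imputing

# Replace missing values with the column mean
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
df[["total_bedrooms"]] = imputer.fit_transform(df[["total_bedrooms"]])

print(df["total_bedrooms"].isna().sum())  # 0 after imputing
```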
Discretization:
Data discretization is the process of converting continuous data into discrete buckets by grouping values together. By doing this, we limit the number of possible states. Basically, we convert numerical features into categorical ones.
There are 3 types of discretization transforms available in scikit-learn: (1) Quantile Discretization Transform, (2) Uniform Discretization Transform, and (3) KMeans Discretization Transform.
Quantile Discretization Transform:
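The quantile transform places roughly the same number of samples in each bin, so the bin widths adapt to the data distribution. A minimal sketch using scikit-learn's KBinsDiscretizer (the bin count, the new column name, and the column choice are illustrative assumptions):

```python
from sklearn.preprocessing import KBinsDiscretizer

# Equal-frequency bins: each of the 10 bins receives roughly the same number of samples
kbins = KBinsDiscretizer(n_bins=10, encode="ordinal", strategy="quantile")
df["median_income_binned"] = kbins.fit_transform(df[["median_income"]]).ravel()

print(df["median_income_binned"].value_counts())  # counts per bin are roughly equal
```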
Uniform Discretization Transform:
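The uniform transform splits the value range into bins of equal width, regardless of how many samples fall into each bin; with the sketch above, you would simply pass strategy="uniform" to KBinsDiscretizer.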
KMeans Discretization Transform:
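The KMeans transform assigns values to bins by running a one-dimensional k-means clustering on the feature, so each bin is centered on a cluster. A minimal sketch comparing the three strategies on the same column (bin count and column choice are again illustrative):

```python
from sklearn.preprocessing import KBinsDiscretizer

# Compare the three strategies; bin_edges_ shows where each strategy places its cut points
for strategy in ["quantile", "uniform", "kmeans"]:
    kbins = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy=strategy)
    binned = kbins.fit_transform(df[["median_income"]]).ravel()
    print(strategy, "bin edges:", kbins.bin_edges_[0].round(2))
```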