Different Encoding Methods for Your Dataset

krish - Jul 16 - Dev Community

Hey there, data enthusiasts! 🎀

In the exciting world of data science and machine learning, one of the first and most crucial steps is turning raw data into a format that our models can understand and learn from. This process, called data preprocessing, involves several important steps:

  1. Data Cleaning: Remove noise and inconsistent data. Say a feature has 80% null values: would you still keep it? What about 20% null values? Those can often be filled with statistics, such as the mean for numerical features or the mode for categorical ones.
  2. Data Integration: Combine multiple data sources for better predictions, e.g. combining a driver's medical record with race and season data to predict their position in an F1 race. The health data alone wouldn't help much, but using it to weight the previous race position can drastically increase its importance!
  3. Data Selection: Select the important and useful data. Try feature engineering to find the best features for your model.
  4. Data Transformation: Transform and consolidate the data for mining through encodings and feature engineering. I consider this the most important step before data mining, since without encoding, data mining on categorical data is useless and unhelpful.
  5. Data Mining: Apply intelligent methods to extract patterns, i.e. implicit, previously unknown, and potentially useful information, e.g. combining the race year and a driver's date of birth into an age feature, adding new insight while removing two columns from the model (see the sketch after this list).
  6. Pattern Evaluation: Identify the truly interesting patterns using various evaluation metrics.
  7. Knowledge Presentation: Create visualizations and statistics such as charts and heatmaps. Understand your data and iterate on the steps above wherever needed.
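
As a small taste of step 5, here's a minimal sketch (with hypothetical column names) of deriving a driver's age from the race year and date of birth, replacing two raw columns with one informative feature:

import pandas as pd

# Hypothetical F1 data: race year and each driver's date of birth
df = pd.DataFrame({
    "race_year": [2021, 2023],
    "driver_dob": ["1997-01-01", "1985-06-15"]
})

# Derive the driver's age at race time from the two raw columns
df["driver_age"] = df["race_year"] - pd.to_datetime(df["driver_dob"]).dt.year

# The raw columns can then be dropped in favour of the new feature
df = df.drop(columns=["race_year", "driver_dob"])
print(df)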

Central to this preprocessing is the task of encoding. This blog delves into the various encoding methodologies, providing a comprehensive analysis of them.

Importance of Encoding

Encoding is a crucial step in the data preprocessing pipeline, especially when dealing with categorical data. Categorical variables, which represent data that can be divided into specific groups or categories, often need to be converted into a numerical format for machine learning algorithms to process them effectively. This conversion process is known as encoding. Machine learning models typically require numerical input because they are based on mathematical calculations that cannot interpret categorical data directly. By transforming categorical data into numerical values through various encoding techniques, we can ensure that our models can leverage all available information, leading to better performance and more accurate predictions. Encoding not only makes data suitable for analysis but also helps preserve the relationships and characteristics inherent in the original categorical variables.

Prerequisites

No sane person codes on paper; he who codes on paper has mastered the essence of coding, or the truth behind the universe itself. - ME🎀

Install the following required Python libraries:

pip install scikit-learn pandas category_encoders

Different datasets require different encoding methods, so a different example dataset is used for each method below.

Types of Encoding

While there are hundreds of encoding methods, we will focus on the most important and widely used ones.

  1. Multi-Hot Encoding
  2. Label Encoding
  3. Ordinal Encoding
  4. Binary Encoding
  5. Target Encoding
  6. Frequency Encoding

Multi-Hot Encoding

This method converts categorical data into binary vectors. Each sample's categories are mapped to a binary vector whose length equals the number of distinct categories, with a 1 for every category the sample has. It is especially useful when a sample can belong to several categories at once, and is commonly used in classification models.

Example: Imagine you have a dataset of music tracks.

| Name | Artist | Genre |
| --- | --- | --- |
| Fly Me to the Moon | The Macarons Project | ["slow", "acoustic", "pop"] |
| Mad at Disney | Salem ilese | ["dance", "pop"] |

Here, genre is the feature we need to encode, since feeding the model a raw list of genre names would be meaningless to it.

from sklearn.preprocessing import MultiLabelBinarizer
import pandas as pd

# Creating the dataframe with list of genres per song
df = pd.DataFrame({
    "name": ["Fly Me to the Moon", "Mad at Disney"],
    "artist": ["The Macarons Project", "Salem ilese"],
    "genre": [["slow", "acoustic", "pop"], ["dance", "pop"]]
})

# Using MultiLabelBinarizer to handle the list of genres
mlb = MultiLabelBinarizer()
x_encoded = mlb.fit_transform(df["genre"])

# Creating the encoded dataframe
encoded_df = pd.DataFrame(x_encoded, columns=mlb.classes_)

# Concatenating the original columns with the encoded genres
df_final = pd.concat([df.drop(columns=["genre"]), encoded_df], axis=1)
print(df_final)
                 name                artist  acoustic  dance  pop  slow
0  Fly Me to the Moon  The Macarons Project         1      0    1     1
1       Mad at Disney           Salem ilese         0      1    1     0

Each genre column is now 1 (HOT, i.e. present) or 0 (COLD, i.e. absent). One-Hot Encoding takes the same approach for single-valued features, though Binary Encoding or Label Encoding is often the better choice in those cases.
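
For single-valued features, one-hot encoding can be done in one line with pandas; a minimal sketch, assuming each track has exactly one genre:

import pandas as pd

# One column per genre: 1 where the track has it, 0 elsewhere
df = pd.DataFrame({"genre": ["pop", "dance", "pop"]})
print(pd.get_dummies(df, columns=["genre"], dtype=int))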

Label Encoding

This method converts each categorical value into a single number.

It is similar to multi-hot encoding in spirit, with one key difference: Label Encoding can inadvertently introduce ordinal relationships where none exist (e.g. implying red > green > blue), which can mislead some algorithms. Multi-hot encoding avoids this by treating each category independently.

Example: A company sells shirts of different sizes and colours at different prices.

| Colour | Size | Company | Price |
| --- | --- | --- | --- |
| red | L | Max | 300 |
| blue | S | ACM | 230 |
| red | XL | Zara | 568 |
| green | S | Gucci | 927 |

We need to encode all three categorical columns: Colour, Size, and Company. Label Encoding keeps each of them as a single compact numeric column.

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Creating the dataframe
df = pd.DataFrame({
    'Colour': ['red', 'blue', 'red', 'green'],
    'Size': ['L', 'S', 'XL', 'S'],
    'Company': ['Max', 'ACM', 'Zara', 'Gucci'],
    'Price': [300, 230, 568, 927]
})

# Label Encoding for 'Colour', 'Size', and 'Company'
# (each fit_transform call refits the same encoder, so keep one encoder per
# column if you later need inverse_transform)
label_encoder = LabelEncoder()
df['Colour_encoded'] = label_encoder.fit_transform(df['Colour'])
df['Size_encoded'] = label_encoder.fit_transform(df['Size'])
df['Company_encoded'] = label_encoder.fit_transform(df['Company'])

# Drop the original categorical columns after encoding
df_final = df.drop(columns=['Colour', 'Size', 'Company'])
print(df_final)
   Price  Colour_encoded  Size_encoded  Company_encoded
0    300               2             0                2
1    230               0             1                0
2    568               2             2                3
3    927               1             1                1

By default, the numerical values are assigned by sorting the categories (alphabetically here). If we want to impose our own order of preference instead, we should look into Ordinal Encoding.
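
To see exactly which number a fitted LabelEncoder gave each category, inspect its classes_ attribute (the position of each class is its code); a quick sketch with the colours from above:

from sklearn.preprocessing import LabelEncoder

enc = LabelEncoder().fit(['red', 'blue', 'red', 'green'])
print(enc.classes_)                # ['blue' 'green' 'red'] -> codes 0, 1, 2
print(enc.transform(['green']))    # [1]
print(enc.inverse_transform([2]))  # ['red']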

Ordinal Encoding

Similar to Label Encoding, with the one difference that we ourselves provide a specific order of importance for the categories (unlike how the label encoder simply sorts all categories to number them).

Example: Take the Label Encoding example. The companies deserve an explicit order, since we know brands like Gucci or Zara sell T-shirts at higher prices.

| Colour | Size | Company | Price |
| --- | --- | --- | --- |
| red | L | Max | 300 |
| blue | S | ACM | 230 |
| red | XL | Zara | 568 |
| green | S | Gucci | 927 |

Let's use ["ACM", "Max", "Zara", "Gucci"] as our order of cheap to expensive T-shirts.

import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Creating the dataframe
df = pd.DataFrame({
    'Colour': ['red', 'blue', 'red', 'green'],
    'Size': ['L', 'S', 'XL', 'S'],
    'Company': ['Max', 'ACM', 'Zara', 'Gucci'],
    'Price': [300, 230, 568, 927]
})

# Label Encoding for 'Colour' and 'Size'
label_encoder_colour = LabelEncoder()
label_encoder_size = LabelEncoder()

df['Colour_encoded'] = label_encoder_colour.fit_transform(df['Colour'])
df['Size_encoded'] = label_encoder_size.fit_transform(df['Size'])

# Ordinal Encoding for 'Company' with the specified order (cheapest to most expensive)
company_order = ["ACM", "Max", "Zara", "Gucci"]
ordinal_encoder = OrdinalEncoder(categories=[company_order])

df['Company_encoded'] = ordinal_encoder.fit_transform(df[['Company']])

# Drop the original categorical columns after encoding
df_final = df.drop(columns=['Colour', 'Size', 'Company'])
print(df_final)
   Price  Colour_encoded  Size_encoded  Company_encoded
0    300               2             0              1.0
1    230               0             1              0.0
2    568               2             2              2.0
3    927               1             1              3.0

This bakes our domain knowledge about the brands' price tiers directly into the feature, biasing the model in a deliberate, useful way.
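
One practical note: a company name unseen during fitting would make transform raise an error at prediction time. Recent scikit-learn versions (0.24+) let OrdinalEncoder map unknown categories to a sentinel value instead; a short sketch:

from sklearn.preprocessing import OrdinalEncoder

# Map categories unseen during fit to -1 instead of raising an error
enc = OrdinalEncoder(categories=[["ACM", "Max", "Zara", "Gucci"]],
                     handle_unknown="use_encoded_value", unknown_value=-1)
enc.fit([["ACM"], ["Max"], ["Zara"], ["Gucci"]])
print(enc.transform([["H&M"]]))  # [[-1.]]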

Binary Encoding

This method converts each categorical value into binary digits (0s and 1s) and then stores them as separate columns. It is useful when you have many categories to encode and want to reduce dimensionality compared to multi-hot encoding.

Each category is first given an integer code, whose binary representation is then split into separate columns. This needs only about ceil(log2(N)) columns, whereas multi-hot encoding would need N columns (e.g. 7 columns instead of 100 for 100 categories).
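
To make the mechanics concrete, here is a hand-rolled sketch (mirroring how category_encoders assigns ordinals, by order of first appearance and starting at 1, judging by its output below):

import math
import pandas as pd

colours = ["Red", "Green", "Blue", "Red"]

# Assign ordinals 1..N in order of first appearance
categories = list(dict.fromkeys(colours))
ordinals = {c: i + 1 for i, c in enumerate(categories)}
n_bits = math.ceil(math.log2(len(categories) + 1))

# Spread each ordinal's binary digits across n_bits columns
rows = [[int(b) for b in format(ordinals[c], f"0{n_bits}b")] for c in colours]
print(pd.DataFrame(rows, columns=[f"Colour_{i}" for i in range(n_bits)]))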

Example: Encoding just the Colours into something suitable.

| Colour |
| --- |
| Red |
| Green |
| Blue |
| Red |

import pandas as pd
from category_encoders import BinaryEncoder

# Sample data
data = pd.DataFrame({'Colour': ['Red', 'Green', 'Blue', 'Red']})

# Create a BinaryEncoder object
encoder = BinaryEncoder(cols=['Colour'])

# Encode the categorical feature
encoded_data = encoder.fit_transform(data)
print(encoded_data)
   Colour_0  Colour_1
0         0         1
1         1         0
2         1         1
3         0         1

When there are only a few categories, though, multi-hot or label encoding is usually the better choice.

Target Encoding

Also known as Mean Encoding or Likelihood Encoding. This method encodes categorical values by replacing each category with a statistic of the target variable within that category (typically the mean).

Highly recommended for handling high-cardinality categorical variables: it captures the relationship between the categorical variable and the target more directly than one-hot encoding, without exploding the number of columns.

Formula:

\text{Encoding Value} = \frac{n \times \text{Category Mean} + m \times \text{Global Mean}}{n + m}

where:

  • n: number of samples in the category.
  • m: smoothing parameter (the weight given to the global mean; larger values pull rare categories toward it).
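
Here is a hand-rolled sketch of the smoothed formula (m = 2 is a hypothetical smoothing weight; a larger m pulls rare categories harder toward the global mean). Libraries such as category_encoders also ship a TargetEncoder that handles this for you.

import pandas as pd

df = pd.DataFrame({
    "Neighborhood": ["Downtown", "Suburb", "City Center", "Suburb", "Downtown"],
    "Price": [500000, 350000, 700000, 450000, 600000]
})

m = 2  # hypothetical smoothing weight
global_mean = df["Price"].mean()
stats = df.groupby("Neighborhood")["Price"].agg(["count", "mean"])

# Blend each category's mean with the global mean, weighted by sample count
smoothed = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)
df["Neighborhood_encoded"] = df["Neighborhood"].map(smoothed)
print(df)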

Example: In a house price prediction model, encoding neighborhood names with the mean house price in each area provides more insight than plain label encoding.

| House Number | Price | Neighborhood | Size (sq meter) |
| --- | --- | --- | --- |
| 1 | 500000 | Downtown | 200 |
| 2 | 350000 | Suburb | 150 |
| 3 | 700000 | City Center | 300 |
| 4 | 450000 | Suburb | 180 |
| 5 | 600000 | Downtown | 250 |

import pandas as pd

# Original dataset
data = {
    'House Number': [1, 2, 3, 4, 5],
    'Price': [500000, 350000, 700000, 450000, 600000],
    'Neighborhood': ['Downtown', 'Suburb', 'City Center', 'Suburb', 'Downtown'],
    'Size (sq meter)': [200, 150, 300, 180, 250]
}

df = pd.DataFrame(data)

# Calculate the mean price for each neighborhood (plain means, i.e. the formula with m = 0)
neighborhood_means = df.groupby('Neighborhood')['Price'].mean().to_dict()

# Map mean prices back to the original dataset
df['Neighborhood'] = df['Neighborhood'].map(neighborhood_means)
# Display the encoded dataset
print(df)
   House Number   Price  Neighborhood  Size (sq meter)
0             1  500000      550000.0              200
1             2  350000      400000.0              150
2             3  700000      700000.0              300
3             4  450000      400000.0              180
4             5  600000      550000.0              250
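
One caveat: since the encoding is computed from the target itself, fitting it on the same rows you train on leaks target information. A common mitigation is to compute the encoding out-of-fold; a minimal sketch using scikit-learn's KFold (categories absent from a fold fall back to the global mean):

import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

df = pd.DataFrame({
    "Neighborhood": ["Downtown", "Suburb", "City Center", "Suburb", "Downtown"],
    "Price": [500000, 350000, 700000, 450000, 600000]
})

# Encode each row using only the other fold's rows, avoiding self-leakage
df["Neighborhood_encoded"] = np.nan
kf = KFold(n_splits=2, shuffle=True, random_state=0)
for train_idx, val_idx in kf.split(df):
    fold_means = df.iloc[train_idx].groupby("Neighborhood")["Price"].mean()
    df.iloc[val_idx, df.columns.get_loc("Neighborhood_encoded")] = (
        df.iloc[val_idx]["Neighborhood"].map(fold_means).to_numpy()
    )

# Neighborhoods missing from a training fold come out as NaN
df["Neighborhood_encoded"] = df["Neighborhood_encoded"].fillna(df["Price"].mean())
print(df)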

Frequency Encoding

This method replaces each categorical value with its frequency or count within the training dataset.

Formula:

\text{Frequency}(\text{category}) = \frac{\text{Count}(\text{category})}{\text{Total observations}}

Example: Encoding cities based on the number of times each city appears in the dataset.

| Transaction ID | Amount | City | Product Category |
| --- | --- | --- | --- |
| 1 | 100 | New York | Electronics |
| 2 | 200 | Los Angeles | Clothing |
| 3 | 150 | Chicago | Electronics |
| 4 | 300 | New York | Groceries |
| 5 | 250 | Chicago | Clothing |

import pandas as pd

# Example dataset with customer transactions
data = {
    'Transaction ID': [1, 2, 3, 4, 5],
    'Amount': [100, 200, 150, 300, 250],
    'City': ['New York', 'Los Angeles', 'Chicago', 'New York', 'Chicago'],
    'Product Category': ['Electronics', 'Clothing', 'Electronics', 'Groceries', 'Clothing']
}

df = pd.DataFrame(data)

# (Optional) Peek at the rows for one city before encoding
selected_city = 'New York'
print(f"Data for transactions in {selected_city}:")
print(df[df['City'] == selected_city])

# Applying frequency encoding to 'City'
city_frequency = df['City'].value_counts(normalize=True)
df['City'] = df['City'].map(city_frequency)
print(df)
   Transaction ID  Amount  City Product Category
0               1     100   0.4      Electronics
1               2     200   0.2         Clothing
2               3     150   0.4      Electronics
3               4     300   0.4        Groceries
4               5     250   0.4         Clothing
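
Note that the frequencies should be learned on the training data only and merely looked up for new data, with a fallback for unseen categories; a small sketch (Boston is a hypothetical unseen city):

import pandas as pd

train = pd.DataFrame({'City': ['New York', 'Los Angeles', 'Chicago',
                               'New York', 'Chicago']})
new = pd.DataFrame({'City': ['Chicago', 'Boston']})

# Learn frequencies on the training data only
city_frequency = train['City'].value_counts(normalize=True)

# Look them up for new data; unseen categories fall back to 0
new['City_encoded'] = new['City'].map(city_frequency).fillna(0)
print(new)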

Conclusion

With this, the most important and widely used encoding methods are covered! Choosing the right encoding method can significantly impact the performance of your machine learning models.
