Building an ML Model - Part 2
- Aryan Tah
- Sep 10, 2024
- 4 min read

Before we can use data to build machine learning models, we need to clean it and put it into a form the models can work with efficiently.
Data preprocessing is the process of transforming raw data into a clean and usable format for machine learning models. It involves various steps such as handling missing values, encoding categorical data, splitting the data into training and test sets, and scaling features to ensure that the data is properly structured for analysis. Proper data preprocessing helps improve the model's accuracy and overall performance.
1. Importing the Libraries and Dataset
The first step is to import the necessary Python libraries and load the dataset. The dataset we are working with comes from a retail company analyzing which clients purchased one of its products.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
dataset = pd.read_csv('Data.csv') # Reads the CSV file into a pandas DataFrame
# Features and dependent variable
x = dataset.iloc[:, :-1].values # Matrix of features (all columns except the last one)
y = dataset.iloc[:, -1].values # Vector of dependent variable (last column)
Dataset Preview:
Retail company records showing each client's Country, Age, and Salary, along with whether they purchased the product. Note the two missing cells: the Salary in row 5 and the Age in row 7.
Country | Age | Salary | Purchased |
France | 44 | 72000 | No |
Spain | 27 | 48000 | Yes |
Germany | 30 | 54000 | No |
Spain | 38 | 61000 | No |
Germany | 40 | | Yes |
France | 35 | 58000 | Yes |
Spain | | 52000 | No |
France | 48 | 79000 | Yes |
Germany | 50 | 83000 | No |
France | 37 | 67000 | Yes |
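A quick sanity check after loading (a minimal sketch, assuming the preview above is the contents of Data.csv):
print(x.shape) # (10, 3) -> 10 clients, three feature columns: Country, Age, Salary
print(y.shape) # (10,) -> the Purchased column
The two empty cells in the preview show up as np.nan in x, which we deal with next.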
2. Handling Missing Data
Missing data can skew the model, so it’s important to handle it appropriately. Here, we replace missing values with the mean of the column.
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(x[:, 1:3]) # Applies to 'Age' and 'Salary' columns
x[:, 1:3] = imputer.transform(x[:, 1:3]) # Replaces missing values
imputer.transform returns the array with each missing value replaced by the mean of the other values in its column. The missing Age in row 7 is replaced by the mean of all the other ages, and the missing Salary in row 5 is replaced by the mean of the other salaries.
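To see the arithmetic, here is a minimal sketch that computes the two replacement values by hand, using the nine non-missing entries from the preview table:
ages = np.array([44, 27, 30, 38, 40, 35, 48, 50, 37]) # Every age except the missing one in row 7
salaries = np.array([72000, 48000, 54000, 61000, 58000, 52000, 79000, 83000, 67000]) # Every salary except the missing one in row 5
print(ages.mean()) # 38.77... -> fills the missing Age
print(salaries.mean()) # 63777.77... -> fills the missing Salary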
3. Encoding Categorical Data
Most machine learning models require numerical input, so we need to encode categorical variables as numbers. We use One Hot Encoding for the 'Country' column and Label Encoding for the 'Purchased' column (Yes/No).
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
# Encode 'Country' with One Hot Encoding
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
x = np.array(ct.fit_transform(x))
# Encode 'Purchased' with Label Encoding
le = LabelEncoder()
y = le.fit_transform(y) # Yes=1, No=0
What is One Hot Encoding and why is it used?
One-hot encoding is a method of representing categorical data by turning each category into a binary vector. For example, if there are three categories, one-hot encoding creates three columns: the first category is represented as 100, the second as 010, and the third as 001. This approach ensures that each category is distinct and avoids implying any numerical order or ranking between them, which would happen if you assigned them simple numeric values like 1, 2, or 3.
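Here is a minimal standalone sketch of the idea on just the three country names (note that OneHotEncoder sorts its columns alphabetically, and that the sparse_output parameter is called sparse in scikit-learn versions before 1.2):
from sklearn.preprocessing import OneHotEncoder
countries = np.array([['France'], ['Spain'], ['Germany']])
encoder = OneHotEncoder(sparse_output=False) # Return a dense array instead of a sparse matrix
print(encoder.fit_transform(countries))
# [[1. 0. 0.] -> France
#  [0. 0. 1.] -> Spain
#  [0. 1. 0.]] -> Germany (column order is alphabetical: France, Germany, Spain)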
4. Splitting the Dataset into Training and Test Sets
Splitting the dataset is a crucial step to evaluate the performance of your model. Typically, 80% of the data is used for training, and 20% is set aside for testing.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1)
Why Split the Data?
The training set is used to build the model, while the test set evaluates the model’s ability to make predictions. Since the model has no prior access to the test set, it allows us to see how well the model generalizes to unseen data.
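With our 10-row dataset, the 80/20 split leaves 8 rows for training and 2 for testing, which we can confirm directly (a quick check, assuming the encoding steps above have already run):
print(x_train.shape) # (8, 5) -> 8 rows; 5 columns (3 one-hot country columns + Age + Salary)
print(x_test.shape) # (2, 5) -> the 2 held-out rows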
5. Feature Scaling
Feature scaling is a technique used in data preprocessing to ensure that different features (columns) have values on a similar scale. This is important because many machine learning algorithms rely on distance calculations, and features with larger values can dominate the model, leading to biased results.
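Standardization, the method applied below via StandardScaler, subtracts each column's mean and divides by its standard deviation: x' = (x - mean) / std. A minimal sketch of the formula, using the nine known salaries from the preview table:
salaries = np.array([72000., 48000., 54000., 61000., 58000., 52000., 79000., 83000., 67000.])
scaled = (salaries - salaries.mean()) / salaries.std() # x' = (x - mean) / std
print(scaled) # Values are now centered around 0, all within about +/-2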

Applying Feature Scaling After Splitting:
Feature scaling is applied after splitting the data into training and test sets to prevent data leakage. We fit the scaler to the training data and apply it to the test set using the same scaling factors.
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x_train[:, 3:] = sc.fit_transform(x_train[:, 3:]) # Fit the scaler on the training set's Age and Salary columns (the one-hot columns are left unscaled)
x_test[:, 3:] = sc.transform(x_test[:, 3:]) # Scale the test set with the mean and standard deviation learned from the training set
We only transform the test set using the same scaler fitted on the training set to avoid information leakage from the test data.
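To verify where those scaling factors come from, StandardScaler exposes the statistics it learned during fitting (a quick sanity check, assuming sc was fitted as above):
print(sc.mean_) # Per-column means learned from x_train only
print(sc.scale_) # Per-column standard deviations learned from x_train only
The test set is shifted and divided by these training-set values, never by its own statistics.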
Conclusion
Data preprocessing is a critical step in the machine learning pipeline. It ensures that the data is clean, consistent, and ready for model training. By handling missing values, encoding categorical data, splitting the dataset, and applying feature scaling, you can significantly improve the accuracy and performance of your machine learning models. In the next part of this series, we'll begin with linear regression, so stay tuned!