Building an ML Model - Part 3
- Aryan Tah
- Dec 30, 2024
- 4 min read

Introduction to Regression
Regression analysis is a powerful tool for investigating the relationship between dependent and independent variables. In machine learning, regression models are used to predict continuous outcomes. In this post, we will cover Simple Linear Regression and Multiple Linear Regression—two common regression techniques used to build predictive models.

Simple Linear Regression
Simple Linear Regression is a statistical method used to model the relationship between two variables: one independent (predictor) variable and one dependent (response) variable. The objective is to find a straight line that best fits the data, which allows us to predict the dependent variable based on the independent variable.
If the data contains outliers, we should consider whether to build the linear regression model with them included or after removing them from the dataset. This decision depends largely on domain knowledge about the data.
Formula
y = b0 + b1x
Here y is the dependent (response) variable, x is the independent (predictor) variable, b0 is the intercept, and b1 is the slope of the line. For example, if b0 = 30,000 and b1 = 9,000, a person with 5 years of experience would be predicted to earn 30,000 + 9,000 × 5 = 75,000.
Example: Predicting Salary Based on Experience
Let's walk through building a simple linear regression model to predict a person's salary based on their years of experience.
1) Preprocessing the Data:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
dataset = pd.read_csv('Salary_Data.csv')
X = dataset.iloc[:, :-1].values # all columns except the last (years of experience)
y = dataset.iloc[:, -1].values # the last column (salary)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
2) Training the Simple Linear Regression model on the Training set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train) # The fit method trains the linear regression model on the training data, learning the relationship between the feature (years of experience) and the dependent variable (salary).
3) Predicting the Test set results
y_pred = regressor.predict(X_test) # produces an array of predictions made for the Test set
4) Visualizing the Training set results
plt.scatter(X_train, y_train, color='red') # Plotting the actual training data points
plt.plot(X_train, regressor.predict(X_train), color='blue') # Plotting the regression line
plt.title('Salary vs Experience (Training Set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.show()
The resulting graph plots the training data in red, with the blue line representing the salaries predicted from years of experience.
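To connect the fitted line back to the formula above, we can also inspect the values the model learned. This is a small extra check, not part of the original walkthrough, using scikit-learn's fitted attributes:
print(regressor.intercept_) # b0: the predicted salary at 0 years of experience
print(regressor.coef_) # b1: the predicted salary increase per additional year of experience
print(regressor.predict([[5]])) # predicted salary for someone with 5 years of experience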

Multiple Linear Regression
Multiple Linear Regression is an extension of simple linear regression where more than one independent variable is used to predict the dependent variable. The general formula is:
y = b0 + b1X1 + b2X2 + ... + bnXn
In this formula, the X's represent the different predictor variables and the b's are their coefficients. This method is particularly useful when a single predictor is insufficient to model the complexity of the system being studied.
Example: Predicting Startup Profits
Let’s say we want to predict the profits of 50 startup companies based on their R&D, administration, and marketing expenditures, along with the state in which each startup operates (a categorical variable).
Preprocessing:
dataset = pd.read_csv('50_Startups.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [3])], remainder='passthrough') # one-hot encode the categorical state column (index 3) and pass the numeric columns through unchanged
X = np.array(ct.fit_transform(X))
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
Training the Model:
regressor = LinearRegression()
regressor.fit(X_train, y_train)
Predicting Test Results:
y_pred = regressor.predict(X_test)
Displaying Results:
np.set_printoptions(precision=2)
print(np.concatenate((y_pred.reshape(len(y_pred), 1), y_test.reshape(len(y_test), 1)), axis=1)) # predicted profits (left column) alongside actual profits (right column)
Since we have multiple predictors, the results cannot be visualized in a simple 2D plot; later in this series we will explore three-dimensional plotting with matplotlib.
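In the meantime, a numeric measure of fit can play a similar role. As a small sketch (not part of the original code), scikit-learn's r2_score summarizes how much of the variance in the test-set profits the model explains:
from sklearn.metrics import r2_score
print(r2_score(y_test, y_pred)) # values closer to 1.0 mean the predictions track the actual profits more closely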
Model Selection Techniques
There are several techniques to build a robust multiple regression model. Each technique has its own advantages and considerations, making them suitable for different scenarios and datasets. Below is an expanded overview of these techniques:
All-in: This technique involves including all available predictors in the model from the outset. It is often based on prior knowledge or theoretical considerations about the relationships between variables. While this approach can capture complex interactions, it may also lead to overfitting, especially if the number of predictors is large relative to the number of observations.
Backward Elimination: This method begins with a full model that includes all predictors. The model is then refined by iteratively removing the least significant predictors based on their p-values. This process continues until all remaining predictors are statistically significant. Backward elimination is useful for simplifying models but can be sensitive to the choice of significance level and may miss predictors that are significant only in the presence of others (a code sketch of this procedure appears after this list).
Forward Selection: In contrast to backward elimination, forward selection starts with no predictors in the model. At each step, the most significant predictor is added based on criteria such as p-values or adjusted R-squared. This method is useful for building a model incrementally but may overlook important predictors that only show significance when combined with others.
Bidirectional Elimination: This technique combines both forward selection and backward elimination. It starts with no predictors, adds significant ones, and then removes any that become insignificant as new predictors are added. This iterative process continues until no further changes can be made. Bidirectional elimination can be more robust than either method alone but is also more computationally intensive.
All Possible Models: This exhaustive approach involves testing every possible combination of predictors to identify the best model based on a chosen criterion (e.g., AIC, BIC, adjusted R-squared). While it provides a comprehensive view of potential models, it is computationally expensive and may not be feasible for datasets with a large number of predictors due to the combinatorial explosion of possible models.
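To make backward elimination concrete, here is a minimal sketch using the statsmodels library. The helper name backward_elimination is made up for illustration, and the routine is a simplified version of the idea rather than the exact procedure used later in this series:
import numpy as np
import statsmodels.api as sm
def backward_elimination(X, y, significance_level=0.05):
    # Start with every predictor plus an intercept column, then repeatedly
    # refit and drop the predictor with the highest p-value until all are significant.
    X_opt = sm.add_constant(X.astype(float))
    cols = list(range(X_opt.shape[1]))
    while True:
        results = sm.OLS(y, X_opt[:, cols]).fit()
        pvalues = np.asarray(results.pvalues)
        worst = int(np.argmax(pvalues))
        if pvalues[worst] > significance_level:
            del cols[worst]  # remove the least significant predictor and refit
        else:
            return results, cols  # every remaining predictor is significant
model, kept_columns = backward_elimination(X[:, 1:], y)  # dropping one dummy column first avoids collinearity with the intercept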
Choosing the right model selection technique depends on the specific context of the analysis, including the size of the dataset, the number of predictors, and the underlying theory guiding the model-building process. It is often beneficial to complement these techniques with cross-validation to assess model performance and avoid overfitting.
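As a rough sketch of the cross-validation idea (again using scikit-learn, and not part of the original walkthrough), cross_val_score refits the model on several train/validation splits and reports the fit on each:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring='r2') # 5-fold cross-validation
print(scores.mean(), scores.std()) # average fit and how much it varies across folds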
Conclusion
In this post, we explored two essential types of linear regression: simple and multiple. These models form the foundation of many more advanced machine learning techniques. Simple linear regression is ideal for problems with one predictor variable, while multiple linear regression accommodates more complex relationships between multiple predictors and a dependent variable.
Stay tuned for the next part of the series where we delve deeper into the nuances of regression analysis and explore more sophisticated models.