Detailed explanation of Linear Regression

 

1.1 Introduction

In a data set we can characterize features, or variables, as either quantitative or qualitative (also known as categorical). Quantitative variables take numerical values, like a person's weight or the temperature of a city; qualitative variables take values in one of n different classes or categories, like gender (male or female) or different blog categories (technical, cooking, fashion, etc.). We refer to problems with a quantitative response as regression problems. The response variable here is also called the target or dependent variable, and the independent variables are the predictors.

In simple linear regression we model the response Y as an approximately linear function of a single predictor X:

Y ≈ β0 + β1X

(read "≈" as "is approximately modeled as"). Given estimates ˆβ0 and ˆβ1 of the coefficients, we predict ŷ = ˆβ0 + ˆβ1x. The least-squares estimates are the values that minimize the residual sum of squares, RSS = Σᵢ(yᵢ − ŷᵢ)², which gives

ˆβ1 = Σᵢ(xᵢ − x̄)(yᵢ − ȳ) / Σᵢ(xᵢ − x̄)²,  ˆβ0 = ȳ − ˆβ1x̄

where x̄ and ȳ are the sample means.
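To make the formulas concrete, here is a minimal NumPy sketch on made-up data (the data and the seed are illustrative, not from this article); the closed-form estimates should roughly recover the slope and intercept used to generate the data.

import numpy as np

# hypothetical toy data: y is roughly 2 + 3x plus noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2 + 3 * x + rng.normal(0, 1, size=50)

# closed-form least-squares estimates for simple linear regression
beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0_hat = y.mean() - beta1_hat * x.mean()
print(beta0_hat, beta1_hat)  # should land near 2 and 3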

1.2 Validation of Estimated Coefficients

1. Standard errors associated with ˆβ0 and ˆβ1:

SE(ˆβ0)² = σ² [ 1/n + x̄² / Σᵢ(xᵢ − x̄)² ],  SE(ˆβ1)² = σ² / Σᵢ(xᵢ − x̄)²

where σ² is the variance of the error term. In practice σ is unknown and is estimated by the residual standard error, RSE = √(RSS/(n − 2)).
2. These standard errors yield approximate 95% confidence intervals such as ˆβ1 ± 2·SE(ˆβ1).
3. They also drive the hypothesis test of H0: β1 = 0 (no relationship between X and Y) through the t-statistic t = ˆβ1/SE(ˆβ1); a large |t| (equivalently, a small p-value) lets us reject H0.
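We rarely compute these standard errors by hand. As a sketch, statsmodels (which the tutorial below already uses for VIF scores) reports the coefficient standard errors, t-statistics, and p-values in its OLS summary; this reuses the toy x and y from the sketch above.

import statsmodels.api as sm

X_sm = sm.add_constant(x)  # x from the sketch above; add_constant appends the intercept column
ols_fit = sm.OLS(y, X_sm).fit()
print(ols_fit.summary())   # shows coef, std err, t, P>|t| and confidence intervals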

1.3 Assessing the Model Using Metrics

The metrics above tell us how accurate the coefficient estimates are. Once we have a model with updated coefficients and features, we still need to evaluate how well it fits the data. The fit of a linear regression is generally assessed with two metrics:
1. The Residual Standard Error (RSE), which can be described in several equivalent ways:
— the RSE is an estimate of the standard deviation of the error term;
— the average amount by which the response deviates from the true regression line; or
— a measure of the lack of fit of the model.
2. The R-squared statistic:
— Because the RSE is measured in the units of Y, it is never obvious what counts as a good RSE. R-squared instead measures the proportion of variability in Y that can be explained using X, and unlike the RSE it always lies between 0 and 1.
— The formula for R-squared is:

R² = (TSS − RSS)/TSS = 1 − RSS/TSS

where TSS = Σᵢ(yᵢ − ȳ)² is the total sum of squares.
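To make the formula concrete, here is a small sketch computing R² directly from RSS and TSS, reusing the fitted model from the statsmodels sketch above; the result should match sklearn's r2_score.

y_hat = ols_fit.predict(X_sm)      # fitted values from the statsmodels sketch above
rss = np.sum((y - y_hat) ** 2)     # residual sum of squares
tss = np.sum((y - y.mean()) ** 2)  # total sum of squares
print(1 - rss / tss)               # should match sklearn's r2_score(y, y_hat)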

1.4 Assumptions in Linear Regression

The regression has five key assumptions:
1. A linear relationship between the predictors and the response.
2. No or little multicollinearity among the predictors.
3. Independence of the errors (no autocorrelation).
4. Homoscedasticity: the errors have constant variance across the range of the predictors.
5. Normality of the errors.
A quick visual check of several of these assumptions is sketched below.
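This sketch reuses y and the fitted values y_hat from the snippets above; the plots are illustrative, not from the article's dataset. A funnel shape in the left panel hints at heteroscedasticity, a curved pattern hints at non-linearity, and a skewed histogram hints at non-normal errors.

import matplotlib.pyplot as plt

residuals = y - y_hat              # y and y_hat from the sketches above
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.scatter(y_hat, residuals)      # look for a funnel shape or a curve
plt.axhline(0, color='red')
plt.xlabel('Fitted values')
plt.ylabel('Residuals')
plt.subplot(1, 2, 2)
plt.hist(residuals, bins=15)       # should look roughly bell-shaped if errors are normal
plt.xlabel('Residual')
plt.show()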

1.5 Feature Engineering

Since we talked about collinearity, a few points are worth noting. Collinearity between variables can be detected by plotting a correlation matrix; from each pair of highly correlated variables we eliminate the one that adds less value to the model. After eliminating it, recompute the correlation matrix and continue eliminating until the remaining variables are reasonably independent of each other, as sketched below.
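As a sketch of that elimination loop (the 0.9 threshold is an illustrative assumption, and drop_correlated is a hypothetical helper, not part of any library), one can drop one member of each highly correlated pair with pandas:

import numpy as np
import pandas as pd

def drop_correlated(df, threshold=0.9):
    # iteratively drop one column from each pair whose |correlation| exceeds threshold
    df = df.copy()
    while True:
        corr = df.corr().abs()
        # keep only the strict upper triangle so each pair is considered once
        upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
        if not (upper > threshold).any().any():
            return df
        # drop the column involved in the strongest remaining correlation
        worst_col = upper.max().idxmax()
        df = df.drop(columns=worst_col)

# e.g. X_reduced = drop_correlated(X) on the feature DataFrame from the tutorial below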

1.6 Stochastic Gradient Descent (SGD)

We can measure how well the model fits with the Mean Squared Error (MSE), the mean of the squared differences between the actual and predicted values:

MSE = (1/n) Σᵢ (yᵢ − ŷᵢ)²

Gradient descent minimizes the MSE by repeatedly updating each coefficient in the direction of steepest descent, βⱼ := βⱼ − α·∂MSE/∂βⱼ, where α is the learning rate. In stochastic gradient descent the gradient at each step is computed on a single randomly chosen training example (or a small batch) rather than the full data set, which makes each update much cheaper.
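Here is a minimal NumPy sketch of SGD for simple linear regression; the learning rate and epoch count are illustrative assumptions, and each step updates the coefficients using the gradient of the squared error on a single randomly chosen example.

import numpy as np

def sgd_linear_regression(x, y, lr=0.001, epochs=200, seed=0):
    # fit y ≈ b0 + b1*x by stochastic gradient descent on the squared error
    rng = np.random.default_rng(seed)
    b0, b1 = 0.0, 0.0
    for _ in range(epochs):
        for i in rng.permutation(len(x)):   # visit examples in random order
            error = (b0 + b1 * x[i]) - y[i]
            b0 -= lr * 2 * error            # derivative of error**2 w.r.t. b0
            b1 -= lr * 2 * error * x[i]     # derivative of error**2 w.r.t. b1
    return b0, b1

# reusing the toy x and y from the first sketch, this should land near (2, 3)
# print(sgd_linear_regression(x, y))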

2. Python Tutorial on Linear Regression

Let's get into a practice session in Python using a beer consumption dataset that contains, for each day, the median, minimum, and maximum temperature, a rainfall measure, whether the day falls on a weekend, and the response variable: beer consumption in liters. All the dependencies sit in the top lines.

import numpy as np 
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error,r2_score
beer_data = pd.read_csv("beer_consumption_data.csv")  # read the CSV data into a DataFrame
beer_data.head(10)  # head(10) prints the top 10 rows of the data set
beer_data.columns = ["Date", "Temperature_Median", "Temperature_Min", "Temperature_Max", "Rainfall", "Weekend", "Consumption_litres"]
# the numeric columns use a comma as the decimal separator; convert them to floats
beer_data['Temperature_Median'] = beer_data['Temperature_Median'].str.replace(',', '.').astype('float')
beer_data['Temperature_Min'] = beer_data['Temperature_Min'].str.replace(',', '.').astype('float')
beer_data['Temperature_Max'] = beer_data['Temperature_Max'].str.replace(',', '.').astype('float')
beer_data['Rainfall'] = beer_data['Rainfall'].str.replace(',', '.').astype('float')
beer_data.info()  # info() shows the number of rows, the column dtypes, and the non-null count per column
# drop blank rows read in from the CSV; describe() shows summary statistics
beer_data = beer_data.dropna()
beer_data.describe()
X = beer_data.drop(columns=['Date', 'Consumption_litres'])
Y = beer_data['Consumption_litres']
plt.figure(figsize=(7,7))
sns.heatmap(X.corr())
plt.title("Correlation Heatmap")
plt.show()
vif = pd.DataFrame()  # collect the VIF scores in a DataFrame
vif['Features'] = X.columns
# variance_inflation_factor computes the score for each feature
vif['VIF Factor'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif
# Wrapping this in a function saves re-running the same lines each time. After
# checking the VIF scores, pass the column with the highest score as an argument;
# it is dropped from the DataFrame and the scores are recomputed.
def check_vif_drop_column(X, column_name):
    X = X.drop(columns=column_name)
    vif = pd.DataFrame()
    vif['Features'] = X.columns
    vif['VIF Factor'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    return vif, X
vif1, X = check_vif_drop_column(X, 'Temperature_Median')
vif1
vif2, X = check_vif_drop_column(X, 'Temperature_Min')
vif2
def split_train_data(X, Y):
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25)
    return X_train, X_test, Y_train, Y_test

X_train, X_test, Y_train, Y_test = split_train_data(X, Y)
# The arguments are the model class to train with and the training data. We can
# adapt this function to the problem statement and requirements (remember to
# change the arguments too :P)
def model_fit(model_class, X_train, Y_train):
    model = model_class()
    model.fit(X_train, Y_train)
    return model

lin_model = model_fit(LinearRegression, X_train, Y_train)
def scores_(model, X, Y):
    y_predict = model.predict(X)
    rmse = np.sqrt(mean_squared_error(Y, y_predict))
    r2 = r2_score(Y, y_predict)
    print('RMSE is {}'.format(rmse))
    print('R2 score is {}'.format(r2))

print("The model performance on the training set")
scores_(lin_model, X_train, Y_train)
print("--------------------------------------")
print("The model performance on the testing set")
scores_(lin_model, X_test, Y_test)
Both the training and testing splits report decent values of the R2 score.

3. Final Code

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
beer_data = pd.read_csv("beer_consumption_data.csv")
beer_data.columns = ["Date", "Temperature_Median", "Temperature_Min", "Temperature_Max", "Rainfall", "Weekend", "Consumption_litres"]
beer_data['Temperature_Median'] = beer_data['Temperature_Median'].str.replace(',', '.').astype('float')
beer_data['Temperature_Min'] = beer_data['Temperature_Min'].str.replace(',', '.').astype('float')
beer_data['Temperature_Max'] = beer_data['Temperature_Max'].str.replace(',', '.').astype('float')
beer_data['Rainfall'] = beer_data['Rainfall'].str.replace(',', '.').astype('float')
beer_data = beer_data.dropna()
X = beer_data.drop(columns=['Date', 'Consumption_litres'])
Y = beer_data['Consumption_litres']
plt.figure(figsize=(7,7))
sns.heatmap(X.corr())
plt.title("Correlation Heatmap")
plt.show()
vif = pd.DataFrame()  # collect the VIF scores in a DataFrame
vif['Features'] = X.columns
vif['VIF Factor'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
print(vif)
def check_vif_drop_column(X, column_name):
    X = X.drop(columns=column_name)
    vif = pd.DataFrame()
    vif['Features'] = X.columns
    vif['VIF Factor'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    return vif, X

vif1, X = check_vif_drop_column(X, 'Temperature_Median')
print(vif1)
vif2, X = check_vif_drop_column(X, 'Temperature_Min')
print(vif2)
# Modelling
def split_train_data(X, Y):
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25)
    return X_train, X_test, Y_train, Y_test

def model_fit(model_class, X_train, Y_train):
    lin_model = model_class()
    lin_model.fit(X_train, Y_train)
    return lin_model

def scores_(lin_model, X, Y):
    y_predict = lin_model.predict(X)
    rmse = np.sqrt(mean_squared_error(Y, y_predict))
    r2 = r2_score(Y, y_predict)
    print('RMSE is {}'.format(rmse))
    print('R2 score is {}'.format(r2))

X_train, X_test, Y_train, Y_test = split_train_data(X, Y)
lin_model = model_fit(LinearRegression, X_train, Y_train)
print("The model performance on the training set")
scores_(lin_model, X_train, Y_train)
print("--------------------------------------")
print("The model performance on the testing set")
scores_(lin_model, X_test, Y_test)
