Regression analysis is a statistical process used to estimate the relationships between a dependent variable and one or more independent variables. It is mostly used for prediction and forecasting, which overlaps with machine learning. In this task we will experiment with some linear regression use cases.
The objective of LinearRegression is to fit a linear model to the dataset by adjusting a set of parameters so that the sum of the squared residuals of the model is as small as possible.
A simple linear model is defined by: y = b0 + b1*x, where y is the target variable, x is the data, and b0 and b1 are the coefficients (the intercept and the slope).
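As a minimal sketch of what minimizing the squared residuals means, here is a tiny NumPy example (the x and y values below are made up for illustration):
import numpy as np
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0]) # toy data (hypothetical values)
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
b1, b0 = np.polyfit(x, y, 1) # np.polyfit with degree 1 returns [slope, intercept] of the least-squares line
residuals = y - (b0 + b1 * x)
print(b0, b1, (residuals ** 2).sum()) # the fitted line makes this sum as small as possible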
Let's try and predict something using linear regression.
The Salary dataset consists of two variables, [YearsExperience, Salary]; the goal is to predict the salary someone will earn from their years of experience.
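As a minimal sketch of that task (assuming a salary.csv file with YearsExperience and Salary columns; the file name is hypothetical):
import pandas as pd
from sklearn.linear_model import LinearRegression
salary_df = pd.read_csv('salary.csv')
X = salary_df[['YearsExperience']].values # features must be a 2D array
y = salary_df['Salary'].values
salary_model = LinearRegression().fit(X, y)
print(salary_model.predict([[5.0]])) # predicted salary for 5 years of experience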
We will kick off with a very famous dataset loved by machine learning practitioners: the iris dataset.
Let's get to know the data and have fun with it.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
iris_df = pd.read_csv('iris.csv') # a quick peek at the raw CSV version of the data
iris_df.head()
data = load_iris() # load the same dataset through sklearn, which exposes handy attributes
data.feature_names # the features (columns) are the independent variables
data.target_names # the names of the species, i.e. our target, the dependent variable
data.target # calling target on the dataset gives the numeric (dummy) representation
# of the values in the dependent column
X = data.data # data refers to the values in the independent columns
X.shape # check the shape - happy with that
y = data.target # collecting the numeric representation of the dependent values
y.shape # check the shape... not happy. let's reshape to 2D
# sklearn expects features as a 2D array; here we also reshape y into a column vector for consistency
y = y.reshape(-1, 1)
y.shape # now it's 2D
plt.figure(figsize=(18,8), dpi=100) # set the canvas size for visibility
plt.scatter(X.T[0], X.T[2]) # transpose the data with .T, then take the columns at index 0 (sepal length) and 2 (petal length)
plt.title('Iris petal and sepal length', fontsize=20) # set the title of the plot and bump the font size for readability
# then set the axis labels (just to be explicit)
plt.ylabel('Petal Length')
plt.xlabel('Sepal Length')
We can't really see how the irises are grouped, but we can clearly see that there is a linear relationship here.
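To put a rough number on that relationship, here's a quick sketch using NumPy's Pearson correlation:
np.corrcoef(X.T[0], X.T[2]) # correlation matrix for sepal length vs petal length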
from sklearn.model_selection import train_test_split # the tool for splitting the data
from sklearn.linear_model import LinearRegression # since we're going to use linear regression for our prediction, we import the class as well
# split the data into training and testing sets for both X and y
X_train,X_test,y_train,y_test = train_test_split(X,y, test_size = 0.20)
lr = LinearRegression() #create our linear model
# fit the model on the training data, then try to predict X_test
iris_model = lr.fit(X_train, y_train)
predictions = iris_model.predict(X_test)
# plotting the error in our predictions
plt.errorbar(range(1, len(y_test)+1), y_test.ravel(), yerr=np.abs(y_test - predictions).ravel(), fmt='^k', ecolor='red')
from sklearn.metrics import r2_score # this function will help us calculate the score of our predictions
r2_score(y_test, predictions)
# to get the RMSE we first take the difference between y_test and the predictions, square it,
# then take the mean, and finally apply the numpy square root function.
np.sqrt(((predictions - y_test)**2).mean())
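Equivalently, a quick sketch using sklearn's mean_squared_error and taking the square root:
from sklearn.metrics import mean_squared_error
np.sqrt(mean_squared_error(y_test, predictions)) # same RMSE as the manual computation above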
#Importing the dataset
data = pd.read_csv("Advertising.csv")
data.head()
x = data.iloc[:,2].values # column 2 holds the radio budget (assuming the standard Advertising.csv layout)
y = data.iloc[:,4].values # column 4 holds the sales
print(x.shape, y.shape)
y = y.reshape(-1, 1) # reshape both into 2D column vectors for sklearn
x = x.reshape(-1, 1)
print(x.shape, y.shape)
# import from sklearn the linear regression model that will help us in this analysis.
from sklearn.linear_model import LinearRegression
# create an unfitted linear regression model like below and give it a descriptive variable name
radio_model = LinearRegression()
# to train the model, we call fit(x, y)
radio_model.fit(x, y)
y_pred = radio_model.predict(x)
plt.scatter(x, y, color='b')
plt.plot(x, y_pred, color='r') # the fitted regression line
plt.title('Sales vs Radio Budget')
plt.xlabel('Radio Budget')
plt.ylabel('Sales')
plt.show()
# The coefficients
print('Coefficients: \n', radio_model.coef_)
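As a quick sanity check, here's a sketch predicting the sales for a single hypothetical radio budget (the value 25 is arbitrary):
radio_model.predict([[25.0]]) # expected sales for a radio budget of 25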
import seaborn as sns
wage_df = pd.read_csv('hourlywagedata.csv')
wage_df.head()
wage_df.info()
The hourwage column has type object. To be able to do calculations on hourwage we need to convert it to float, and for that we're going to use the to_numeric function. It is effective here because it can detect non-numeric values in the column and, with errors='coerce', convert them to NaN.
wage_df['hourwage'] = pd.to_numeric(wage_df['hourwage'], errors='coerce')
# Now the hourwage column is a float; we can even check that
wage_df.info()
wage_df.isnull().sum() # after the conversion we check whether the dataset is still free of missing values (which is not the case)
For simplicity we're just going to drop the rows with missing values. In this case it may also be the best decision, considering the small share of missing values (89 rows out of 3000).
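A quick sketch to confirm that share:
wage_df['hourwage'].isnull().mean() # fraction of rows with a missing hourwage, roughly 89/3000 ≈ 3%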
wage_df.dropna(inplace=True) # clean the df by dropping the rows with missing values
wage_df.shape
Let's do a short data exploration to see if we can discover trends in our data before doing any predictions.
plt.figure(figsize=(20,7)) # create a plotting area and set its size
# then generate three plots showing the average hourly wage against the three categorical independent variables
for i, col, e in zip(range(1, 4), ['b', 'r', 'g'], wage_df.columns):
    plt.subplot(1, 3, i)
    wage_df['hourwage'].groupby(wage_df[e]).mean().plot(kind='bar', color=col)
    plt.title(e + ' VS Hourly wage')
    plt.ylabel('hourly wage')
From the three plots above we can see that older people are more likely to earn more than younger ones, especially if they are in position 0.
To be even more accurate, we're going to do a multiple linear regression on the dataset using sklearn.
First we're going to collect the data in a format that is easy to fit into a linear model.
# collecting the necessary variables: X as the independents and y as the dependent
X = wage_df.iloc[:, 0:3].values
y = wage_df.iloc[:,-1].values.reshape(-1,1)
print(X.shape, y.shape)
We're going to take these simple steps to predict the hourly wage of a person.
#over here we split the data into training and testing groups
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size = 0.20)
linearRegr = LinearRegression()
model = linearRegr.fit(x_train, y_train)
wage_predic = model.predict(x_test)
#wage_predic
To evaluate our model, we're going to take the following steps:
plt.figure(figsize=(18,9))
plt.errorbar(range(1, len(y_test)+1), y_test.ravel(), yerr=np.abs(y_test - wage_predic).ravel(), fmt='.k', ecolor='red')
from sklearn.metrics import r2_score # this function will help us calculate the score of our predictions
r2_score(y_test, wage_predic)
# to get the RMSE we first take the difference between y_test and the predictions, square it,
# then take the mean, and finally apply the numpy square root function.
np.sqrt(((wage_predic - y_test)**2).mean())
print('Intercept: \n', model.intercept_)
print('Coefficients: \n', model.coef_)
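To see the model in action, here's a quick sketch predicting a single row from the test set and comparing it to the true value:
print(model.predict(x_test[:1]), y_test[:1]) # predicted vs actual hourly wage for one person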
Throughout this article we demonstrated how to use sklearn's LinearRegression model to predict continuous variables. In the next article we'll see how to do classification with sklearn.
Please leave a comment if you have any questions or if you would like me to explain a particular point in more detail.