Linear Regression

Regression analysis is a statistical process used to estimate the relationships between a dependent variable and one or more independent variables. It is mostly used for prediction and forecasting, which overlaps with machine learning. In this task we will experiment with some linear regression use cases.

The objective of LinearRegression is to fit a linear model to the dataset by adjusting a set of parameters in order to make the sum of the squared residuals of the model as small as possible.

A linear model is defined by y = b0 + b1*x, where y is the target variable, x is the data, and b0 and b1 are the coefficients (the intercept and the slope). With several features this extends to y = b0 + b1*x1 + ... + bn*xn.
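
As a tiny illustration, here's the formula in code, with made-up values for the coefficients (a sketch, not fitted to any data):

In [ ]:
import numpy as np

b0, b1 = 2.0, 0.5               # made-up intercept and slope
x = np.array([1.0, 2.0, 3.0])   # toy data
b0 + b1 * x                     # -> array([2.5, 3. , 3.5])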

Let's try to predict something using linear regression.

We will kick off with a very famous dataset loved by machine learning practitioners: the Iris dataset. After that we'll fit a simple linear model on the Advertising dataset, and finish with a multiple linear regression on an hourly-wage dataset.

Let's get to know the data and have fun with it.

Iris Data Set Linear Regression

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
In [2]:
iris_df = pd.read_csv('iris.csv')
iris_df.head()
Out[2]:
Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species
0 1 5.1 3.5 1.4 0.2 Iris-setosa
1 2 4.9 3.0 1.4 0.2 Iris-setosa
2 3 4.7 3.2 1.3 0.2 Iris-setosa
3 4 4.6 3.1 1.5 0.2 Iris-setosa
4 5 5.0 3.6 1.4 0.2 Iris-setosa

We loaded the iris dataset from a local CSV above; in case you do not have the file, you can load it directly from sklearn:

In [3]:
data = load_iris() 
data.feature_names   # the 'features' are the columns holding the independent variables
Out[3]:
['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']
In [4]:
data.target_names   # and over here we have the names of the species, our target: the dependent values
Out[4]:
array(['setosa', 'versicolor', 'virginica'], dtype='<U10')
In [5]:
data.target   # calling target on the dataset gives the numeric representations
              # (dummy codes) of the values in the dependent column
Out[5]:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
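
If we ever want the species names back, here's a quick sketch that maps the codes to names by indexing target_names with target:

In [ ]:
data.target_names[data.target][:5]   # the first five are all 'setosa'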
In [102]:
X = data.data   # .data holds the values of the independent columns
X.shape         # check the shape; happy with that
Out[102]:
(150, 4)
In [103]:
y = data.target   # collecting the numeric representation of the dependent values
y.shape           # check the shape... not happy, let's reshape to 2D
Out[103]:
(150,)
In [104]:
# because sklearn doesn't like 1D arrays or vectors we're going to reshape it
y = y.reshape(-1, 1)
y.shape               # get it to 2D
Out[104]:
(150, 1)
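
A quick toy sketch of what reshape(-1, 1) does; the -1 tells numpy to infer the number of rows:

In [ ]:
v = np.array([1, 2, 3])   # a 1D vector, shape (3,)
v.reshape(-1, 1)          # a column vector, shape (3, 1)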

We're going to plot the sepal length against the petal length to check whether the data is linear.

In [81]:
plt.figure(figsize=(18,8),dpi=100)   # set the canvas size for visibility

plt.scatter(X.T[0],X.T[2])   # use the .T attribute to transpose the data, then take the columns at index 0 and 2
plt.title('IRIS Petal and sepal length', fontsize=20) # set the title of the plot and bump the font size for readability

# then we set the labels (just to be obvious)
plt.ylabel('Petal Length') 
plt.xlabel('sepal length')
Out[81]:
Text(0.5, 0, 'sepal length')

We can't really see how the irises are grouped, but we can clearly see that there is a linear relationship here.
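
If you do want to see the grouping, here's a small sketch that colours each point by its species code:

In [ ]:
plt.figure(figsize=(18,8), dpi=100)
plt.scatter(X.T[0], X.T[2], c=data.target)   # colour by species code
plt.xlabel('sepal length')
plt.ylabel('petal length')
plt.show()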

Let's start the prediction.

We're going to take these simple steps to make the prediction:

  • split the data using the train_test_split() method from scikit-learn
  • then build the model and fit (train) it on our training data
  • last but not least, generate predictions
In [105]:
from sklearn.model_selection import train_test_split    # the tool for splitting the data
from sklearn.linear_model import LinearRegression       # we know we're going to use linear regression for our prediction, so we import the class as well


# over here we split the data into training and testing sets for both X and y
X_train,X_test,y_train,y_test = train_test_split(X,y, test_size = 0.20) 
In [106]:
lr = LinearRegression()        # create our linear model

# fit the model on the training data, then predict on X_test
iris_model = lr.fit(X_train, y_train)
predictions = iris_model.predict(X_test)
In [48]:
# plotting the error in our predictions
plt.errorbar(range(1, len(y_test)+1), y_test, yerr=(y_test-predictions), fmt='^k', ecolor='red')
Out[48]:
<ErrorbarContainer object of 3 artists>
In [50]:
from sklearn.metrics import r2_score   # this function computes the R² score of our predictions

r2_score(y_test, predictions)
Out[50]:
0.904901491129183
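
As a sanity check, here's a minimal sketch of what r2_score computes, assuming the usual definition R² = 1 - SS_res / SS_tot:

In [ ]:
ss_res = ((y_test - predictions) ** 2).sum()     # residual sum of squares
ss_tot = ((y_test - y_test.mean()) ** 2).sum()   # total sum of squares
1 - ss_res / ss_tot                              # should match r2_score above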
In [51]:
# to get the RMSE we first take the difference between the predictions and y_test, square it,
# then take the mean, and finally apply the numpy square root function.
np.sqrt(((predictions - y_test)**2).mean()) 
Out[51]:
0.24520071494252943
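
For reference, sklearn can compute the same thing via mean_squared_error; we just take the square root of its output:

In [ ]:
from sklearn.metrics import mean_squared_error

np.sqrt(mean_squared_error(y_test, predictions))   # should match the manual RMSE above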

Radio Simple Linear Model

In [111]:
#Importing the dataset
data = pd.read_csv("Advertising.csv")
data.head()
Out[111]:
Unnamed: 0 TV radio newspaper sales
0 1 230.1 37.8 69.2 22.1
1 2 44.5 39.3 45.1 10.4
2 3 17.2 45.9 69.3 9.3
3 4 151.5 41.3 58.5 18.5
4 5 180.8 10.8 58.4 12.9
In [112]:
x = data.iloc[:,2].values   # the radio budget column (index 2)
y = data.iloc[:,4].values   # the sales column (index 4)
print(x.shape, y.shape)
(200,) (200,)
In [113]:
y=y.reshape(-1, 1)
x=x.reshape(-1, 1)
print(x.shape, y.shape)
(200, 1) (200, 1)
In [114]:
# import from sklearn the linear regression model that will help us in this analysis.

from sklearn.linear_model import LinearRegression

# create an empty linear regression model like below and give it a good variable name
radio_model = LinearRegression()

# to create the model, we use fit(x,y)
radio_model.fit(x,y)


y_pred = radio_model.predict(x)
plt.scatter(x,y,color = 'b')
plt.plot(x,radio_model.predict(x),color = 'r')
plt.title('Sales v/s Radio Budget')
plt.xlabel('Radio Budget')
plt.ylabel('Sales')
plt.show()
In [115]:
# The coefficients
print('Coefficients: \n', radio_model.coef_)
Coefficients: 
 [[0.20249578]]
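
With the coefficient and intercept in hand we can read the fitted line directly. As a small sketch with a made-up radio budget of 30:

In [ ]:
radio_model.predict([[30]])   # predicted sales for a hypothetical radio budget of 30
# equivalently: radio_model.intercept_ + radio_model.coef_[0][0] * 30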

Multiple Linear Regression on the hourlywage Data Set

In [53]:
import seaborn as sns
wage_df = pd.read_csv('hourlywagedata.csv')
wage_df.head()
Out[53]:
position agerange yrsscale hourwage
0 1 1 2 13.736234054538
1 0 1 2 16.4407309689108
2 0 1 3 21.3891077239505
3 1 1 1 11.377187468408
4 0 1 3 21.5607775454338
In [54]:
wage_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000 entries, 0 to 2999
Data columns (total 4 columns):
position    3000 non-null int64
agerange    3000 non-null int64
yrsscale    3000 non-null int64
hourwage    3000 non-null object
dtypes: int64(3), object(1)
memory usage: 93.8+ KB

The hourwage column has type object. To be able to do calculations on it we need to convert it to float, and for that we're going to use the to_numeric function. It is effective because it can pick up non-numeric values in the column and, with errors='coerce', convert them to NaN.
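
Here's a toy illustration of errors='coerce' on a made-up Series; the non-numeric entry becomes NaN:

In [ ]:
pd.to_numeric(pd.Series(['12.5', 'oops', '7']), errors='coerce')   # -> 12.5, NaN, 7.0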

In [55]:
wage_df['hourwage'] = pd.to_numeric(wage_df['hourwage'], errors='coerce') 
In [56]:
# now the hourwage column is float64; we can even check that
wage_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000 entries, 0 to 2999
Data columns (total 4 columns):
position    3000 non-null int64
agerange    3000 non-null int64
yrsscale    3000 non-null int64
hourwage    2911 non-null float64
dtypes: float64(1), int64(3)
memory usage: 93.8 KB

check for missing values and dealing with them

In [57]:
wage_df.isnull().sum()   # after the conversion we check whether the data set is still free of missing values (it is not)
Out[57]:
position     0
agerange     0
yrsscale     0
hourwage    89
dtype: int64

For simplicity we're just going to drop the rows with missing values... in this case it is arguably the best decision, considering the small share of missing values (89 of 3000 rows, about 3%).
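
For reference, here's a sketch of the mean-imputation alternative (left commented out, since we drop the rows instead):

In [ ]:
# wage_df['hourwage'].fillna(wage_df['hourwage'].mean(), inplace=True)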

In [59]:
wage_df.dropna(inplace = True)   # we clean the df by dropping the rows with missing values
In [60]:
wage_df.shape
Out[60]:
(2911, 4)

Exploratory data analysis

Let's do a short data exploration to see if we can discover trends in our data before making any predictions.

In [64]:
plt.figure(figsize=(20,7))   # over here we create a plotting area and set the size


# then generate three plots showing the average hourly wage against each of the three categorical independent variables


for i,col,e in zip(range(1,4),['b','r','g'] , wage_df.columns):
    print(i,e)
    
    plt.subplot(1,3,i)
    
    wage_df['hourwage'].groupby(wage_df[e]).mean().plot(kind='bar',color=col)
    plt.title(e+' VS Hourly wage')
    plt.ylabel('hourly wage')
1 position
2 agerange
3 yrsscale

From the three plots above we can see that older people are more likely to earn more than younger ones, especially if they are in position 0.

To be even more accurate, we're going to do a multiple linear regression on the data set using sklearn.

First, we're going to collect the data in a format that is easy to fit to a linear model.

In [35]:
# collecting the necessary variables: X as the independents and y as the dependent
X = wage_df.iloc[:, 0:3].values
y = wage_df.iloc[:,-1].values.reshape(-1,1)
print(X.shape, y.shape)
(2911, 3) (2911, 1)

We're going to take these simple steps to predict the hourly wage of a person:

  • split the data using the train_test_split() method from scikit-learn
  • then build the model and fit (train) it on our training data
  • last but not least, generate predictions
In [66]:
#over here we split the data into training and testing groups

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size = 0.20)

linearRegr = LinearRegression()


model = linearRegr.fit(x_train, y_train)
wage_predic = model.predict(x_test)
#wage_predic

Now onto the results

We're going to take the following steps to evaluate our model:

  • plot the error bars; this gives us a visual idea of how the model performs
  • use r2_score from scikit-learn to get a score of our model's performance
  • get the RMSE (root mean square error)
  • print the coefficients and intercept of our model
In [68]:
plt.figure(figsize=(18,9))
plt.errorbar(range(1, len(y_test)+1), y_test, yerr=(y_test-wage_predic), fmt='.k', ecolor='red')
Out[68]:
<ErrorbarContainer object of 3 artists>
In [73]:
from sklearn.metrics import r2_score   # this function computes the R² score of our predictions

r2_score(y_test, wage_predic)
Out[73]:
0.8697818006651469
In [74]:
# to get the RMSE we first take the difference between the predictions and y_test, square it,
# then take the mean, and finally apply the numpy square root function.
np.sqrt(((wage_predic - y_test)**2).mean()) 
Out[74]:
0.25910367828636527
In [75]:
print('Intercept: \n', model.intercept_)
print('Coefficients: \n', model.coef_)
Intercept: 
 0.15635581712880287
Coefficients: 
 [-0.11731752 -0.02975806  0.24581712  0.59054998]
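
As a last sanity check, here's a small sketch that applies the fitted equation by hand (y = intercept + X·coef, assuming the model was fit on the three feature columns above) and compares it with the model's predictions:

In [ ]:
manual = model.intercept_ + x_test @ model.coef_.T   # the linear equation applied to the test set
np.allclose(manual, wage_predic)                     # should print True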

Conclusion

Throughout this article we demonstrated how to use sklearn's LinearRegression model to predict continuous variables... in the next article we'll see how to do classification with sklearn.

Please leave me a comment if you have any questions or if you would like me to explain a particular point in more detail.
