Linear Regression

Linear regression is a linear approximation of a causal relationship between two or more variables.

Process –
1 – Get sample data
2 – Design a model that works on that sample
3 – Make predictions for the whole population
Dependent variable (predicted) – Y
Independent variables (predictors) – x1, x2, …, xn
Y = F(x1, x2, …, xn)
We look for causality in this case.
For example, salary is related to the number of years an individual has studied.
y = B + 5000(x) + E
When x = 0: y = B + E
y – Income
x – Years of education
B – Minimum wage (the baseline income when x = 0)
E – Error term
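
As a quick illustration, a toy version of this model can be simulated. The baseline income, slope, and noise level below are made-up assumptions, chosen only to show the structure y = B + 5000(x) + E.
In [ ]:
import numpy as np
# Toy simulation of the income model above; all numbers are illustrative assumptions.
minimum_wage = 15000                                # B: baseline income when x = 0
slope = 5000                                        # assumed income gain per year of education
years_of_education = np.array([0, 2, 4, 6, 8])      # x
E = np.random.normal(0, 1000, size=years_of_education.shape)  # error term
income = minimum_wage + slope * years_of_education + E        # y
print(income)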
Correlation vs Causality
Correlation does not imply causality.
Causality is one-way, whereas correlation is symmetric (it goes both ways).
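
A small sketch with made-up numbers showing that the correlation coefficient is symmetric in its two arguments, so it cannot by itself tell us which variable causes the other.
In [ ]:
import numpy as np
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 6])
# Pearson correlation is symmetric: corr(x, y) == corr(y, x)
print(np.corrcoef(x, y)[0, 1])
print(np.corrcoef(y, x)[0, 1])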

1) Sum of Squares Total (SST), or Total Sum of Squares (TSS) – the measure of the total variability of the observed values around their mean.

2) Sum of Squares Regression (SSR), or Explained Sum of Squares (ESS) – the total variability of the predicted values around the mean. If SSR equals SST, the model explains all of the variability and is perfect.

3) Sum of Squares Error (SSE), or Residual Sum of Squares (RSS) – the variability left unexplained by the model, i.e. the sum of squared differences between the observed and predicted values.

SST = SSR + SSE. The smaller the SSE, the more of the variability is explained by the regression model.

R Squared (goodness of fit) – To measure how well the regression line fits the data we use a statistic called R squared. It is defined as the ratio of the variability explained by the regression to the total variability of the dataset: R Squared = SSR/SST.
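
A small sketch on made-up data of how these quantities fit together; np.polyfit is used here only to obtain an ordinary least-squares line.
In [ ]:
import numpy as np
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 6.0])
b1, b0 = np.polyfit(x, y, 1)            # slope and intercept of the fitted line
y_pred = b1 * x + b0

sst = np.sum((y - y.mean()) ** 2)       # total variability around the mean
ssr = np.sum((y_pred - y.mean()) ** 2)  # variability explained by the regression
sse = np.sum((y - y_pred) ** 2)         # residual (unexplained) variability

print(sst, ssr, sse)                    # for OLS with an intercept, SST = SSR + SSE
print('R squared:', ssr / sst)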

In [2]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt 
sns.set_style('whitegrid')
%matplotlib inline
from sklearn.datasets import load_boston
In [3]:
boston = load_boston()
print(boston.DESCR)
.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's

    :Missing Attribute Values: None

    :Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.
https://archive.ics.uci.edu/ml/machine-learning-databases/housing/


This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980.   N.B. Various transformations are used in the table on
pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression
problems.   
     
.. topic:: References

   - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
   - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.

In [4]:
plt.hist(boston.target, bins=50)
plt.xlabel('Price of houses in $1000s')
plt.ylabel('Number of houses')
Out[4]:
Text(0, 0.5, 'Number of houses')
In [5]:
#Let us have a scatter plot of a single feature with the target.
#We will plot the price of the house against the number of rooms in the dwelling.
plt.scatter(boston.data[:,5],boston.target)
plt.ylabel('Price in $1000s')
plt.xlabel('Number of rooms')
#We can see a positive correlation between rooms and price of house.
Out[5]:
Text(0.5, 0, 'Number of rooms')
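One way to quantify that visual impression is the Pearson correlation between RM and the price, for example:
In [ ]:
# Correlation between the average number of rooms (column 5, RM) and the target price
print('Correlation between RM and price:', np.corrcoef(boston.data[:, 5], boston.target)[0, 1])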
In [6]:
boston_df = pd.DataFrame(boston.data)
boston_df.columns = boston.feature_names
boston_df.head()
Out[6]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33
In [102]:
boston_df['Price'] = boston.target
boston_df.head()
Out[102]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT Price
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98 24.0
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14 21.6
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03 34.7
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94 33.4
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33 36.2
In [8]:
from IPython.display import Image
url = 'https://upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Linear_least_squares_example2.svg/200px-Linear_least_squares_example2.svg.png'
Image(url)
#A plot of the data points (in red), the least squares line of best fit (in blue), and the residuals (in green)
#The line that minimizes the sum of squared vertical distances (residuals) is the best-fit line.
Out[8]:
In [9]:
X = boston_df.RM
In [93]:
X = np.vstack(boston_df.RM)
X.shape
Out[93]:
(506, 1)
In [94]:
Y = np.array(boston_df.Price)
Y.shape
Out[94]:
(506,)
In [66]:
#An attempt to append a bias (intercept) column of ones to X. Because each value in
#the vstacked X is itself a length-1 array, the result has dtype=object (see below),
#so the line was left commented out.
#X = np.array( [ [ value,1] for value in X])

#X
Out[66]:
array([[array([6.575]), 1],
       [array([6.421]), 1],
       [array([7.185]), 1],
       ...,
       [array([6.976]), 1],
       [array([6.794]), 1],
       [array([6.03]), 1]], dtype=object)
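A minimal sketch of one way the intercept column could be added cleanly; np.hstack with a column of ones avoids the object-dtype problem above, and lstsq then returns both a slope and an intercept.
In [ ]:
# Append a column of ones to the (506, 1) feature matrix so the least-squares
# solution contains [slope, intercept] rather than a single slope through the origin.
X_design = np.hstack([X, np.ones((X.shape[0], 1))])
slope, intercept = np.linalg.lstsq(X_design, Y, rcond=None)[0]
print(slope, intercept)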
In [95]:
result = np.linalg.lstsq(X, Y, rcond=None)  # rcond=None uses the new default and avoids the FutureWarning
error_total = result[1]                     # sum of squared residuals (SSE)
rmse = np.sqrt(error_total/len(X))          # root mean squared error; note X has no intercept column here
rmse
Out[95]:
array([7.64268509])
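To visualise that fit, one could plot the resulting line over the rooms-vs-price scatter; a minimal sketch using the slope returned by lstsq above (since X had no intercept column, the line passes through the origin).
In [ ]:
slope = result[0][0]                      # single coefficient from the no-intercept fit above
plt.scatter(boston_df.RM, boston_df.Price)
xs = np.linspace(boston_df.RM.min(), boston_df.RM.max(), 100)
plt.plot(xs, slope * xs, color='red', linewidth=2)
plt.xlabel('Number of rooms')
plt.ylabel('Price in $1000s')
plt.show()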
In [ ]:
#Now using Scikit Learn
In [119]:
import sklearn
from sklearn.linear_model import LinearRegression
lreg = LinearRegression()
X_multi = boston_df.drop('Price', axis=1)
Y_target = boston_df.Price
lreg.fit(X_multi, Y_target)
#print('Intercept is : {}'.format(lreg.intercept_))
#print('Number of coefficients are : {}'.format(len(lreg.coef_)))
In [108]:
coeff_df = pd.DataFrame(X_multi.columns)   # use the feature columns only (boston_df still contains the target)
coeff_df.columns = ['Features']
coeff_df['Coefficient Estimate'] = pd.Series(lreg.coef_) 
coeff_df
Out[108]:
Features Coefficient Estimate
0 CRIM -0.108011
1 ZN 0.046420
2 INDUS 0.020559
3 CHAS 2.686734
4 NOX -17.766611
5 RM 3.809865
6 AGE 0.000692
7 DIS -1.475567
8 RAD 0.306049
9 TAX -0.012335
10 PTRATIO -0.952747
11 B 0.009312
12 LSTAT -0.524758
In [128]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, boston_df.Price)
lreg.fit(X_train, Y_train)

print(X_train.shape, X_test.shape, Y_train.shape, Y_test.shape)
print('Intercept is : {}'.format(lreg.intercept_))
print('Number of coefficients are : {}'.format(len(lreg.coef_)))
pred_train = lreg.predict(X_train)
pred_test = lreg.predict(X_test)
df_comp = pd.DataFrame({'Actual': Y_test, 'Predicted':pred_test})
#df_comp

plt.scatter(X_test, Y_test,  color='gray')
plt.plot(X_test, pred_test, color='red', linewidth=2)
plt.show()
(379, 1) (127, 1) (379,) (127,)
Intercept is : -37.855302800021974
Number of coefficients are : 1
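To judge how well this single-feature model generalises, one could compare the error on the training and test predictions computed above; a minimal sketch:
In [ ]:
# Mean squared error on training and test data for the single-feature fit
mse_train = np.mean((Y_train - pred_train) ** 2)
mse_test = np.mean((Y_test - pred_test) ** 2)
print('Training MSE:', mse_train)
print('Test MSE:', mse_test)
# R squared on the test set via the estimator's built-in score method
print('Test R squared:', lreg.score(X_test, Y_test))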