Linear Regression is a linear approximation of a causal relationship between two or more variables.
Process –
1 – Get sample data
2 – Design a model that works on that sample
3 – Make predictions for the whole population
Dependent variable (predicted) – Y
Independent variables (predictors) – x1, x2, …, xn
Y = F(x1, x2, …, xn)
We look for causality in that case: the predictors are assumed to drive the predicted variable.
For example, an individual's income may be related to the number of years they have studied:
y = B + 5000(x) + E
With no years of study (x = 0) this reduces to y = B + 0 + E.
y – Income
x – Years studied
B – Minimum wage (the income when x = 0)
E – Error term
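As a minimal sketch of this example (the value of B below is an assumed, illustrative minimum wage, and the error term E is ignored):
B = 15000                        # assumed minimum wage (illustrative value only)
def predicted_income(years_studied):
    # y = B + 5000 * x, ignoring the error term E
    return B + 5000 * years_studied
print(predicted_income(0))       # with x = 0 the income is just B
print(predicted_income(4))       # each extra year of study adds 5000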
Correlation vs Causality
Correlation does not imply causality.
Causality runs one way, whereas correlation is symmetric – it runs both ways.
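A small sketch with made-up numbers shows that correlation is symmetric (swapping the variables gives the same value), which is one reason it cannot tell us the direction of causation:
import numpy as np
years = np.array([0, 2, 4, 6, 8])          # made-up years of study
income = np.array([15, 25, 34, 46, 55])    # made-up income in $1000s
print(np.corrcoef(years, income)[0, 1])    # corr(years, income)
print(np.corrcoef(income, years)[0, 1])    # corr(income, years) – same value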
1) Sum of Squares Total (SST) or Total Sum of Squares (TSS) – the measure of the total variability of the data around the mean.
2) Sum of Squares Regression (SSR) or Explained Sum of Squares (ESS) – the variability of the predicted values around the mean. If SSR equals SST, the model explains all of the variability and is perfect.
3) Sum of Squares Error (SSE) or Residual Sum of Squares (RSS) – the variability left unexplained by the regression, i.e. the squared differences between the observed and predicted values.
SST = SSR + SSE. The smaller the SSE, the more of the variability is explained by the regression model.
R Squared (goodness of fit) – to know how powerful the regression line is we use a statistic called R squared. It is defined as the ratio of the variability explained by the regression to the total variability of the dataset: R Squared = SSR/SST.
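A minimal sketch of these quantities with NumPy on made-up data (np.polyfit is used here only to produce predictions from a simple least squares line, so that SST = SSR + SSE holds):
import numpy as np
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])     # made-up predictor
y = np.array([2.0, 4.5, 5.0, 8.0, 9.5])     # made-up observed values
slope, intercept = np.polyfit(x, y, 1)      # simple least squares fit
y_pred = slope * x + intercept              # predicted values
SST = np.sum((y - y.mean()) ** 2)           # total variability around the mean
SSR = np.sum((y_pred - y.mean()) ** 2)      # variability explained by the regression
SSE = np.sum((y - y_pred) ** 2)             # residual (unexplained) variability
print(SST, SSR, SSE)                        # SST = SSR + SSE (up to rounding)
print('R squared:', SSR / SST)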
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_style('whitegrid')
%matplotlib inline
from sklearn.datasets import load_boston
boston = load_boston()
print(boston.DESCR)
plt.hist(boston.target, bins=50)
plt.xlabel('Price of houses in $1000s')
plt.ylabel('Number of houses')
#Let us have a scatter plot of a single feature against the target.
#We will plot the price of the housing against the number of rooms in the dwelling.
plt.scatter(boston.data[:,5],boston.target)
plt.ylabel('Price in $1000s')
plt.xlabel('Number of rooms')
#We can see a positive correlation between rooms and price of house.
boston_df = pd.DataFrame(boston.data)
boston_df.columns = boston.feature_names
boston_df.head()
boston_df['Price'] = boston.target
boston_df.head()
from IPython.display import Image
url = 'https://upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Linear_least_squares_example2.svg/200px-Linear_least_squares_example2.svg.png'
Image(url)
#A plot of the data points (in red), the least squares line of best fit (in blue), and the residuals (in green)
#The line that minimises the sum of the squared vertical distances (residuals) from the points is the best fit line.
X = np.vstack(boston_df.RM)    # reshape RM into an (n, 1) column vector
X.shape
Y = np.array(boston_df.Price)
Y.shape
#Add a column of ones so that lstsq can estimate an intercept as well as a slope.
A = np.array([[value, 1] for value in boston_df.RM])
A.shape
result = np.linalg.lstsq(A, Y, rcond=None)
m, b = result[0]                 # slope and intercept of the best fit line
error_total = result[1]          # sum of squared residuals
rmse = np.sqrt(error_total / len(A))
rmse
plt.plot(boston_df.RM, boston_df.Price, 'o')
plt.plot(boston_df.RM, m * boston_df.RM + b, 'r', linewidth=2)
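#As a quick sketch, we can tie this back to the R squared idea above:
#R squared = SSR/SST, using m and b from the lstsq fit.
y_hat = m * boston_df.RM + b                  # predictions of the best fit line
SST = np.sum((Y - Y.mean()) ** 2)             # total sum of squares
SSR = np.sum((y_hat - Y.mean()) ** 2)         # explained sum of squares
print('R squared : {:.3f}'.format(SSR / SST))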
#Now using Scikit Learn
import sklearn
from sklearn.linear_model import LinearRegression
lreg = LinearRegression()
X_multi = boston_df.drop('Price', axis=1)
Y_target = boston_df.Price
lreg.fit(X_multi, Y_target)
print('Intercept is : {}'.format(lreg.intercept_))
print('Number of coefficients are : {}'.format(len(lreg.coef_)))
coeff_df = pd.DataFrame(X_multi.columns)
coeff_df.columns = ['Features']
coeff_df['Coefficient Estimate'] = pd.Series(lreg.coef_)
coeff_df
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, boston_df.Price)
lreg.fit(X_train, Y_train)
print(X_train.shape, X_test.shape, Y_train.shape, Y_test.shape)
print('Intercept is : {}'.format(lreg.intercept_))
print('Number of coefficients are : {}'.format(len(lreg.coef_)))
pred_train = lreg.predict(X_train)
pred_test = lreg.predict(X_test)
df_comp = pd.DataFrame({'Actual': Y_test, 'Predicted':pred_test})
#df_comp
plt.scatter(X_test, Y_test, color='gray')
plt.plot(X_test, pred_test, color='red', linewidth=2)
plt.show()
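#As a follow-up sketch, compare the mean squared error on the training and
#testing sets; a large gap between the two would suggest overfitting.
from sklearn import metrics
print('MSE on training data : {:.2f}'.format(metrics.mean_squared_error(Y_train, pred_train)))
print('MSE on testing data  : {:.2f}'.format(metrics.mean_squared_error(Y_test, pred_test)))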