Linear Regression

Linear regression is a linear approximation of a causal relationship between two or more variables.

Process –
1 – Get sample data
2 – Design a model that works on that sample
3 – Make predictions for the whole population
Dependent variable (predicted) – Y
Independent variables (predictors) – x1, x2, …, xn
Y = F(x1, x2, …, xn)
We look for causality in this case.
For example, salary is related to the number of years an individual has studied.
y = B + 5000(x) + E
When x = 0: y = B + E
y – Income
x – Years of education
B – Minimum wage (the baseline income when x = 0)
E – Error term
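
As a quick illustration, a toy version of this model can be simulated. The baseline income, slope, and noise level below are made-up assumptions, chosen only to show the structure y = B + 5000(x) + E.
In [ ]:
import numpy as np
# Toy simulation of the income model above; all numbers are illustrative assumptions.
minimum_wage = 15000                                # B: baseline income when x = 0
slope = 5000                                        # assumed income gain per year of education
years_of_education = np.array([0, 2, 4, 6, 8])      # x
E = np.random.normal(0, 1000, size=years_of_education.shape)  # error term
income = minimum_wage + slope * years_of_education + E        # y
print(income)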
Correlation vs Causality
Correlation does not imply causality.
Causality is one-way, whereas correlation is symmetric (it goes both ways).
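
A small sketch with made-up numbers showing that the correlation coefficient is symmetric in its two arguments, so it cannot by itself tell us which variable causes the other.
In [ ]:
import numpy as np
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 6])
# Pearson correlation is symmetric: corr(x, y) == corr(y, x)
print(np.corrcoef(x, y)[0, 1])
print(np.corrcoef(y, x)[0, 1])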

1) Sum of Squares Total (SST), or Total Sum of Squares (TSS) – the measure of the total variability of the observed values around their mean.

2) Sum of Squares Regression (SSR), or Explained Sum of Squares (ESS) – the total variability of the predicted values around the mean. If SSR equals SST, the model explains all of the variability and is perfect.

3) Sum of Squares Error (SSE), or Residual Sum of Squares (RSS) – the variability left unexplained by the model, i.e. the sum of squared differences between the observed and predicted values.

SST = SSR + SSE. The smaller the SSE, the more of the variability is explained by the regression model.

R Squared (goodness of fit) – To measure how well the regression line fits the data we use a statistic called R squared. It is defined as the ratio of the variability explained by the regression to the total variability of the dataset: R Squared = SSR/SST.
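
A small sketch on made-up data of how these quantities fit together; np.polyfit is used here only to obtain an ordinary least-squares line.
In [ ]:
import numpy as np
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 6.0])
b1, b0 = np.polyfit(x, y, 1)            # slope and intercept of the fitted line
y_pred = b1 * x + b0

sst = np.sum((y - y.mean()) ** 2)       # total variability around the mean
ssr = np.sum((y_pred - y.mean()) ** 2)  # variability explained by the regression
sse = np.sum((y - y_pred) ** 2)         # residual (unexplained) variability

print(sst, ssr, sse)                    # for OLS with an intercept, SST = SSR + SSE
print('R squared:', ssr / sst)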

In [2]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt 
sns.set_style('whitegrid')
%matplotlib inline
from sklearn.datasets import load_boston
In [3]:
boston = load_boston()
print(boston.DESCR)
.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's

    :Missing Attribute Values: None

    :Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.
https://archive.ics.uci.edu/ml/machine-learning-databases/housing/


This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980.   N.B. Various transformations are used in the table on
pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression
problems.   
     
.. topic:: References

   - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
   - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.

In [4]:
plt.hist(boston.target, bins=50)
plt.xlabel('Price of houses in $1000s')
plt.ylabel('Number of houses')
Out[4]:
Text(0, 0.5, 'Number of houses')
In [5]:
#Let us have a scatter plot of a single feature with the target.
#We will plot the price of the house against the number of rooms in the dwelling.
plt.scatter(boston.data[:,5],boston.target)
plt.ylabel('Price in $1000s')
plt.xlabel('Number of rooms')
#We can see a positive correlation between rooms and price of house.
Out[5]:
Text(0.5, 0, 'Number of rooms')
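One way to quantify that visual impression is the Pearson correlation between RM and the price, for example:
In [ ]:
# Correlation between the average number of rooms (column 5, RM) and the target price
print('Correlation between RM and price:', np.corrcoef(boston.data[:, 5], boston.target)[0, 1])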
In [6]:
boston_df = pd.DataFrame(boston.data)
boston_df.columns = boston.feature_names
boston_df.head()
Out[6]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33
In [102]:
boston_df['Price'] = boston.target
boston_df.head()
Out[102]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT Price
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98 24.0
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14 21.6
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03 34.7
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94 33.4
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33 36.2
In [8]:
from IPython.display import Image
url = 'https://upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Linear_least_squares_example2.svg/200px-Linear_least_squares_example2.svg.png'
Image(url)
#A plot of the data points (in red), the least squares line of best fit (in blue), and the residuals (in green)
#The line that minimizes the sum of squared vertical distances (residuals) is the best-fit line.
Out[8]:
In [9]:
X = boston_df.RM
In [93]:
X = np.vstack(boston_df.RM)
X.shape
Out[93]:
(506, 1)
In [94]:
Y = np.array(boston_df.Price)
Y.shape
Out[94]:
(506,)
In [66]:
#An attempt to append a bias (intercept) column of ones to X. Because each value in
#the vstacked X is itself a length-1 array, the result has dtype=object (see below),
#so the line was left commented out.
#X = np.array( [ [ value,1] for value in X])

#X
Out[66]:
array([[array([6.575]), 1],
       [array([6.421]), 1],
       [array([7.185]), 1],
       ...,
       [array([6.976]), 1],
       [array([6.794]), 1],
       [array([6.03]), 1]], dtype=object)
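A minimal sketch of one way the intercept column could be added cleanly; np.hstack with a column of ones avoids the object-dtype problem above, and lstsq then returns both a slope and an intercept.
In [ ]:
# Append a column of ones to the (506, 1) feature matrix so the least-squares
# solution contains [slope, intercept] rather than a single slope through the origin.
X_design = np.hstack([X, np.ones((X.shape[0], 1))])
slope, intercept = np.linalg.lstsq(X_design, Y, rcond=None)[0]
print(slope, intercept)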
In [95]:
result = np.linalg.lstsq(X, Y, rcond=None)  # rcond=None uses the new default and avoids the FutureWarning
error_total = result[1]                     # sum of squared residuals (SSE)
rmse = np.sqrt(error_total/len(X))          # root mean squared error; note X has no intercept column here
rmse
Out[95]:
array([7.64268509])
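To visualise that fit, one could plot the resulting line over the rooms-vs-price scatter; a minimal sketch using the slope returned by lstsq above (since X had no intercept column, the line passes through the origin).
In [ ]:
slope = result[0][0]                      # single coefficient from the no-intercept fit above
plt.scatter(boston_df.RM, boston_df.Price)
xs = np.linspace(boston_df.RM.min(), boston_df.RM.max(), 100)
plt.plot(xs, slope * xs, color='red', linewidth=2)
plt.xlabel('Number of rooms')
plt.ylabel('Price in $1000s')
plt.show()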
In [ ]:
#Now using Scikit Learn
In [119]:
import sklearn
from sklearn.linear_model import LinearRegression
lreg = LinearRegression()
X_multi = boston_df.drop('Price', axis=1)
Y_target = boston_df.Price
lreg.fit(X_multi, Y_target)
#print('Intercept is : {}'.format(lreg.intercept_))
#print('Number of coefficients are : {}'.format(len(lreg.coef_)))
In [108]:
coeff_df = pd.DataFrame(X_multi.columns)   # use the feature columns only (boston_df still contains the target)
coeff_df.columns = ['Features']
coeff_df['Coefficient Estimate'] = pd.Series(lreg.coef_) 
coeff_df
Out[108]:
Features Coefficient Estimate
0 CRIM -0.108011
1 ZN 0.046420
2 INDUS 0.020559
3 CHAS 2.686734
4 NOX -17.766611
5 RM 3.809865
6 AGE 0.000692
7 DIS -1.475567
8 RAD 0.306049
9 TAX -0.012335
10 PTRATIO -0.952747
11 B 0.009312
12 LSTAT -0.524758
In [128]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, boston_df.Price)
lreg.fit(X_train, Y_train)

print(X_train.shape, X_test.shape, Y_train.shape, Y_test.shape)
print('Intercept is : {}'.format(lreg.intercept_))
print('Number of coefficients are : {}'.format(len(lreg.coef_)))
pred_train = lreg.predict(X_train)
pred_test = lreg.predict(X_test)
df_comp = pd.DataFrame({'Actual': Y_test, 'Predicted':pred_test})
#df_comp

plt.scatter(X_test, Y_test,  color='gray')
plt.plot(X_test, pred_test, color='red', linewidth=2)
plt.show()
(379, 1) (127, 1) (379,) (127,)
Intercept is : -37.855302800021974
Number of coefficients are : 1
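To judge how well this single-feature model generalises, one could compare the error on the training and test predictions computed above; a minimal sketch:
In [ ]:
# Mean squared error on training and test data for the single-feature fit
mse_train = np.mean((Y_train - pred_train) ** 2)
mse_test = np.mean((Y_test - pred_test) ** 2)
print('Training MSE:', mse_train)
print('Test MSE:', mse_test)
# R squared on the test set via the estimator's built-in score method
print('Test R squared:', lreg.score(X_test, Y_test))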