Today, we were introduced to linear regression with multiple variables. This technique is essential when we have more than one predictor, especially if the predictors are highly correlated, like the Obesity and Inactivity data.

The multi-variable linear regression model was built using the scikit-learn package (imported as sklearn), which provides many built-in functions for linear regression.

Expression: Y = B0 + B1*X_obesity + B2*X_inactivity + e

Result for 2 variables:

B0: 1.6535991518559392 Coefficients: [('B1', 0.23246991917672563), ('B2', 0.11106296576800405)] R-squared = 0.34073967115731385
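For readers who want to reproduce this kind of fit, here is a minimal sketch of a two-predictor model in scikit-learn. It runs on synthetic stand-in data, not the class dataset, so the printed numbers will not match the result above. The variable names (x_obesity, x_inactivity, y) and the generating coefficients are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (NOT the real dataset); all values are made up.
rng = np.random.default_rng(0)
n = 300
x_inactivity = rng.normal(15.0, 2.0, n)
x_obesity = 0.5 * x_inactivity + rng.normal(11.0, 1.5, n)  # correlated predictors
y = 1.5 + 0.2 * x_obesity + 0.1 * x_inactivity + rng.normal(0.0, 0.5, n)

# One column per predictor, in the same order as the expression (B1, B2).
X = np.column_stack([x_obesity, x_inactivity])

model = LinearRegression().fit(X, y)
b0 = model.intercept_
b1, b2 = model.coef_
r2 = model.score(X, y)  # R-squared on the data used for fitting

print("B0:", b0, "Coefficients:", [("B1", b1), ("B2", b2)])
print("R-squared =", r2)
```

Here model.intercept_ plays the role of B0 and model.coef_ holds B1 and B2 in column order.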

As expected, the R-squared value improves compared to the single-variable model: adding a predictor to a least-squares fit cannot lower R-squared on the data the model was fitted to.
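The comparison between a one-predictor and a two-predictor model can be checked with a short sketch like the one below. Again, the data is synthetic and the names are placeholders, so only the direction of the change is meaningful, not its size.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (NOT the real dataset); all values are made up.
rng = np.random.default_rng(1)
n = 200
x_obesity = rng.normal(18.0, 2.0, n)
x_inactivity = rng.normal(15.0, 2.0, n)
y = 2.0 + 0.25 * x_obesity + 0.1 * x_inactivity + rng.normal(0.0, 1.0, n)

X_one = x_obesity.reshape(-1, 1)                    # single predictor
X_two = np.column_stack([x_obesity, x_inactivity])  # same predictor plus one more

r2_one = LinearRegression().fit(X_one, y).score(X_one, y)
r2_two = LinearRegression().fit(X_two, y).score(X_two, y)

print("R-squared, one predictor: ", r2_one)
print("R-squared, two predictors:", r2_two)
```

Because the one-predictor model is a special case of the two-predictor model (with B2 = 0), the second fit can always do at least as well on the fitting data.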

Now, we introduce one more predictor variable: the product of inactivity and obesity.

Y = B0 + B1*X_inactivity + B2*X_obesity + B3*X_inactivity*X_obesity + e

Result:

B0: -10.06467453049291 Coefficients: [('INACTIVE', 1.1533595376219894), ('OBESE', 0.743039972401428), ('Obese x inactive', -0.049635909945020235)] R-squared = 0.36458725864661756
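An interaction model like this one needs no special machinery: the product is just an extra column in the design matrix. The sketch below shows one way to set that up, using synthetic data and assumed variable names, with labels chosen to mirror the output above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (NOT the real dataset); all values are made up.
rng = np.random.default_rng(2)
n = 300
x_inactivity = rng.normal(15.0, 2.0, n)
x_obesity = 0.5 * x_inactivity + rng.normal(11.0, 1.5, n)
y = (1.0 + 0.3 * x_inactivity + 0.4 * x_obesity
     - 0.01 * x_inactivity * x_obesity + rng.normal(0.0, 0.5, n))

# Third column is the element-wise product of the two predictors.
names = ["INACTIVE", "OBESE", "Obese x inactive"]
X = np.column_stack([x_inactivity, x_obesity, x_inactivity * x_obesity])

model = LinearRegression().fit(X, y)
coefficients = list(zip(names, model.coef_))
r_squared = model.score(X, y)

print("B0:", model.intercept_, "Coefficients:", coefficients)
print("R-squared =", r_squared)
```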

As expected, the performance has increased again, albeit by a very small margin (R-squared rises from about 0.341 to about 0.365).

Now, let's try adding two more predictors: the squares of inactivity and obesity.

Y = B0 + B1*X_inactivity + B2*X_obesity + B3*X_inactivity*X_obesity + B4*X_inactivity^2 + B5*X_obesity^2 + e

Result:

B0: -11.590588857309138 Coefficients: [('INACTIVE', 0.47789170236400547), ('OBESE', 1.489656837325879), ('Obese x inactive', 0.01970764143007776), ('Inactivity square', -0.01973523748870601), ('Obesity square', -0.04942722743255474)] R-squared = 0.38471232946078504
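The five-term model can be set up the same way, by adding the squared columns by hand. This is a sketch on synthetic data with assumed names, so the coefficients it prints are unrelated to the ones above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (NOT the real dataset); all values are made up.
rng = np.random.default_rng(3)
n = 300
x_inactivity = rng.normal(15.0, 2.0, n)
x_obesity = 0.5 * x_inactivity + rng.normal(11.0, 1.5, n)
y = (1.0 + 0.5 * x_inactivity + 0.8 * x_obesity
     - 0.02 * x_obesity ** 2 + rng.normal(0.0, 0.5, n))

# Columns follow the order B1..B5 in the expression.
names = ["INACTIVE", "OBESE", "Obese x inactive",
         "Inactivity square", "Obesity square"]
X = np.column_stack([
    x_inactivity,
    x_obesity,
    x_inactivity * x_obesity,
    x_inactivity ** 2,
    x_obesity ** 2,
])

model = LinearRegression().fit(X, y)
coefficients = list(zip(names, model.coef_))
r_squared = model.score(X, y)

print("B0:", model.intercept_, "Coefficients:", coefficients)
print("R-squared =", r_squared)
```

scikit-learn's PolynomialFeatures transformer can generate the same degree-2 columns automatically, though it emits them in a different order.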

The score has improved yet again.

It seems that adding higher powers of the predictors is an effective way of improving the model's fit. Note that the model is now quadratic in the predictors, although it is still linear in the coefficients B0 to B5, which is why the same least-squares fitting procedure applies. However, repeating this process indefinitely to get a nearly perfect score can lead to overfitting, rendering the model ineffective at predicting new data.
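The overfitting risk can be illustrated with a small experiment: fit polynomials of increasing degree on one half of a dataset and score them on the other half. The sketch below uses a single synthetic predictor and an arbitrary 50/50 split, both of which are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data (NOT the real dataset): a quadratic trend plus noise.
rng = np.random.default_rng(4)
x = rng.uniform(-1.0, 1.0, size=(40, 1))
y = 1.0 + 2.0 * x[:, 0] - 1.5 * x[:, 0] ** 2 + rng.normal(0.0, 0.3, 40)

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.5, random_state=0
)

train_scores, test_scores = [], []
for degree in range(1, 9):
    # Expand x into the columns x, x^2, ..., x^degree.
    poly = PolynomialFeatures(degree=degree, include_bias=False)
    X_train = poly.fit_transform(x_train)
    X_test = poly.transform(x_test)

    model = LinearRegression().fit(X_train, y_train)
    train_scores.append(model.score(X_train, y_train))
    test_scores.append(model.score(X_test, y_test))
    print(degree, train_scores[-1], test_scores[-1])
```

The training score can only go up as the degree grows, while the held-out score carries no such guarantee. A widening gap between the two is the usual sign of overfitting.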

To properly validate the model, we need to test its accuracy on new data that was not used for training. Since we have limited data available, other validation techniques must be explored.
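One commonly used option when data is limited is k-fold cross-validation, where each observation is held out exactly once. The post does not say which technique will be used, so treat the sketch below as one possibility rather than the course's method. The data is synthetic and the choice of 5 folds is an assumption.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (NOT the real dataset); all values are made up.
rng = np.random.default_rng(5)
n = 300
x_inactivity = rng.normal(15.0, 2.0, n)
x_obesity = 0.5 * x_inactivity + rng.normal(11.0, 1.5, n)
y = 1.0 + 0.3 * x_inactivity + 0.4 * x_obesity + rng.normal(0.0, 0.5, n)

# Same five columns as the quadratic model.
X = np.column_stack([
    x_inactivity,
    x_obesity,
    x_inactivity * x_obesity,
    x_inactivity ** 2,
    x_obesity ** 2,
])

# Each fold is held out once; the model is refit on the remaining folds.
folds = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=folds)  # R-squared per fold

print("Per-fold R-squared:", scores)
print("Mean R-squared:", scores.mean())
```

The mean of the per-fold scores estimates how the model performs on data it has not seen, without setting aside a separate test set.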