In the first class I was introduced to the conecpt of linear regression and how to model a simple predictor function using this technique. My first thought was to code a linear regression model for the CDC-diabetes adat set for each of the Predictive factors, i.e, Diabetes vs Obesity and Diabetes vs Inactivity seperately.
This was more challenging than i had expected because of my limited experience in data analysis techniques with python. I spent cosiderable amount of time trying to merge the data and get it in the form that was most suitable to apply the linear regression model.
Once Ithe data was successfully transformed, it was a straightforward task to get the summarry statistcs of each of the predictors seperately.
I was interesting to observe that the relation between Diabetes and Obesity is more heteroskedastic in nature, i.e, the as the obesity % increases, the variance of the data also increases which is rther counter intuitive as you would expect the county with highere obesity% to have more diabetic people, wheares the relation between Diabetes and inactivity is more homoskedastic which stands to reason
Furthermore, there is a significant positive correlation between the predictors – 75% which is also expected as inactivity tends to cause obesity
I built two linear regression models based on each of the predictors independantly
- Diabetes- inactivity R^2 = 0.3216066463149296
- Diabetes – inactive R^2 = 0.148475949010913
As expected, the linear regression model built with inacticity is almost twice as good as the one build with obseity due to the more skewed nature of the obesity data