Week 2 – Sept 22: T-Test

T-Test

Today, I am exploring the T-test and its significance. The T-test is a type of hypothesis test and a very important tool in data science. A hypothesis is any testable assumption about a data set, and hypothesis testing allows us to validate such assumptions.

The T-test is predominantly used to determine whether the difference in means of two datasets has any statistical significance. For the T-test to provide meaningful insights, the datasets have to satisfy the following conditions:

      • The datasets must be normally distributed, i.e., their shape must roughly resemble a bell curve
      • The datasets must be independent and continuous, i.e., the measurement scale for the data should follow a continuous pattern
      • The variance of the data in both sample groups must be similar, i.e., the samples should have nearly equal standard deviations

Hypotheses:

    • H0: There is no significant difference between the means of the datasets
    • H1: There is a significant difference between the means of the datasets

T-Test code
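
The code itself is not reproduced here; below is a minimal sketch of how such a two-sample T-test can be run with scipy.stats. The two synthetic samples are only stand-ins for the actual datasets used in this post.

import numpy as np
from scipy import stats

# Synthetic stand-ins for the two samples compared in this post
rng = np.random.default_rng(0)
group_a = rng.normal(loc=16.0, scale=1.0, size=300)
group_b = rng.normal(loc=18.5, scale=1.2, size=300)

# Independent two-sample t-test assuming equal variances
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=True)
print("T-statistic:", t_stat)
print("P-value:", p_value)
if p_value < 0.05:
    print("Reject the null hypothesis. There is a significant difference between the datasets.")
else:
    print("Fail to reject the null hypothesis.")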

Results:

Reject the null hypothesis. There is a significant difference between the datasets.
T-statistic: -8.586734600367794
P-value: 1.960253729590773e-17

Note:

This T-test does not provide any meaningful insights, as two of the requisite conditions are violated:

    1. The datasets are not normally distributed
    2. The variances of the datasets are not very similar

Week 2 – 20th Sept

Today, I explored validation techniques for smaller data sets, namely K-fold cross-validation.

To start, the linear regression model was retrained using 70% of the data as the training set and 30% as the test set. Here are the results obtained:

As we can see, the model shows similar performance, with R-squared of approximately 0.38.
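
The split and the refit take only a few lines with scikit-learn. Below is a minimal sketch, assuming the merged data sits in a CSV file and uses the column names INACTIVE, OBESE and DIABETIC (the file name and column names are assumptions for illustration, not the exact ones from my notebook).

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("cdc_diabetes.csv")      # hypothetical file name for the merged data
X = df[["INACTIVE", "OBESE"]]             # assumed predictor columns
y = df["DIABETIC"]                        # assumed target column

# 70/30 split, then refit and score on the held-out 30%
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = LinearRegression().fit(X_train, y_train)
print("R-squared on the held-out 30%:", model.score(X_test, y_test))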

Next, the same model was tested again using K-fold cross-validation with 5 folds. Here are the results for the linear and polynomial regression models:

The linear and polynomial models both show similar mean R-squared values of about 0.30, which is lower than the score obtained without cross-validation.
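
Continuing the sketch above, 5-fold cross-validation for both the linear model and a degree-2 polynomial model can be run with cross_val_score; again, this is only an illustrative sketch rather than the exact notebook code.

from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

linear = LinearRegression()
poly = make_pipeline(PolynomialFeatures(degree=2, include_bias=False), LinearRegression())

# Mean R-squared over the 5 folds for each model
print("Linear mean R^2:", cross_val_score(linear, X, y, cv=5, scoring="r2").mean())
print("Polynomial mean R^2:", cross_val_score(poly, X, y, cv=5, scoring="r2").mean())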

If we validate using the test data alone, the polynomial regression score will tend to increase with higher polynomial degrees, since this leads to overfitting.

Week 2 – 18th Sept

Today, we were introduced to the idea of linear regression with multiple variables. This technique is essential when we have more than one predictor, especially if the predictors are highly correlated, like the Obesity and Inactivity data.

The multi-variable linear regression model was built using the scikit-learn (sklearn) package, which provides many built-in functions for linear regression.

Expression: Y = B0 + B1*X_obesity + B2*X_inactivity + e
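
A minimal sketch of fitting this two-variable model with scikit-learn is shown below; the file name and the column names INACTIVE, OBESE and DIABETIC are assumptions for illustration.

import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv("cdc_diabetes.csv")      # hypothetical file name for the merged data
X = df[["INACTIVE", "OBESE"]]             # assumed predictor columns
y = df["DIABETIC"]                        # assumed target column

model = LinearRegression().fit(X, y)
print("B0:", model.intercept_)
print("Coefficients:", list(zip(X.columns, model.coef_)))
print("R-squared:", model.score(X, y))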

Result for two variables:

R^2 = 0.34073967115731385

As expected, there is an improvement in the R-squared value compared to the single-variable model.

Now, we introduce one more predictor variable as the product of inactivity and obesity:

Y = B0 + B1*X_inactivity + B2*X_obesity + B3*X_inactivity*X_obesity + e
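
Continuing the same sketch, the interaction term is simply added as a new derived column before refitting:

df["Obese x inactive"] = df["OBESE"] * df["INACTIVE"]   # interaction predictor
X = df[["INACTIVE", "OBESE", "Obese x inactive"]]

model = LinearRegression().fit(X, y)
print("B0:", model.intercept_)
print("Coefficients:", list(zip(X.columns, model.coef_)))
print("R-squared:", model.score(X, y))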

Result:

B0:  -10.06467453049291
Coefficients: [('INACTIVE', 1.1533595376219894), ('OBESE', 0.743039972401428), ('Obese x inactive', -0.049635909945020235)]
R-squared =  0.36458725864661756

As expected, the performance has improved again, albeit by a very small margin.

 

Now, let's try adding two more predictors – X1^2 and X2^2:

Y = B0 + B1*X_inactivity + B2*X_obesity + B3*X_inactivity*X_obesity + B4*X_inactivity^2 + B5*X_obesity^2 + e
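
In the same sketch, the squared terms are just two more derived columns:

df["Inactivity square"] = df["INACTIVE"] ** 2
df["Obesity square"] = df["OBESE"] ** 2
X = df[["INACTIVE", "OBESE", "Obese x inactive", "Inactivity square", "Obesity square"]]

model = LinearRegression().fit(X, y)
print("B0:", model.intercept_)
print("Coefficients:", list(zip(X.columns, model.coef_)))
print("Score =", model.score(X, y))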

Result:

B0:  -11.590588857309138
Coefficients: [('INACTIVE', 0.47789170236400547), ('OBESE', 1.489656837325879), ('Obese x inactive', 0.01970764143007776), ('Inactivity sqaure', -0.01973523748870601), ('Obesity square', -0.04942722743255474)]
Score =  0.38471232946078504

The score has improved yet again.

It seems that adding higher powers of the predictors is an effective way of improving the accuracy of the model, although it can no longer be considered a linear model; it is now a quadratic model. However, repeating this process indefinitely to chase a nearly perfect score can lead to overfitting, rendering the model ineffective at predicting new data.

To validate properly, we would need to test the model's accuracy on data that was not used for training, but as we have limited data available, other validation techniques must be explored.

 

MTH 522 – Week 1

In the first class, I was introduced to the concept of linear regression and how to model a simple predictor function using this technique. My first thought was to code a linear regression model for the CDC diabetes data set for each of the predictive factors, i.e., Diabetes vs Obesity and Diabetes vs Inactivity, separately.

This was more challenging than I had expected because of my limited experience with data analysis techniques in Python. I spent a considerable amount of time trying to merge the data and get it into the form most suitable for applying the linear regression model.

Once the data was successfully transformed, it was a straightforward task to get the summary statistics for each of the predictors separately.

It was interesting to observe that the relation between Diabetes and Obesity is more heteroskedastic in nature, i.e., as the obesity % increases, the variance of the data also increases. This is rather counter-intuitive, as you would expect counties with a higher obesity % to have more diabetic people. The relation between Diabetes and Inactivity, on the other hand, is more homoskedastic, which stands to reason.

Furthermore, there is a significant positive correlation of about 75% between the predictors, which is also expected, as inactivity tends to cause obesity.
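
Both the summary statistics and the predictor correlation are one-liners with pandas; a short sketch follows, with the file name and column names assumed for illustration.

import pandas as pd

df = pd.read_csv("cdc_diabetes.csv")                      # hypothetical file name for the merged data
print(df[["DIABETIC", "OBESE", "INACTIVE"]].describe())   # summary statistics for each variable
print(df[["OBESE", "INACTIVE"]].corr())                   # predictor correlation (reported above as about 75%)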

I built two linear regression models, one based on each predictor independently (a minimal sketch of one such fit is shown after the list):

  1. Diabetes – Inactivity: R^2 = 0.3216066463149296
  2. Diabetes – Obesity: R^2 = 0.148475949010913
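
A minimal sketch of one such single-predictor fit (Diabetes vs Inactivity) is shown below; the file and column names are assumptions for illustration.

import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv("cdc_diabetes.csv")      # hypothetical file name for the merged data
X = df[["INACTIVE"]]                      # single predictor; use ["OBESE"] for the second model
y = df["DIABETIC"]

model = LinearRegression().fit(X, y)
print("R^2 =", model.score(X, y))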

As expected, the linear regression model built with inactivity performs almost twice as well as the one built with obesity, due to the more heteroskedastic (scattered) nature of the obesity data.