Oct 13 – K-Means clustering for location

K-Means clustering for location

 

K-Means clustering is a powerful tool for location analysis, enabling the grouping of geospatial data points into clusters based on their proximity or similarity. This technique is valuable for segmenting locations, understanding market trends, optimizing resource allocation, and making informed decisions in various fields, from retail and urban planning to healthcare and environmental analysis.

 

First, let us create a scatter plot of the latitude and longitude values to get a feel for how the data points are spread out geographically.
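Something along the following lines can be used for the scatter plot (a minimal sketch; it assumes the data has been loaded into a pandas DataFrame named df with 'longitude' and 'latitude' columns, and the file name shootings.csv is only a placeholder):

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("shootings.csv")                       # placeholder file name
df = df.dropna(subset=["longitude", "latitude"])        # keep only rows with coordinates

plt.scatter(df["longitude"], df["latitude"], s=5, alpha=0.5)
plt.xlabel("Longitude")
plt.ylabel("Latitude")
plt.title("Incident locations")
plt.show()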

 

Now let us run K-Means clustering with k = 20 clusters.
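A minimal scikit-learn sketch of this step (assuming the same DataFrame df as above; k = 20 is chosen to match the 20 cluster centers listed below):

from sklearn.cluster import KMeans

coords = df[["longitude", "latitude"]].to_numpy()

kmeans = KMeans(n_clusters=20, random_state=42, n_init=10)
kmeans.fit(coords)

for i, (lon, lat) in enumerate(kmeans.cluster_centers_, start=1):
    print(f"Cluster {i} Center: Longitude {lon}, Latitude {lat}")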

 

Cluster 1 Center: Longitude -87.77466608391609, Latitude 40.67427972027972
Cluster 2 Center: Longitude -121.12838608695652, Latitude 41.12909739130435
Cluster 3 Center: Longitude -84.15862464985995, Latitude 37.49454341736695
Cluster 4 Center: Longitude -105.33088211382113, Latitude 34.17090650406504
Cluster 5 Center: Longitude -117.77351302083333, Latitude 34.66760546875
Cluster 6 Center: Longitude -82.29514070351759, Latitude 32.234090452261306
Cluster 7 Center: Longitude -95.39037857142857, Latitude 29.66034642857143
Cluster 8 Center: Longitude -74.44290510948905, Latitude 40.786705596107055
Cluster 9 Center: Longitude -76.20114640883978, Latitude 38.916165745856354
Cluster 10 Center: Longitude -106.63796355353075, Latitude 40.951282460136675
Cluster 11 Center: Longitude -149.4990465116279, Latitude 62.47753488372093
Cluster 12 Center: Longitude -81.8052558922559, Latitude 28.721195286195286
Cluster 13 Center: Longitude -156.97926470588237, Latitude 20.828588235294117
Cluster 14 Center: Longitude -94.13923595505618, Latitude 42.08931086142322
Cluster 15 Center: Longitude -122.21451034482759, Latitude 47.371468965517245
Cluster 16 Center: Longitude -97.51444057971014, Latitude 31.395452173913043
Cluster 17 Center: Longitude -88.35704276315789, Latitude 31.789996710526317
Cluster 18 Center: Longitude -81.17313015873016, Latitude 40.23007619047619
Cluster 19 Center: Longitude -94.6525529953917, Latitude 35.52384331797235
Cluster 20 Center: Longitude -112.77275471698113, Latitude 33.27761455525606

 

 

Oct 11 – Police shooting data overview

Police Shooting Data Overview

We have been given the Washington Post's police shooting dataset. Today, we will plot the trends in the significant columns to gain some basic insights into the data.

Columns: ['id', 'name', 'date', 'manner_of_death', 'armed', 'age', 'gender', 'race', 'city', 'state', 'signs_of_mental_illness', 'threat_level', 'flee', 'body_camera', 'longitude', 'latitude', 'is_geocoding_exact']
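A rough sketch of the kind of overview plots used here (assuming the dataset is loaded into a pandas DataFrame df with the columns listed above; the file name is a placeholder):

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("fatal-police-shootings-data.csv")     # placeholder file name

fig, axes = plt.subplots(2, 2, figsize=(12, 8))
df["race"].value_counts().plot(kind="bar", ax=axes[0, 0], title="Race")
df["gender"].value_counts().plot(kind="bar", ax=axes[0, 1], title="Gender")
df["age"].plot(kind="hist", bins=30, ax=axes[1, 0], title="Age distribution")
df["armed"].value_counts().head(10).plot(kind="bar", ax=axes[1, 1], title="Armed (top 10)")
plt.tight_layout()
plt.show()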

 

 

Week 4: 2 Oct – Understanding T-test II

Understanding T-test II

In the last post, we saw what a T-distribution and a normal distribution are. Now, let us look at some key features.

In a normal distribution, data tends to cluster around the mean (68% of the data lies within 1 standard deviation of the mean and 99.7% lies within 3 standard deviations). As one moves farther away from the mean, the frequency of data points falls off rapidly. This implies that the probability of an event occurring is closely tied to its proximity to the mean value. This property is of paramount importance because it underscores the mean's effectiveness as a precise descriptor of the distribution.

Understanding the mean value provides valuable insights into the population and its behavior. This is precisely why normality is crucial for conducting a T-Test. When dealing with a population that does not exhibit a normal distribution, there is no assurance that the population mean carries inherent significance on its own. Consequently, knowledge of the mean may provide little to no meaningful information about the dataset. In such cases, conducting a t-test becomes a futile exercise because determining whether the difference in means is statistically significant offers no meaningful insights when the means themselves lack significance.

 

Central Limit Theorem

The Central Limit Theorem states that as we sample from any population, regardless of the population's distribution, the sample means tend towards a normal distribution as the sample size increases. In other words, given a sufficiently large sample size from any distribution, the sample means will be approximately normally distributed.

The Central Limit Theorem plays a pivotal role in the widespread application of T-tests. As previously discussed, T-tests are most effective when applied to populations that exhibit a normal distribution. However, according to the Central Limit Theorem, for any given population, if we collect a sufficiently large number of random samples from it, the distribution of the sample means tends to follow a normal distribution. This allows us to apply T-tests to the sample means, even when the original population may not be normally distributed.
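A quick simulation illustrates this (a sketch only, using an exponential population as an arbitrary example of a clearly non-normal distribution):

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
population = rng.exponential(scale=2.0, size=100_000)   # heavily skewed population

# draw many samples of size 50 and record each sample mean
sample_means = [rng.choice(population, size=50).mean() for _ in range(5_000)]

plt.hist(sample_means, bins=50)
plt.title("Sample means (n = 50) from an exponential population")
plt.show()

The histogram of the sample means comes out approximately bell-shaped even though the underlying population is far from normal.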

 

 

Week 3: 29 Sept – Understanding T-test I

Understanding T-test

We have already explored the T-test and its role in understanding the statistical significance of a distribution's mean. For a T-test to give a meaningful result, the distributions must satisfy the following conditions:

    • The data sets must be normally distributed, i.e., the shape must resemble a bell curve to an extent
    • The data sets must be independent and continuous, i.e., the measurement scale for the data should follow a continuous pattern
    • The variance of the data in both sample groups is similar, i.e., the samples have almost equal standard deviations

Today, we will dig deeper into these conditions to understand why they are necessary. But first, let us understand what a T-distribution is.

T Distribution

  • The t-distribution, also known as the Student’s t-distribution, is a probability distribution that is similar in shape to the standard normal distribution (bell-shaped curve).
  • The key feature of the t-distribution is that it has heavier tails compared to the normal distribution. The shape of the t-distribution depends on a parameter called degrees of freedom (df).
  • As the sample size increases, the t-distribution approaches the standard normal distribution.
  • In hypothesis testing with the t-test, the t-distribution is used as a reference distribution to determine the critical values for a specified level of significance (alpha) and degrees of freedom.
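The heavier tails and the effect of the degrees of freedom can be visualised with a short sketch (using scipy; the degrees-of-freedom values are arbitrary examples):

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

x = np.linspace(-4, 4, 400)
plt.plot(x, stats.norm.pdf(x), label="Standard normal (z)")
for dof in (2, 5, 30):
    plt.plot(x, stats.t.pdf(x, df=dof), label=f"t-distribution, df = {dof}")
plt.legend()
plt.title("t-distribution vs standard normal")
plt.show()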

 

The normal distribution (z-distribution) is essentially the limiting case of the t-distribution as the degrees of freedom grow large. But what is important for us are certain properties that are common to both yet more prominent in the normal distribution.

 

 

Week 3: Sept 27 – Breusch-Pagan Test

 

Breusch-Pagan Test

In the world of data science and regression analysis, the Breusch-Pagan test is like a detective tool that helps us investigate an important issue called “heteroscedasticity.” Let me break it down for you.

Heteroscedasticity is a fancy term for a situation where things are not as tidy as we'd like in a regression analysis. Specifically, it is when the spread of your residuals changes as you move along the independent variables. High heteroscedasticity means that your prediction errors (the differences between your predictions and the actual values) vary differently across different inputs. Some predictions might be pretty close, while others are way off.

The Breusch-Pagan test is used to detect this variability issue. Here's how it works:

  1. Build your regression model: You start by creating a regression model that tries to predict something, like housing prices.
  2. Calculate residuals: Residuals are the differences between your predictions and the actual prices for each house.
  3. Squared Residuals: You square those residuals. This step emphasizes larger errors more than smaller ones.
  4. Second Regression: Next, you build a new mini-regression model. This time, you use the squared residuals as your “dependent variable” (the thing you’re trying to predict), and the same predictors you used in your original model.
  5. Hypothesis Testing: You perform a hypothesis test to see if your predictors are related to the squared residuals. If they are, it’s a sign that heteroscedasticity might be present.
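Here is a minimal sketch of these steps using statsmodels (the DataFrame df, the target column 'price', and the predictor columns 'X1' and 'X2' are all placeholders for whatever data is being modelled):

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

df = pd.read_csv("housing.csv")                 # placeholder file name
X = sm.add_constant(df[["X1", "X2"]])           # predictors plus intercept
y = df["price"]

model = sm.OLS(y, X).fit()                      # step 1: the original regression
bp_lm, bp_lm_pvalue, bp_f, bp_f_pvalue = het_breuschpagan(model.resid, X)   # steps 2-5

print("Breusch-Pagan LM p-value:", bp_lm_pvalue)
# a small p-value (e.g. < 0.05) suggests heteroscedasticity is present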

If the Breusch-Pagan test suggests heteroscedasticity is happening, it means our original regression model isn’t performing as well as we thought.

So, as data scientists, we would want to investigate further, maybe try different modeling techniques or transform the data to make the errors more consistent across the board. The goal is to have a model that's as accurate as possible for all cases, not just some.

In a nutshell, the Breusch-Pagan test helps us spot when the “scatter” of our errors isn’t the same for all data points, and that’s a signal for us to dig deeper and refine our models.

Week 3 – Sept 25th: Plotting polynomial regression models

Plotting polynomial regression models

Today, we will attempt to plot polynomial regression models of different degrees and compare the regression lines and the R-squared values. We will be using the Inactivity vs Diabetes data for this analysis, as it has the largest number of data points available.

The polynomial regression is performed using the sklearn package, which provides the built-in PolynomialFeatures() class for generating the polynomial feature matrix, with the polynomial degree as a parameter.

 

Then we create the regression model using the LinearRegression() class and fit it to our data. Once the model is fitted, it is a fairly straightforward process to predict values with the model and use these predictions to calculate the R-squared value for each degree.
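The loop looks roughly like this (a sketch; it assumes the inactivity values X, shaped (n_samples, 1), and the diabetes values y have already been loaded as NumPy arrays):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

plt.scatter(X, y, s=5, alpha=0.5)
order = np.argsort(X[:, 0])                     # sort once for smooth regression lines

for degree in (1, 2, 3, 4):
    X_poly = PolynomialFeatures(degree=degree).fit_transform(X)
    model = LinearRegression().fit(X_poly, y)
    y_pred = model.predict(X_poly)

    print(f"Degree {degree}: R-squared = {r2_score(y, y_pred):.4f}")
    plt.plot(X[order, 0], y_pred[order], label=f"degree {degree}")

plt.legend()
plt.show()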

 

Outputs:

 

 

As expected, the R-squared values show a slight progressive improvement with each successive degree, but at the same time it is clear from the plot that the model progressively tends to overfit, making it less effective for prediction.

Week 2 – Sept 22: T-Test

T Test

Today, I'm exploring the T-test and its significance. The T-test is a type of hypothesis test and a very important tool in data science. A hypothesis is any testable assumption about the data set, and hypothesis testing allows us to validate these assumptions.

The T-test is predominantly used to understand whether the difference in means of two datasets has any statistical significance. For a T-test to provide any meaningful insights, the datasets have to satisfy the following conditions:

      • The data sets must be normally distributed, i.e., the shape must resemble a bell curve to an extent
      • The data sets must be independent and continuous, i.e., the measurement scale for the data should follow a continuous pattern
      • The variance of the data in both sample groups is similar, i.e., the samples have almost equal standard deviations

Hypotheses:

    • H0: There is no significant difference between the means of the data sets
    • H1: There is a significant difference between the means of the data sets

T Test code
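A minimal version of the test with scipy (sample_a and sample_b are placeholders for the two datasets being compared):

from scipy import stats

t_stat, p_value = stats.ttest_ind(sample_a, sample_b)

alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis. There is a significant difference between the datasets.")
else:
    print("Fail to reject the null hypothesis.")
print("T-statistic:", t_stat)
print("P-value:", p_value)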

Results:

Reject the null hypothesis. There is a significant difference between the datasets.
T-statistic: -8.586734600367794
P-value: 1.960253729590773e-17

Note:

This T-test does not provide any meaningful insights, as two of the requisite conditions are violated:

    1. The datasets are not normally distributed
    2. The variances of the datasets are not very similar

Week 2 – 20th Sept

Today, I explored validation techniques for smaller data sets, namely K-Fold cross validation.

To start, the linear regression model was re-trained using 70% of the data as a training set and 30% as the test set. Here are the results obtained:

As we can see, the model shows similar performance, with an R-squared of approximately 0.38.

Now the same model was tested again using K-fold cross-validation with 5 folds. Here are the results for the linear and polynomial regression models:
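A sketch of this run with scikit-learn (assuming the feature matrix X and the target y have already been prepared as in the earlier posts):

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

lin_scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
print("Linear model mean R-squared:", lin_scores.mean())

poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_scores = cross_val_score(poly_model, X, y, cv=5, scoring="r2")
print("Polynomial model mean R-squared:", poly_scores.mean())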

The linear and polynomial models both show similar mean r-squared values of 0.30, which is lower than the score obtained without using cross-validation.

The polynomial regression score will tend to increase with higher polynomial degrees if we evaluate on the same data used to fit the model, because higher degrees lead to overfitting.

Week 2 – 18th

Today, we were introduced to the idea of linear regression with multiple variables. This technique is essential when we have more than one predictor, especially if the predictors are highly correlated, like the Obesity and Inactivity data.

The multi-variable linear regression model was built using the sklearn (scikit-learn) package, which provides many built-in functions for linear regression.

Expression: Y = B0 + B1*X_obesity + B2*X_inactivity + e
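A sketch of this model in scikit-learn (the predictor column names 'INACTIVE' and 'OBESE' come from the printed coefficients below; the DataFrame df and the target column name 'DIABETIC' are assumptions):

from sklearn.linear_model import LinearRegression

X = df[["INACTIVE", "OBESE"]]
y = df["DIABETIC"]                              # assumed name of the diabetes column

model = LinearRegression().fit(X, y)
print("B0:", model.intercept_)
print("Coefficients:", list(zip(X.columns, model.coef_)))
print("R-squared:", model.score(X, y))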

Result for 2 variables:

R^2 = 0.34073967115731385

As expected, there is an improvement in the R-squared value as compared to single variable model.

Now, we introduce one more predictor variable as the product of inactivity and obesity.

Y = B0 + B1*X_inactivity + B2*X_obesity + B3*X_inactivity*X_obesity + e
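The interaction term is simply added as an extra column before refitting (same assumed DataFrame df and target y as in the sketch above):

from sklearn.linear_model import LinearRegression

df["Obese x inactive"] = df["OBESE"] * df["INACTIVE"]

X = df[["INACTIVE", "OBESE", "Obese x inactive"]]
model = LinearRegression().fit(X, y)
print("B0:", model.intercept_)
print("Coefficients:", list(zip(X.columns, model.coef_)))
print("R-squared:", model.score(X, y))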

Result:

B0:  -10.06467453049291
Coefficients: [('INACTIVE', 1.1533595376219894), ('OBESE', 0.743039972401428), ('Obese x inactive', -0.049635909945020235)]
R-squared =  0.36458725864661756

As expected, the performance has increased again, albeit by a very small margin.

 

Now, let's try adding two more predictors: X_inactivity^2 and X_obesity^2.

Y = B0 + B1*X_inactivity + B2*X_obesity + B3*X_inactivity*X_obesity + B4*X_inactivity^2 + B5*X_obesity^2 + e

Result:

B0:  -11.590588857309138
Coefficients: [('INACTIVE', 0.47789170236400547), ('OBESE', 1.489656837325879), ('Obese x inactive', 0.01970764143007776), ('Inactivity sqaure', -0.01973523748870601), ('Obesity square', -0.04942722743255474)]
Score =  0.38471232946078504

The score has improved yet again.

It seems that adding higher powers of the predictors is an effective way of improving the accuracy of the model, although it can no longer be considered a purely linear model in the predictors; it is now a quadratic model. But pushing this process indefinitely to chase a near-perfect score can lead to overfitting, rendering the model ineffective at predicting new data.

To properly validate, we need to test the model's accuracy on data that was not used for training, but as we have limited data available, other validation techniques must be explored.

 

MTH 522 – Week 1

In the first class, I was introduced to the concept of linear regression and how to model a simple predictor function using this technique. My first thought was to code a linear regression model for the CDC diabetes data set for each of the predictive factors, i.e., Diabetes vs Obesity and Diabetes vs Inactivity separately.

This was more challenging than I had expected because of my limited experience with data analysis techniques in Python. I spent a considerable amount of time trying to merge the data and get it into the form that was most suitable for applying the linear regression model.

Once the data was successfully transformed, it was a straightforward task to get the summary statistics of each of the predictors separately.

It was interesting to observe that the relation between Diabetes and Obesity is more heteroskedastic in nature, i.e., as the obesity % increases, the variance of the data also increases, which is rather counter-intuitive, as you would expect counties with higher obesity % to have more diabetic people. In contrast, the relation between Diabetes and Inactivity is more homoskedastic, which stands to reason.

Furthermore, there is a significant positive correlation between the predictors (about 75%), which is also expected, as inactivity tends to cause obesity.

I built two linear regression models based on each of the predictors independently:

  1. Diabetes vs Inactivity: R^2 = 0.3216066463149296
  2. Diabetes vs Obesity: R^2 = 0.148475949010913

As expected, the linear regression model built with Inactivity is almost twice as good as the one built with Obesity, due to the more heteroskedastic nature of the obesity data.
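For reference, a sketch of how the two single-predictor models were fitted (assuming a merged DataFrame df with columns 'INACTIVE' and 'OBESE', and an assumed diabetes column named 'DIABETIC'):

from sklearn.linear_model import LinearRegression

for predictor in ("INACTIVE", "OBESE"):
    X = df[[predictor]]
    y = df["DIABETIC"]
    model = LinearRegression().fit(X, y)
    print(f"Diabetes vs {predictor}: R^2 = {model.score(X, y)}")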