Project 2: Washington Post Police Shooting Data
Project 1 updated
Oct 13 – K-means clustering for location
K-means Clustering for Location
K-means clustering is a powerful tool for location analysis, enabling the grouping of geospatial data points into clusters based on their proximity or similarity. This technique is valuable for segmenting locations, understanding market trends, optimizing resource allocation, and making informed decisions in fields ranging from retail and urban planning to healthcare and environmental analysis.
First, let us make a scatter plot of the latitude and longitude data to get an idea of its distribution.
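A minimal sketch of such a scatter plot, assuming the coordinates live in a pandas DataFrame with `longitude` and `latitude` columns (a few synthetic points stand in for the real dataset here):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; swap for an interactive one when exploring
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for the shooting-location data
df = pd.DataFrame({
    "longitude": [-87.77, -121.13, -84.16, -74.44, -95.39],
    "latitude":  [40.67, 41.13, 37.49, 40.79, 29.66],
})

fig, ax = plt.subplots()
ax.scatter(df["longitude"], df["latitude"], s=10, alpha=0.5)
ax.set_xlabel("Longitude")
ax.set_ylabel("Latitude")
ax.set_title("Shooting locations")
fig.savefig("locations.png")
```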
Now let us run K-means clustering with 20 clusters.
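A sketch of the clustering step with scikit-learn, again using synthetic coordinates in place of the real dataset; `n_clusters` is set to match the 20 centers reported below:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic (longitude, latitude) pairs standing in for the real coordinates.
# Note: plain K-means uses Euclidean distance on raw degrees, which ignores
# the Earth's curvature -- a reasonable approximation at this scale.
coords = rng.uniform(low=[-125.0, 25.0], high=[-67.0, 49.0], size=(500, 2))

kmeans = KMeans(n_clusters=20, n_init=10, random_state=0).fit(coords)
for i, (lon, lat) in enumerate(kmeans.cluster_centers_, start=1):
    print(f"Cluster {i} Center: Longitude {lon:.4f}, Latitude {lat:.4f}")
```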
Cluster 1 Center: Longitude 87.7747 W, Latitude 40.6743 N
Cluster 2 Center: Longitude 121.1284 W, Latitude 41.1291 N
Cluster 3 Center: Longitude 84.1586 W, Latitude 37.4945 N
Cluster 4 Center: Longitude 105.3309 W, Latitude 34.1709 N
Cluster 5 Center: Longitude 117.7735 W, Latitude 34.6676 N
Cluster 6 Center: Longitude 82.2951 W, Latitude 32.2341 N
Cluster 7 Center: Longitude 95.3904 W, Latitude 29.6603 N
Cluster 8 Center: Longitude 74.4429 W, Latitude 40.7867 N
Cluster 9 Center: Longitude 76.2011 W, Latitude 38.9162 N
Cluster 10 Center: Longitude 106.6380 W, Latitude 40.9513 N
Cluster 11 Center: Longitude 149.4990 W, Latitude 62.4775 N
Cluster 12 Center: Longitude 81.8053 W, Latitude 28.7212 N
Cluster 13 Center: Longitude 156.9793 W, Latitude 20.8286 N
Cluster 14 Center: Longitude 94.1392 W, Latitude 42.0893 N
Cluster 15 Center: Longitude 122.2145 W, Latitude 47.3715 N
Cluster 16 Center: Longitude 97.5144 W, Latitude 31.3955 N
Cluster 17 Center: Longitude 88.3570 W, Latitude 31.7900 N
Cluster 18 Center: Longitude 81.1731 W, Latitude 40.2301 N
Cluster 19 Center: Longitude 94.6526 W, Latitude 35.5238 N
Cluster 20 Center: Longitude 112.7728 W, Latitude 33.2776 N
Oct 11 – Police shooting data overview
Police Shooting Data Overview
We have been given the Washington Post's report on police shooting data. Today, we will plot the trends in the significant columns to gain some basic insights into the data.
Columns: ['id', 'name', 'date', 'manner_of_death', 'armed', 'age', 'gender', 'race', 'city', 'state', 'signs_of_mental_illness', 'threat_level', 'flee', 'body_camera', 'longitude', 'latitude', 'is_geocoding_exact']
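A sketch of how such an overview might start with pandas. In practice the data would be loaded from the Washington Post's CSV file; a tiny inline frame with a few of the columns above stands in here so the snippet runs on its own:

```python
import pandas as pd

# In practice: df = pd.read_csv("fatal-police-shootings-data.csv")  # assumed filename
df = pd.DataFrame({
    "age":    [25, 34, 41, 29],
    "gender": ["M", "M", "F", "M"],
    "race":   ["W", "B", "H", "W"],
    "armed":  ["gun", "knife", "unarmed", "gun"],
})

print(df.columns.tolist())
print(df["race"].value_counts())  # distribution of a categorical column
print(df["age"].describe())       # summary statistics for a numeric column
```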
Project Report I
Attaching the punchline report for Project 1 – Predicting Diabetes Prevalence from Obesity and Inactivity: An Analysis of Health Disparities
MTH522_Project_1
Week 4: 2 Oct – Understanding T-test II
Understanding T-test II
In the last post, we saw what the t-distribution and the normal distribution are. Now, let us look at some of their key features.
In a normal distribution, data tends to cluster around the mean (68% of the data lies within 1 standard deviation of the mean, and 99.7% lies within 3 standard deviations). As one moves farther from the mean, the frequency of data points decreases rapidly. This implies that the probability of an event occurring is closely tied to its proximity to the mean, which is of paramount importance because it underscores the mean's effectiveness as a precise descriptor of the distribution.
Understanding the mean therefore provides valuable insight into the population and its behavior. This is precisely why normality is crucial for conducting a t-test. When a population does not exhibit a normal distribution, there is no assurance that its mean carries inherent significance on its own, so knowledge of the mean may provide little meaningful information about the dataset. In such cases, conducting a t-test becomes a futile exercise: determining whether the difference in means is statistically significant offers no insight when the means themselves lack significance.
Central Limit Theorem
The central limit theorem states that as we sample data from any population, regardless of the population's distribution, the distribution of sample means tends toward a normal distribution as the sample size increases. In other words, given a sufficiently large sample size from any distribution, the sample means will be approximately normally distributed.
The central limit theorem plays a pivotal role in the widespread application of t-tests. As previously discussed, t-tests are most effective when applied to populations that exhibit a normal distribution. However, according to the central limit theorem, if we collect a sufficiently large number of random samples from any population, the distribution of the sample means tends to follow a normal distribution. This allows us to apply t-tests to the derived sample population even when the original population is not normally distributed.
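The theorem is easy to see numerically. The sketch below draws many samples from a heavily skewed exponential distribution and checks that the distribution of the sample means is far more symmetric than the population itself:

```python
import numpy as np

rng = np.random.default_rng(42)

# A clearly non-normal population: exponential with mean 1 (skewness = 2)
population = rng.exponential(scale=1.0, size=100_000)

# Draw 2,000 samples of size 50 and record each sample's mean
sample_means = rng.exponential(scale=1.0, size=(2000, 50)).mean(axis=1)

def skewness(x):
    """Sample skewness: third central moment over cubed standard deviation."""
    x = np.asarray(x)
    return np.mean((x - x.mean()) ** 3) / x.std() ** 3

print(f"population skewness:  {skewness(population):.2f}")    # close to 2
print(f"sample-mean skewness: {skewness(sample_means):.2f}")  # much closer to 0
```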
Week 3: 29 Sept – Understanding T-test I
Understanding T-test I
We have already explored the t-test and its role in understanding the statistical significance of a distribution's mean. For a t-test to yield a meaningful result, the distributions must satisfy the following conditions:

 The data sets must be normally distributed, i.e., their shape must resemble a bell curve to an extent
 The data sets must be independent and continuous, i.e., the measurement scale for the data should follow a continuous pattern
 The variance of the data in both sample groups must be similar, i.e., the samples must have almost equal standard deviations
Today, we will dive deeper into these conditions to understand why each is necessary. But first, let us understand what a t-distribution is.
T-Distribution
 The t-distribution, also known as Student's t-distribution, is a probability distribution that is similar in shape to the standard normal distribution (a bell-shaped curve).
 The key feature of the t-distribution is that it has heavier tails than the normal distribution. Its exact shape depends on a parameter called the degrees of freedom (df).
 As the sample size increases, the t-distribution approaches the standard normal distribution.
 In hypothesis testing with the t-test, the t-distribution is used as the reference distribution to determine the critical values for a specified significance level (alpha) and degrees of freedom.
The normal distribution (z-distribution) is essentially the limiting case of the t-distribution as the degrees of freedom grow large. What is important for us are certain properties that are common to both but are more pronounced in the normal distribution.
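The heavier tails are easy to verify with scipy: at a point well into the tail, the t-density exceeds the normal density, and the gap closes as the degrees of freedom grow (the df values below are illustrative choices):

```python
from scipy.stats import t, norm

# Compare the densities at x = 3 (well into the tail) for several df values
for df_ in (2, 5, 30, 1000):
    print(f"df={df_:>4}: t.pdf(3) = {t.pdf(3, df=df_):.5f}  vs  norm.pdf(3) = {norm.pdf(3):.5f}")
# The t tail density shrinks toward the normal's as df increases.
```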
Week 3: Sept 27 – Breusch-Pagan Test
Breusch-Pagan Test
In the world of data science and regression analysis, the Breusch-Pagan test is like a detective tool that helps us investigate an important issue called “heteroscedasticity.” Let me break it down for you.
Heteroscedasticity is a fancy term for a situation where things are not as tidy as we’d like in a regression analysis. Specifically, it’s when the spread of your residuals changes as you move along the independent variables. High heteroscedasticity means that your prediction errors (the differences between your predictions and the actual values) vary differently across different inputs: some predictions might be pretty close, while others are way off.
The Breusch-Pagan test is used for detecting this variability issue. Here’s how it works:
 Build your regression model: You start by creating a regression model that tries to predict something, like housing prices.
 Calculate residuals: Residuals are the differences between your predictions and the actual prices for each house.
 Squared Residuals: You square those residuals. This step emphasizes larger errors more than smaller ones.
 Second Regression: Next, you build a new mini regression model. This time, you use the squared residuals as your “dependent variable” (the thing you’re trying to predict) and the same predictors you used in your original model.
 Hypothesis Testing: You perform a hypothesis test to see if your predictors are related to the squared residuals. If they are, it’s a sign that heteroscedasticity might be present.
If the Breusch-Pagan test suggests heteroscedasticity is present, it means our original regression model isn’t performing as well as we thought.
So, as data scientists, we would want to investigate further, perhaps trying different modeling techniques or transforming the data to make the errors more consistent across the board. The goal is a model that is as accurate as possible for all cases, not just some.
In a nutshell, the Breusch-Pagan test helps us spot when the “scatter” of our errors isn’t the same for all data points, and that’s a signal for us to dig deeper and refine our models.
Week 3 – Sept 25th: Plotting polynomial regression models
Plotting polynomial regression models
Today, we will plot polynomial regression models of different degrees and compare the regression lines and the R-squared values. We will use the inactivity vs. diabetes data for this analysis, as it has the maximum number of data points available.
The polynomial regression is performed using the sklearn package, which provides the built-in PolynomialFeatures() class for generating the polynomial feature matrix, with the degree as a parameter.
Then we create the regression model using the LinearRegression() class and fit it to our data. Once the model is fitted, it is fairly straightforward to predict values and use these predictions to calculate the R-squared value for each degree.
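A sketch of that workflow, looping over degrees on synthetic data in place of the inactivity-vs-diabetes columns:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = rng.uniform(10, 25, 200).reshape(-1, 1)  # stand-in for % inactivity
y = 0.5 * x.ravel() + rng.normal(0, 2, 200)  # stand-in for % diabetes

r2 = {}
for degree in (1, 2, 3, 4):
    features = PolynomialFeatures(degree=degree).fit_transform(x)
    model = LinearRegression().fit(features, y)
    r2[degree] = r2_score(y, model.predict(features))
    print(f"degree {degree}: R^2 = {r2[degree]:.4f}")
# On the training data, R^2 can only go up with degree -- the overfitting trap.
```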
Outputs:
As expected, the R-squared values show slight progressive improvement with each successive degree, but at the same time it is clear from the plot that the model progressively overfits, making it less effective for prediction.
Week 2 – Sept 22: T-test
T-test
Today, I’m exploring the t-test and its significance. The t-test is a type of hypothesis test and a very important tool in data science. A hypothesis is any testable assumption about a data set, and hypothesis testing allows us to validate these assumptions.
The t-test is predominantly used to determine whether the difference in the means of two datasets has any statistical significance. For a t-test to provide meaningful insights, the datasets have to satisfy the following conditions:


 The data sets must be normally distributed, i.e., their shape must resemble a bell curve to an extent
 The data sets must be independent and continuous, i.e., the measurement scale for the data should follow a continuous pattern
 The variance of the data in both sample groups must be similar, i.e., the samples must have almost equal standard deviations

Hypotheses:

 H0: There is no significant difference between the means of the data sets
 H1: There is a significant difference between the means of the data sets
T Test code
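A sketch of the test with scipy (hypothetical arrays stand in for the two data columns actually used):

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
group_a = rng.normal(loc=16.0, scale=1.0, size=300)  # e.g. % inactivity
group_b = rng.normal(loc=18.5, scale=1.5, size=300)  # e.g. % obesity

t_stat, p_value = ttest_ind(group_a, group_b)
alpha = 0.05
if p_value < alpha:
    print(f"Reject the null hypothesis. T-statistic: {t_stat}, p-value: {p_value}")
else:
    print(f"Fail to reject the null hypothesis. T-statistic: {t_stat}, p-value: {p_value}")
```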
Results:
Reject the null hypothesis. There is a significant difference between the datasets. T-statistic: 8.5867, p-value: 1.9603e-17
Note:
This t-test does not provide any meaningful insights, as two of the requisite conditions are violated:

 The datasets are not normally distributed
 The variances of the datasets are not similar
Week 2 – 20th Sept
Today, I explored validation techniques for smaller data sets, namely K-fold cross-validation.
To start, the linear regression model was retrained using 70% of the data as the training set and 30% as the test set. Here are the results obtained:
As we can see, the model shows similar performance, with R-squared ≈ 0.38.
Now the same model was tested again using K-fold cross-validation with 5 folds. Here are the results for the linear and polynomial regression models:
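A sketch of the 5-fold validation with scikit-learn (synthetic predictor and response arrays stand in for the course data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(10, 25, (200, 1))            # stand-in predictor
y = 0.5 * X.ravel() + rng.normal(0, 2, 200)  # stand-in response

# Each of the 5 folds is held out once while the model trains on the rest
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
print("fold R^2 scores:", np.round(scores, 3))
print("mean R^2:", round(float(scores.mean()), 3))
```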
The linear and polynomial models both show similar mean R-squared values of about 0.30, which is lower than the score obtained without cross-validation.
The polynomial regression score tends to increase with higher polynomial degrees if we validate on the same data the model was fitted to, as this rewards overfitting.
Week 2 – 18th
Today, we were introduced to linear regression with multiple variables. This technique is essential when we have more than one predictor, especially if the predictors are highly correlated, like the obesity and inactivity data.
The multivariable linear regression model was built using the scikit-learn package, which provides many built-in functions for linear regression.
Expression: Y = B0 + B1*X_obesity + B2*X_inactivity + e
Result for 2 variables:
B0: 1.6535991518559392, Coefficients: [('B1', 0.23246991917672563), ('B2', 0.11106296576800405)], R^2 = 0.34073967115731385
As expected, there is an improvement in the R-squared value compared to the single-variable model.
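The two-variable fit above can be sketched as follows, with synthetic obesity/inactivity columns standing in for the CDC data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 300
obesity = rng.uniform(25, 45, n)
inactivity = 0.6 * obesity + rng.normal(0, 3, n)  # correlated predictors, as in the data
diabetes = 2 + 0.23 * obesity + 0.11 * inactivity + rng.normal(0, 1, n)

X = np.column_stack([obesity, inactivity])
model = LinearRegression().fit(X, diabetes)

print("B0:", model.intercept_)
print("Coefficients:", list(zip(["B1", "B2"], model.coef_)))
print("R^2 =", model.score(X, diabetes))
```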
Now, we introduce one more predictor variable: the product of inactivity and obesity.
Y = B0 + B1*X_inactivity + B2*X_obesity + B3*X_inactivity*X_obesity + e
Result:
B0: 10.06467453049291, Coefficients: [('INACTIVE', 1.1533595376219894), ('OBESE', 0.743039972401428), ('Obese x inactive', 0.049635909945020235)], R-squared = 0.36458725864661756
As expected, the performance has improved again, albeit by a very small margin.
Now, let’s try adding two more predictors: X_inactivity^2 and X_obesity^2.
Y = B0 + B1*X_inactivity + B2*X_obesity + B3*X_inactivity*X_obesity + B4*X_inactivity^2 + B5*X_obesity^2 + e
Result:
B0: 11.590588857309138, Coefficients: [('INACTIVE', 0.47789170236400547), ('OBESE', 1.489656837325879), ('Obese x inactive', 0.01970764143007776), ('Inactivity square', 0.01973523748870601), ('Obesity square', 0.04942722743255474)], Score = 0.38471232946078504
The score has improved yet again.
It seems this process of adding higher powers of the predictors is an effective way of improving the accuracy of the model, although it can no longer be considered a linear model: it is now a quadratic model. But repeating this process indefinitely to chase a nearly perfect score leads to overfitting, rendering the model ineffective at predicting new data.
To properly validate, we would need to test the model’s accuracy on data it was not trained on, but as we have limited data available, other validation techniques must be explored.
MTH 522 – Week 1
In the first class, I was introduced to the concept of linear regression and how to model a simple predictor function using this technique. My first thought was to code a linear regression model for the CDC diabetes data set for each of the predictive factors, i.e., diabetes vs. obesity and diabetes vs. inactivity separately.
This was more challenging than I had expected because of my limited experience with data analysis techniques in Python. I spent a considerable amount of time trying to merge the data and get it into the form most suitable for applying the linear regression model.
Once the data was successfully transformed, it was a straightforward task to get the summary statistics of each of the predictors separately.
It was interesting to observe that the relation between diabetes and obesity is more heteroskedastic in nature, i.e., as the obesity % increases, the variance of the data also increases, which is rather counterintuitive, as you would expect counties with a higher obesity % to have more diabetic people. The relation between diabetes and inactivity, in contrast, is more homoskedastic, which stands to reason.
Furthermore, there is a significant positive correlation between the predictors (about 75%), which is also expected, as inactivity tends to cause obesity.
I built two linear regression models, one for each of the predictors independently:
 Diabetes – inactivity: R^2 = 0.3216066463149296
 Diabetes – obesity: R^2 = 0.148475949010913
As expected, the linear regression model built with inactivity is almost twice as good as the one built with obesity, owing to the more skewed nature of the obesity data.