Github: https://github.com/Tiyasa-Saha/MTH522-Project-2/blob/main/Project%202.ipynb
MTH522_Project2
K-Means clustering for location
K-Means clustering is a powerful tool for location analysis, enabling the grouping of geospatial data points into clusters based on their proximity or similarity. This technique is valuable for segmenting locations, understanding market trends, optimizing resource allocation, and making informed decisions in various fields, from retail and urban planning to healthcare and environmental analysis.
First, let us make a scatter plot of the latitude and longitude data to get an idea of how the points are distributed.
Now let us do the K-means clustering with a cluster size of 100
Cluster 1 Center: Longitude -87.77466608391609, Latitude 40.67427972027972
Cluster 2 Center: Longitude -121.12838608695652, Latitude 41.12909739130435
Cluster 3 Center: Longitude -84.15862464985995, Latitude 37.49454341736695
Cluster 4 Center: Longitude -105.33088211382113, Latitude 34.17090650406504
Cluster 5 Center: Longitude -117.77351302083333, Latitude 34.66760546875
Cluster 6 Center: Longitude -82.29514070351759, Latitude 32.234090452261306
Cluster 7 Center: Longitude -95.39037857142857, Latitude 29.66034642857143
Cluster 8 Center: Longitude -74.44290510948905, Latitude 40.786705596107055
Cluster 9 Center: Longitude -76.20114640883978, Latitude 38.916165745856354
Cluster 10 Center: Longitude -106.63796355353075, Latitude 40.951282460136675
Cluster 11 Center: Longitude -149.4990465116279, Latitude 62.47753488372093
Cluster 12 Center: Longitude -81.8052558922559, Latitude 28.721195286195286
Cluster 13 Center: Longitude -156.97926470588237, Latitude 20.828588235294117
Cluster 14 Center: Longitude -94.13923595505618, Latitude 42.08931086142322
Cluster 15 Center: Longitude -122.21451034482759, Latitude 47.371468965517245
Cluster 16 Center: Longitude -97.51444057971014, Latitude 31.395452173913043
Cluster 17 Center: Longitude -88.35704276315789, Latitude 31.789996710526317
Cluster 18 Center: Longitude -81.17313015873016, Latitude 40.23007619047619
Cluster 19 Center: Longitude -94.6525529953917, Latitude 35.52384331797235
Cluster 20 Center: Longitude -112.77275471698113, Latitude 33.27761455525606
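For reference, here is a minimal sketch of how cluster centers like these can be computed with scikit-learn's KMeans. It is not the notebook code: the coordinates are synthetic, the column order (longitude first, then latitude) is an assumption, and the number of clusters is set to 3 only because the made-up data has three groups.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the (longitude, latitude) columns; not the real data.
rng = np.random.default_rng(0)
blob_centers = np.array([[-120.0, 38.0], [-95.0, 30.0], [-75.0, 41.0]])
coords = np.vstack([c + rng.normal(scale=1.0, size=(200, 2)) for c in blob_centers])

# n_clusters is the number of groups to find; 3 matches the synthetic blobs.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(coords)

# Each row of cluster_centers_ follows the input column order (longitude, latitude).
for i, (lon, lat) in enumerate(kmeans.cluster_centers_, start=1):
    print(f"Cluster {i} Center: Longitude {lon:.3f}, Latitude {lat:.3f}")
```

Because K-means uses plain Euclidean distance, treating degrees of longitude and latitude as equal units is itself a simplification that gets rougher at high latitudes.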
Police Shooting Data Overview
We have been given the Washington Post's data on police shootings. Today, we will plot the trends in the significant columns to gain some basic insights into the data.
Columns: [‘id’, ‘name’, ‘date’, ‘manner_of_death’, ‘armed’, ‘age’, ‘gender’,
‘race’, ‘city’, ‘state’, ‘signs_of_mental_illness’, ‘threat_level’,
‘flee’, ‘body_camera’, ‘longitude’, ‘latitude’, ‘is_geocoding_exact’]
Attaching the punchline report for Project 1, “Predicting Diabetes Prevalence from Obesity and Inactivity: An Analysis of Health Disparities”
MTH522_Project_1
Understanding T-test II
In the last post, we saw what the T distribution and the Normal distribution are. Now, let us look at some key features.
In a normal distribution, data tends to cluster around the mean (about 68% of the data lies within 1 standard deviation of the mean and about 99.7% lies within 3 standard deviations). As one moves farther away from the mean, the frequency of data points drops off rapidly. This phenomenon implies that the probability of an event occurring is closely tied to its proximity to the mean value. This correlation is of paramount importance because it underscores the mean’s effectiveness as a precise descriptor of the distribution.
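These coverage figures can be checked directly from the normal cumulative distribution function. The short sketch below uses only Python's standard library; the helper name share_within is made up for illustration.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

def share_within(k):
    """Probability that a normal value lies within k standard deviations of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} sd: {share_within(k):.4f}")
```

The three values come out at roughly 68%, 95% and 99.7%.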
Understanding the mean value provides valuable insights into the population and its behavior. This is precisely why normality is crucial for conducting a T-Test. When dealing with a population that does not exhibit a normal distribution, there is no assurance that the population mean carries inherent significance on its own. Consequently, knowledge of the mean may provide little to no meaningful information about the dataset. In such cases, conducting a t-test becomes a futile exercise because determining whether the difference in means is statistically significant offers no meaningful insights when the means themselves lack significance.
Central Limit Theorem
The central limit theorem states that as we sample data from any population, regardless of the population distribution, the sample means tend towards a normal distribution as the sample size increases, i.e., given a sufficiently large sample size from any distribution, the sample means will be approximately normally distributed.
The Central Limit Theorem plays a pivotal role in the widespread application of T-tests. As previously discussed, T-tests are most effective when applied to populations that exhibit a normal distribution. However, according to the Central Limit Theorem, for any given population, if we collect a sufficiently large number of random samples from it, the distribution of the sample means tends to follow a normal distribution. This phenomenon allows us to apply T-tests to the derived sample population, even when the original population may not be normally distributed.
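A small simulation makes this concrete. The sketch below draws repeated samples from a strongly skewed exponential population (an arbitrary choice for illustration) and measures how skewed the resulting sample means are. The function names are made up, and skewness is used here only as a rough indicator of how far the distribution is from normal.

```python
import random
import statistics

random.seed(0)

def sample_means(sample_size, n_samples=2000):
    """Means of repeated samples drawn from a skewed exponential population (mean 1)."""
    return [
        statistics.fmean(random.expovariate(1.0) for _ in range(sample_size))
        for _ in range(n_samples)
    ]

def skewness(values):
    """Third standardized moment; 0 for a perfectly symmetric distribution."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return statistics.fmean(((v - m) / s) ** 3 for v in values)

for n in (2, 30, 200):
    means = sample_means(n)
    print(n, round(statistics.fmean(means), 3), round(skewness(means), 3))
```

As the sample size grows, the average of the sample means stays near the population mean of 1 while the skewness shrinks towards 0, which is what the theorem predicts.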
Understanding T-test
We have already explored the T-test and its role in understanding the statistical significance of a distribution’s mean. For a t-test to have a meaningful result, the distributions must satisfy the following conditions:
Today, we will take a deep dive into these conditions to understand why they are necessary. But first, let us understand what a T-distribution is.
T Distribution
The normal distribution (z-distribution) is essentially a special case of the t-distribution. What is important for us, however, are certain properties that are common to both but are more prominent in the normal distribution.
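One way to see the relationship between the two distributions is to compare their critical values. The sketch below assumes SciPy is available and looks at the 97.5th percentile (the cutoff for a two-sided test at the 5% level) for a few arbitrarily chosen degrees of freedom.

```python
from scipy import stats

# 97.5th percentile of the standard normal distribution.
z_critical = stats.norm.ppf(0.975)

for df in (2, 5, 30, 1000):
    # Same percentile for a t-distribution with df degrees of freedom.
    t_critical = stats.t.ppf(0.975, df)
    print(f"df={df}: t critical = {t_critical:.4f}, normal critical = {z_critical:.4f}")
```

The t cutoff is larger for small degrees of freedom (heavier tails) and approaches the normal cutoff as the degrees of freedom grow.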
Breusch-Pagan Test
In the world of data science and regression analysis, the Breusch-Pagan test is like a detective tool that helps us investigate an important issue called “heteroscedasticity.” Let me break it down for you.
Heteroscedasticity is a fancy term for a situation where things are not as tidy as we’d like in a regression analysis. Specifically, it’s when the spread of your residuals changes as you move along the independent variables. High heteroscedasticity would mean that the size of your prediction errors (the differences between your predictions and the actual values) varies across different inputs. Some predictions might be pretty close, while others are way off.
The Breusch-Pagan test is used for detecting this variability issue. Here’s how it works:
If the Breusch-Pagan test suggests heteroscedasticity is happening, it means our original regression model isn’t performing as well as we thought.
So, as data scientists, we would want to investigate further, maybe try different modeling techniques or transform our data to make the errors more consistent across the board. The goal is to have a model that’s as accurate as possible for all cases, not just some.
In a nutshell, the Breusch-Pagan test helps us spot when the “scatter” of our errors isn’t the same for all data points, and that’s a signal for us to dig deeper and refine our models.
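To make the idea concrete, here is a rough sketch of one common form of the test (the studentized, or Koenker, variant) for a regression with a single predictor. It is written with NumPy only; the function name and the synthetic data are made up for illustration. The shortcut used for the p-value is valid only for one degree of freedom, i.e. one predictor.

```python
import math
import numpy as np

def breusch_pagan_one_predictor(x, y):
    """Return (LM statistic, p-value) for a simple regression of y on x."""
    X = np.column_stack([np.ones_like(x), x])
    # Step 1: fit the original regression and take its squared residuals.
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid_sq = (y - X @ beta) ** 2
    # Step 2: auxiliary regression of the squared residuals on the same predictor.
    gamma, *_ = np.linalg.lstsq(X, resid_sq, rcond=None)
    fitted = X @ gamma
    r2_aux = 1.0 - np.sum((resid_sq - fitted) ** 2) / np.sum((resid_sq - resid_sq.mean()) ** 2)
    # Step 3: LM = n * R^2, compared with a chi-square with 1 degree of freedom.
    lm = max(0.0, len(y) * r2_aux)
    p_value = math.erfc(math.sqrt(lm / 2.0))  # chi-square(1) upper tail
    return lm, p_value

rng = np.random.default_rng(1)
x = rng.uniform(1.0, 10.0, size=500)
y_constant = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=500)     # constant error spread
y_growing = 2.0 + 0.5 * x + rng.normal(scale=0.3 * x, size=500)  # spread grows with x

print(breusch_pagan_one_predictor(x, y_constant))
print(breusch_pagan_one_predictor(x, y_growing))
```

A small p-value means the squared residuals are systematically related to the predictor, which is the signature of heteroscedasticity. For real work, a library routine such as het_breuschpagan in statsmodels is probably the safer choice.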
Plotting polynomial regression models
Today, we will attempt to plot polynomial regression models of different degrees and compare the regression lines and the R-squared values. We will be using the Inactivity vs Diabetes data to perform this analysis, as it has the maximum number of data points available.
The polynomial regression is performed using the sklearn package, which provides the built-in function PolynomialFeatures() that lets us build the matrix of polynomial features with the polynomial degree as a parameter.
Then we create the regression model using the LinearRegression() function and fit the model to our data. Once the model is created, it is a fairly straightforward process of using the model to predict the values and using this prediction to calculate the R-squared value for each degree.
Outputs:
As expected, the R-squared value shows a slight progressive improvement with each successive degree, but at the same time it is clear from the plot that the model progressively tends to be overfitted, making it less effective for prediction.
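The workflow described in this post can be sketched as follows. This is an illustrative reconstruction rather than the notebook code: the data is synthetic, the degrees 1 to 4 are an arbitrary choice, and the R-squared values are computed on the same data used for fitting.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in for the inactivity (x) and diabetes (y) percentages.
rng = np.random.default_rng(0)
x = rng.uniform(8.0, 20.0, size=300).reshape(-1, 1)
y = 2.0 + 0.4 * x.ravel() + rng.normal(scale=1.0, size=300)

r2_by_degree = {}
for degree in (1, 2, 3, 4):
    # Expand x into the columns [x, x^2, ..., x^degree].
    features = PolynomialFeatures(degree=degree, include_bias=False).fit_transform(x)
    model = LinearRegression().fit(features, y)
    # R-squared measured on the training data itself.
    r2_by_degree[degree] = r2_score(y, model.predict(features))
    print(degree, round(r2_by_degree[degree], 4))
```

Because each higher-degree model contains the lower-degree one as a special case, the training R-squared cannot go down as the degree increases, even when the extra terms are only fitting noise.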
T Test
Today, I’m exploring the T-test and its significance. The T-test is a type of hypothesis test and a very important tool in data science. A hypothesis is any testable assumption about the data set, and hypothesis testing allows us to validate these assumptions.
The T-test is predominantly used to understand whether the difference in means of two datasets has any statistical significance. For a T-test to provide any meaningful insights, the datasets have to satisfy the following conditions:
Hypotheses:
T Test code
Results:
Reject the null hypothesis. There is a significant difference between the datasets. T-statistic: -8.586734600367794 P-value: 1.960253729590773e-17
Note:
This T-test does not provide any meaningful insights, as two of the requisite conditions are violated.
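For readers who want to reproduce this kind of result, a two-sample T-test can be run as sketched below. The details are assumptions: the samples are synthetic, the null hypothesis is taken to be "the two means are equal", the significance level is 0.05, and Welch's variant (equal_var=False) is used, which may differ from what the original code did.

```python
import numpy as np
from scipy import stats

# Two synthetic samples with different means; not the project data.
rng = np.random.default_rng(0)
sample_a = rng.normal(loc=16.0, scale=2.0, size=300)
sample_b = rng.normal(loc=17.5, scale=2.5, size=300)

# equal_var=False selects Welch's t-test, which does not assume equal variances.
t_stat, p_value = stats.ttest_ind(sample_a, sample_b, equal_var=False)

alpha = 0.05  # assumed significance level
if p_value < alpha:
    print("Reject the null hypothesis. There is a significant difference between the datasets.")
else:
    print("Fail to reject the null hypothesis.")
print("T-statistic:", t_stat, "P-value:", p_value)
```

A negative T-statistic simply means the first sample has the smaller mean.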
Today, I explored validation techniques for smaller data sets, namely K-Fold cross validation.
To start, the linear regression model was retrained using 70% of the data as the training set and 30% as the test set. Here are the results obtained:
As we can see, the model shows similar performance with r-squared = 0.38 approx
Now the same model was tested again using K-fold cross-validation with 5 folds. Here are the results for the linear and polynomial regression models:
The linear and polynomial models both show similar mean r-squared values of 0.30, which is lower than the score obtained without using cross-validation.
The polynomial regression score will tend to increase with higher degrees of polynomial if we validate using the test data as it leads to overfitting
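The two validation approaches compared in this post can be sketched as below. The data is synthetic and the random seeds are arbitrary, so the scores will not match the ones reported above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# Synthetic stand-in for two predictors and a response.
rng = np.random.default_rng(0)
X = rng.uniform(10.0, 40.0, size=(350, 2))
y = 1.5 + 0.2 * X[:, 0] + 0.1 * X[:, 1] + rng.normal(scale=1.5, size=350)

# Single 70/30 hold-out split.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
holdout_r2 = LinearRegression().fit(X_train, y_train).score(X_test, y_test)

# 5-fold cross-validation: every row is used for testing exactly once.
folds = KFold(n_splits=5, shuffle=True, random_state=0)
fold_r2 = cross_val_score(LinearRegression(), X, y, cv=folds, scoring="r2")

print("hold-out R^2:", round(holdout_r2, 3))
print("fold R^2:", np.round(fold_r2, 3), "mean:", round(fold_r2.mean(), 3))
```

The spread of the five fold scores gives a sense of how much a single hold-out score can move around on a small data set.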
Today, we were introduced to the idea of linear regression using multiple variables. This technique is essential when we have more than one predictor, especially if they are highly correlated, like the Obesity and Inactivity data.
The multi-variable linear regression model was built using the sklearn package (scikit-learn), which provides many built-in functions for linear regression.
Expression: Y = B0 + B1.X_obesity + B2.X_inactivity + e
Result for 2 variables:
B0: 1.6535991518559392
Coefficients: [('B1', 0.23246991917672563), ('B2', 0.11106296576800405)]
R^2 = 0.34073967115731385
As expected, there is an improvement in the R-squared value as compared to single variable model.
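A minimal sketch of fitting this two-predictor model with scikit-learn is shown below. The data is synthetic (generated so that obesity and inactivity are correlated), so the coefficients are only illustrative and will not match the result above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-ins for the three percentage columns.
rng = np.random.default_rng(0)
n = 350
inactivity = rng.normal(15.0, 2.0, size=n)
# Obesity is built to be correlated with inactivity, as in the post.
obesity = 10.0 + 0.6 * inactivity + rng.normal(0.0, 1.0, size=n)
diabetes = 1.5 + 0.25 * obesity + 0.1 * inactivity + rng.normal(0.0, 0.7, size=n)

# Column order matches the expression: B1 multiplies obesity, B2 multiplies inactivity.
X = np.column_stack([obesity, inactivity])
model = LinearRegression().fit(X, diabetes)

print("B0:", model.intercept_)
print("Coefficients:", list(zip(["B1", "B2"], model.coef_)))
print("R^2 =", model.score(X, diabetes))
```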
Now, we introduce one more predictor variable as the product of inactivity and obesity:
Y = B0 + B1*X_inactivity + B2*X_obesity + B3*X_inactivity * X_obesity + e
Result:
B0: -10.06467453049291 Coefficients: [('INACTIVE', 1.1533595376219894), ('OBESE', 0.743039972401428), ('Obese x inactive', -0.049635909945020235)] R-squared = 0.36458725864661756
As expected, the performance has increased again, albeit by a very small margin.
Now, let’s try adding two more predictors: the squares of inactivity and obesity (X1^2 and X2^2).
Y = B0 + B1*X_inactivity + B2*X_obesity + B3*X_inactivity * X_obesity + B4*X_inactivity^2 + B5*X_obesity^2 + e
Result:
B0: -11.590588857309138 Coefficients: [('INACTIVE', 0.47789170236400547), ('OBESE', 1.489656837325879), ('Obese x inactive', 0.01970764143007776), ('Inactivity sqaure', -0.01973523748870601), ('Obesity square', -0.04942722743255474)] Score = 0.38471232946078504
The score has improved yet again.
It seems that adding higher powers of the predictors is an effective way of improving the score of the model, although the model is no longer linear in the predictors and is now a quadratic model. But repeating this process indefinitely to get a nearly perfect score can lead to overfitting, rendering the model ineffective at predicting new data.
To properly validate, we need to test the model accuracy on new data that was not used for training, but as we have limited data available, other validation techniques must be explored.
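The progression of models in this post (two predictors, then the interaction term, then the squared terms) can be compared with a short NumPy sketch. The data is synthetic and the helper r_squared is made up for illustration; x1 and x2 stand in for inactivity and obesity.

```python
import numpy as np

def r_squared(X, y):
    """R^2 of an ordinary least squares fit of y on the columns of X plus an intercept."""
    design = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ beta
    return 1.0 - residuals @ residuals / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(0)
x1 = rng.normal(15.0, 2.0, size=350)                    # stand-in for inactivity
x2 = 10.0 + 0.6 * x1 + rng.normal(0.0, 1.0, size=350)   # stand-in for obesity
y = 1.5 + 0.1 * x1 + 0.25 * x2 + rng.normal(0.0, 0.7, size=350)

two_var = np.column_stack([x1, x2])
with_interaction = np.column_stack([x1, x2, x1 * x2])
quadratic = np.column_stack([x1, x2, x1 * x2, x1**2, x2**2])

scores = {}
for name, X in [("two variables", two_var), ("+ interaction", with_interaction), ("+ squares", quadratic)]:
    scores[name] = r_squared(X, y)
    print(name, round(scores[name], 4))
```

Here the true relationship has no interaction or squared terms at all, yet the in-sample score still creeps upward as columns are added. That is why a rising R-squared on the fitting data is not, by itself, evidence of a better model.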
In the first class I was introduced to the concept of linear regression and how to model a simple predictor function using this technique. My first thought was to code a linear regression model for the CDC diabetes data set for each of the predictive factors, i.e., Diabetes vs Obesity and Diabetes vs Inactivity separately.
This was more challenging than I had expected because of my limited experience with data analysis techniques in Python. I spent a considerable amount of time trying to merge the data and get it into the form that was most suitable for applying the linear regression model.
Once the data was successfully transformed, it was a straightforward task to get the summary statistics of each of the predictors separately.
It was interesting to observe that the relation between Diabetes and Obesity is more heteroskedastic in nature, i.e., as the obesity % increases, the variance of the data also increases, which is rather counterintuitive, as you would expect counties with higher obesity % to have more diabetic people, whereas the relation between Diabetes and Inactivity is more homoskedastic, which stands to reason.
Furthermore, there is a significant positive correlation between the predictors – 75% which is also expected as inactivity tends to cause obesity
I built two linear regression models based on each of the predictors independently.
As expected, the linear regression model built with inactivity is almost twice as good as the one built with obesity, due to the more skewed nature of the obesity data.
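A rough sketch of this comparison, using synthetic data in place of the CDC set, is shown below. The helper simple_r2 is made up for illustration, and the numbers it prints have no connection to the real results.

```python
import numpy as np

def simple_r2(x, y):
    """R^2 of a one-predictor least squares line fitted with np.polyfit."""
    slope, intercept = np.polyfit(x, y, deg=1)
    residuals = y - (slope * x + intercept)
    return 1.0 - np.sum(residuals**2) / np.sum((y - y.mean()) ** 2)

# Synthetic stand-ins for the three percentage columns.
rng = np.random.default_rng(0)
inactivity = rng.normal(15.0, 2.0, size=350)
obesity = 10.0 + 0.6 * inactivity + rng.normal(0.0, 1.0, size=350)
diabetes = 1.5 + 0.1 * inactivity + 0.25 * obesity + rng.normal(0.0, 0.7, size=350)

print("correlation between predictors:", np.corrcoef(inactivity, obesity)[0, 1])
print("R^2, diabetes vs inactivity:", simple_r2(inactivity, diabetes))
print("R^2, diabetes vs obesity:", simple_r2(obesity, diabetes))
```

For a single predictor, R-squared equals the squared Pearson correlation between the predictor and the response, so the two models can also be compared directly from their correlations.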