Oct 13 – K-Means clustering for location

K-Means clustering for location

 

K-Means clustering is a powerful tool for location analysis, enabling the grouping of geospatial data points into clusters based on their proximity or similarity. This technique is valuable for segmenting locations, understanding market trends, optimizing resource allocation, and making informed decisions in various fields, from retail and urban planning to healthcare and environmental analysis.

 

First, let us create a scatter plot of the latitude and longitude values to get a feel for how the locations are distributed.
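A minimal sketch of the plotting step, assuming the coordinates live in latitude and longitude columns of a pandas DataFrame (the file name below is only a placeholder):

    # Sketch: scatter plot of the location data (file/column names are assumed)
    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("shootings.csv")            # hypothetical file name
    df = df.dropna(subset=["latitude", "longitude"])

    plt.scatter(df["longitude"], df["latitude"], s=5, alpha=0.5)
    plt.xlabel("Longitude")
    plt.ylabel("Latitude")
    plt.title("Locations of recorded incidents")
    plt.show()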

 

Now let us run K-Means clustering with 20 clusters. A sketch of the code is shown below, followed by the cluster centers obtained.
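This is only a sketch of how the clustering could be done with scikit-learn's KMeans; the file and column names are assumptions, and 20 clusters is used to match the output listed below:

    # Sketch: K-Means on the coordinate pairs (file/column names are assumed)
    import pandas as pd
    from sklearn.cluster import KMeans

    df = pd.read_csv("shootings.csv").dropna(subset=["latitude", "longitude"])
    coords = df[["latitude", "longitude"]].values

    kmeans = KMeans(n_clusters=20, n_init=10, random_state=0).fit(coords)

    for i, (lat, lon) in enumerate(kmeans.cluster_centers_, start=1):
        print(f"Cluster {i} Center: Latitude {lat}, Longitude {lon}")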

 

Cluster 1 Center: Latitude 40.67427972027972, Longitude -87.77466608391609
Cluster 2 Center: Latitude 41.12909739130435, Longitude -121.12838608695652
Cluster 3 Center: Latitude 37.49454341736695, Longitude -84.15862464985995
Cluster 4 Center: Latitude 34.17090650406504, Longitude -105.33088211382113
Cluster 5 Center: Latitude 34.66760546875, Longitude -117.77351302083333
Cluster 6 Center: Latitude 32.234090452261306, Longitude -82.29514070351759
Cluster 7 Center: Latitude 29.66034642857143, Longitude -95.39037857142857
Cluster 8 Center: Latitude 40.786705596107055, Longitude -74.44290510948905
Cluster 9 Center: Latitude 38.916165745856354, Longitude -76.20114640883978
Cluster 10 Center: Latitude 40.951282460136675, Longitude -106.63796355353075
Cluster 11 Center: Latitude 62.47753488372093, Longitude -149.4990465116279
Cluster 12 Center: Latitude 28.721195286195286, Longitude -81.8052558922559
Cluster 13 Center: Latitude 20.828588235294117, Longitude -156.97926470588237
Cluster 14 Center: Latitude 42.08931086142322, Longitude -94.13923595505618
Cluster 15 Center: Latitude 47.371468965517245, Longitude -122.21451034482759
Cluster 16 Center: Latitude 31.395452173913043, Longitude -97.51444057971014
Cluster 17 Center: Latitude 31.789996710526317, Longitude -88.35704276315789
Cluster 18 Center: Latitude 40.23007619047619, Longitude -81.17313015873016
Cluster 19 Center: Latitude 35.52384331797235, Longitude -94.6525529953917
Cluster 20 Center: Latitude 33.27761455525606, Longitude -112.77275471698113


Oct 11 – Police shooting data overview

Police Shooting Data Overview

We have been given the Washington Post's report on police shooting data. Today, we will plot the trends in the significant columns to gain some basic insights into the data.

Columns: ['id', 'name', 'date', 'manner_of_death', 'armed', 'age', 'gender', 'race', 'city', 'state', 'signs_of_mental_illness', 'threat_level', 'flee', 'body_camera', 'longitude', 'latitude', 'is_geocoding_exact']
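A rough sketch of how such an overview could be produced with pandas; the file name is an assumption, and the columns chosen for bar plots are just the categorical ones from the list above:

    # Sketch: quick look at the distribution of some key columns (file name assumed)
    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("fatal-police-shootings-data.csv")   # assumed file name

    for col in ["manner_of_death", "gender", "race", "threat_level", "flee", "body_camera"]:
        df[col].value_counts(dropna=False).plot(kind="bar", title=col)
        plt.tight_layout()
        plt.show()

    # Age is continuous, so a histogram is more informative than a bar chart
    df["age"].plot(kind="hist", bins=30, title="age")
    plt.show()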


Week 4: 2 Oct – Understanding T-test II

Understanding T-test II

In the last post, we saw what a T-distribution and a normal distribution are. Now, let us look at some key features.

In a normal distribution, data tends to cluster around the mean: roughly 68% of the data lies within 1 standard deviation of the mean and about 99.7% lies within 3 standard deviations. As one moves farther away from the mean, the frequency of data points falls off rapidly. This implies that the probability of an event occurring is closely tied to its proximity to the mean value. This correlation is of paramount importance because it underscores the mean’s effectiveness as a precise descriptor of the distribution.
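A quick simulation sketch that illustrates this 68 / 95 / 99.7 rule on synthetic normal data:

    # Sketch: checking the empirical rule with simulated standard-normal data
    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(loc=0, scale=1, size=1_000_000)

    for k in (1, 2, 3):
        frac = np.mean(np.abs(x) <= k)
        print(f"within {k} standard deviation(s): {frac:.3%}")
    # prints roughly 68.3%, 95.4%, and 99.7%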

Understanding the mean value provides valuable insights into the population and its behavior. This is precisely why normality is crucial for conducting a T-Test. When dealing with a population that does not exhibit a normal distribution, there is no assurance that the population mean carries inherent significance on its own. Consequently, knowledge of the mean may provide little to no meaningful information about the dataset. In such cases, conducting a t-test becomes a futile exercise because determining whether the difference in means is statistically significant offers no meaningful insights when the means themselves lack significance.

 

Central Limit Theorem

The central limit theorem states that as we sample data from any population, regardless of the population distribution, the distribution of the sample means tends towards a normal distribution as the sample size increases, i.e., given a sufficiently large sample size from any distribution, the sample means will be approximately normally distributed.

The Central Limit Theorem plays a pivotal role in the widespread application of T-tests. As previously discussed, T-tests are most effective when applied to populations that exhibit a normal distribution. However, according to the Central Limit Theorem, for any given population, if we collect a sufficiently large number of random samples from it, the distribution of the sample means tends to follow a normal distribution. This phenomenon allows us to apply T-tests to the derived sample means, even when the original population may not be normally distributed.
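A small simulation sketch of this effect, using an exponential (clearly non-normal) population as an illustrative example:

    # Sketch: sample means from a skewed (exponential) population look normal
    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    population = rng.exponential(scale=2.0, size=100_000)   # clearly non-normal

    # Draw many samples of size 50 and record each sample's mean
    sample_means = [rng.choice(population, size=50).mean() for _ in range(5_000)]

    plt.hist(sample_means, bins=50)
    plt.title("Distribution of sample means (n = 50)")
    plt.show()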


Week 3: 29 Sept – Understanding T-test I

Understanding T-test

We have already explored the T-test and its role in understanding the statistical significance of a distribution's mean. For a t-test to give a meaningful result, the distributions must satisfy the following conditions (a code sketch for checking them follows the list):

    • The data sets must be normally distributed, i.e., the shape must resemble a bell curve to an extent.
    • The data points are independent and the variable is continuous, i.e., the measurement scale for the data should follow a continuous pattern.
    • The variance of the data in both sample groups is similar, i.e., the samples have almost equal standard deviations.
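A short sketch of how these conditions could be checked, and the t-test run, with scipy; the two sample arrays below are made-up placeholders rather than real data:

    # Sketch: checking the assumptions and running a two-sample t-test
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    group_a = rng.normal(loc=10.0, scale=2.0, size=40)   # hypothetical sample 1
    group_b = rng.normal(loc=11.0, scale=2.1, size=40)   # hypothetical sample 2

    print(stats.shapiro(group_a))            # normality check for each sample
    print(stats.shapiro(group_b))
    print(stats.levene(group_a, group_b))    # equal-variance check

    t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=True)
    print(t_stat, p_value)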

Today, we will take a deep dive into these to understand why these conditions are necessary. But first, let us understand what a T-Distribution is.

T Distribution

  • The t-distribution, also known as the Student’s t-distribution, is a probability distribution that is similar in shape to the standard normal distribution (bell-shaped curve).
  • The key feature of the t-distribution is that it has heavier tails compared to the normal distribution. The shape of the t-distribution depends on a parameter called degrees of freedom (df).
  • As the sample size increases, the t-distribution approaches the standard normal distribution.
  • In hypothesis testing with the t-test, the t-distribution is used as a reference distribution to determine the critical values for a specified level of significance (alpha) and degrees of freedom.

 

The normal distribution (z-distribution) is essentially the limiting case of the t-distribution as the degrees of freedom grow large. What is important for us, however, are certain properties that are common to both but are more prominent in the normal distribution.
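A small sketch that visualizes this with scipy, plotting the t-distribution for a few degrees of freedom against the standard normal to show the heavier tails:

    # Sketch: t-distribution (heavier tails) vs. the standard normal
    import numpy as np
    import matplotlib.pyplot as plt
    from scipy import stats

    x = np.linspace(-4, 4, 400)
    plt.plot(x, stats.norm.pdf(x), label="standard normal")
    for df in (2, 5, 30):
        plt.plot(x, stats.t.pdf(x, df), label=f"t, df = {df}")
    plt.legend()
    plt.title("t-distribution approaches the normal as df grows")
    plt.show()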


Week 3: Sept 27 – Breusch-Pagan Test

 

Breusch-Pagan Test

In the world of data science and regression analysis, the Breusch-Pagan test is like a detective tool that helps us investigate an important issue called “heteroscedasticity.” Let me break it down for you.

Heteroscedasticity is a fancy term for a situation where things are not as tidy as we’d like in a regression analysis. Specifically, it’s when the spread of your residuals changes as you move along the independent variables. High heteroscedasticity would mean that your prediction errors (the differences between your predictions and the actual values) vary differently across different inputs. Some predictions might be pretty close, while others are way off.

The Breusch-Pagan test is used for detecting this variability issue. Here’s how it works (a code sketch follows the steps):

  1. Build your regression model: You start by creating a regression model that tries to predict something, like housing prices.
  2. Calculate residuals: Residuals are the differences between your predictions and the actual prices for each house.
  3. Squared Residuals: You square those residuals. This step emphasizes larger errors more than smaller ones.
  4. Second Regression: Next, you build a new mini-regression model. This time, you use the squared residuals as your “dependent variable” (the thing you’re trying to predict), and the same predictors you used in your original model.
  5. Hypothesis Testing: You perform a hypothesis test to see if your predictors are related to the squared residuals. If they are, it’s a sign that heteroscedasticity might be present.
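A rough sketch of the whole procedure using statsmodels’ het_breuschpagan; the data below is synthetic, generated only to illustrate the call, not the actual housing data:

    # Sketch: Breusch-Pagan test with statsmodels (X and y are synthetic placeholders)
    import numpy as np
    import statsmodels.api as sm
    from statsmodels.stats.diagnostic import het_breuschpagan

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    # Noise whose spread grows with X[:, 0] -> heteroscedastic by construction
    y = 3 + 2 * X[:, 0] - X[:, 1] + rng.normal(scale=1 + np.abs(X[:, 0]))

    X_const = sm.add_constant(X)
    model = sm.OLS(y, X_const).fit()

    lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(model.resid, X_const)
    print("LM p-value:", lm_pvalue)   # a small p-value suggests heteroscedasticity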

If the Breusch-Pagan test suggests heteroscedasticity is happening, it means our original regression model isn’t performing as well as we thought.

So, as data scientists, we would want to investigate further, maybe try different modeling techniques or transform the data to make the errors more consistent across the board. The goal is to have a model that’s as accurate as possible for all cases, not just some.

In a nutshell, the Breusch-Pagan test helps us spot when the “scatter” of our errors isn’t the same for all data points, and that’s a signal for us to dig deeper and refine our models.

Week 3 – Sept 25th: Plotting polynomial regression models

Plotting polynomial regression models

Today, we will attempt to plot polynomial regression models of different degrees and compare the regression lines and the R-squared values. We will be using the Inactivity vs Diabetes data to perform this analysis as it has the maximum number of data points available.

Polynomial regression is performed using the sklearn package, which provides the built-in class PolynomialFeatures() that generates the polynomial feature matrix, with the polynomial degree as a parameter.

 

Then we create the regression model using the LinearRegression() class and fit the model to our data. Once the model is fitted, it is a fairly straightforward process to use the model to predict values and use these predictions to calculate the R-squared value for each degree. A sketch of this workflow is shown below.
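This is only a sketch of the workflow described above; synthetic placeholder data stands in for the actual inactivity (x) vs diabetes (y) values:

    # Sketch: polynomial regression of several degrees with R-squared comparison
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import r2_score

    # Placeholder data standing in for the inactivity (x) vs diabetes (y) values
    rng = np.random.default_rng(0)
    x = rng.uniform(15, 35, size=300)
    y = 0.3 * x + rng.normal(scale=2.0, size=300)

    plt.scatter(x, y, s=10, alpha=0.4)
    for degree in (1, 2, 3, 4):
        X_poly = PolynomialFeatures(degree=degree).fit_transform(x.reshape(-1, 1))
        model = LinearRegression().fit(X_poly, y)
        y_pred = model.predict(X_poly)
        print(f"degree {degree}: R-squared = {r2_score(y, y_pred):.4f}")

        order = np.argsort(x)
        plt.plot(x[order], y_pred[order], label=f"degree {degree}")
    plt.legend()
    plt.show()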

 

Outputs:


As expected, the R-squared value shows a slight progressive improvement with each successive degree, but at the same time it is clear from the plot that the model progressively tends to overfit, making it less effective for prediction.