ANOVA test (Week 8: Monday)

Analysis of Variance (ANOVA) is a statistical method used to analyze the differences among group means in a sample. Can it help me determine whether there are any significant differences in the ages of people shot across different races in the Washington Post data that I have? Let’s see.

  • Null Hypothesis (H0): There is no significant difference in the means of age across different races.
    • (In the frequentist interpretation, a small p-value suggests that data as extreme as what we observed would be unlikely if the null hypothesis were true, leading to the rejection of the null hypothesis. It’s called ‘frequentist’ because it considers probabilities as frequencies of events occurring over repeated experiments. This contrasts with the Bayesian approach, where probabilities can also represent degrees of uncertainty.)
  • Alternative Hypothesis (H1): There is a significant difference in the means of age across different races.

There are a few assumptions of ANOVA:

  1. The data in each group is normally distributed.
  2. The variance of each group is approximately equal (homogeneity of variances).

Let’s do the Shapiro-Wilk test for normality and Levene’s test for homogeneity of variances.
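Here is a minimal sketch of how these two checks could be run with scipy, assuming the data sits in a pandas DataFrame with “age” and “race” columns (the file and column names are my assumptions):

```python
# Sketch: checking the ANOVA assumptions with scipy (column names assumed).
import pandas as pd
from scipy import stats

df = pd.read_csv("fatal-police-shootings-data.csv")  # hypothetical file name
df = df.dropna(subset=["age", "race"])

# Shapiro-Wilk test for normality, run separately for each racial group.
for race, group in df.groupby("race"):
    stat, p = stats.shapiro(group["age"])
    print(f"{race}: W={stat:.3f}, p={p:.4f}")

# Levene's test for homogeneity of variances across all groups.
samples = [group["age"].values for _, group in df.groupby("race")]
stat, p = stats.levene(*samples)
print(f"Levene: W={stat:.3f}, p={p:.4f}")
```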

Levene’s test

Since the homogeneity of variances is not met and normality is not detected in any major group, there are concerns about the robustness of the ANOVA results. I still proceeded with ANOVA out of curiosity and obtained the following results.

ANOVA test

Considering the violations of the assumptions, I must interpret these results with caution. While the results suggest significant differences, their reliability is questionable, although I intuitively believe them after eyeballing the data for a long time.

I am considering additional analyses, such as Welch’s ANOVA or non-parametric tests, to see if the results are consistent across different methods.
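As a reference for those follow-ups, here is a sketch under the same assumptions (a DataFrame with “age” and “race” columns): scipy provides the classic one-way ANOVA and the non-parametric Kruskal-Wallis test, while Welch’s ANOVA is available in the third-party pingouin package.

```python
# Sketch: one-way ANOVA plus alternatives (column and file names assumed).
import pandas as pd
from scipy import stats

df = pd.read_csv("fatal-police-shootings-data.csv").dropna(subset=["age", "race"])
samples = [group["age"].values for _, group in df.groupby("race")]

# Classic one-way ANOVA (assumes normality and equal variances).
f_stat, p = stats.f_oneway(*samples)
print(f"ANOVA: F={f_stat:.3f}, p={p:.4f}")

# Kruskal-Wallis: non-parametric alternative, no normality assumption.
h_stat, p = stats.kruskal(*samples)
print(f"Kruskal-Wallis: H={h_stat:.3f}, p={p:.4f}")

# Welch's ANOVA (no equal-variance assumption), if pingouin is installed:
# import pingouin as pg
# pg.welch_anova(data=df, dv="age", between="race")
```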

Heat Map

I focused on the “armed” and “race” columns when creating a Python heatmap for the police-shootings dataset as part of my investigation. I used matplotlib, seaborn, and pandas to visualise the distribution of racial groups by armed status. I made the heatmap more precise so that it only displayed the values “gun,” “knife,” and “unarmed.” The resulting chart, which was primarily coloured red, gave a clear picture of these particular armed statuses and racial populations. This helped me understand how heatmaps may be tailored to extract relevant information from large, complicated datasets, improving my data visualisation skills.
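A minimal sketch of how such a heatmap could be built, assuming the dataset exposes “race” and “armed” columns (the file name is a placeholder):

```python
# Sketch: heatmap of race vs. selected armed statuses (column names assumed).
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("fatal-police-shootings-data.csv")  # hypothetical file name

# Keep only the armed statuses of interest.
subset = df[df["armed"].isin(["gun", "knife", "unarmed"])]

# Cross-tabulate counts of race against armed status.
counts = pd.crosstab(subset["race"], subset["armed"])

# Red colour map with the counts annotated in each cell.
sns.heatmap(counts, annot=True, fmt="d", cmap="Reds")
plt.title("Race vs. armed status")
plt.tight_layout()
plt.show()
```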

Choosing The Right Clustering Algorithm

So far I’ve looked into 3 clustering algorithms (DBSCAN, K-means, and Hierarchical clustering). I wanted to know the advantages and disadvantages of using each of them to pick the right one for my project.

K-means: I found this algorithm to be the simplest; it’s easy to understand and implement, and it’s intuitive. I have also noticed that it converges quickly, so it’s efficient for large datasets. And it works really well for spherical clusters, when clusters have a roughly equal number of points.

Disadvantages:

  • It is sensitive to initial centroid choice
  • There is an assumption that clusters are spherical. So it may struggle with clusters that are not shaped regularly in the feature space
  • It requires the number of clusters to be specified. Choosing the wrong number can lead to suboptimal results.

Example: Customer segmentation in e-commerce based on purchasing behavior.
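As a quick illustration of the API, here is a sketch of K-means on synthetic two-feature data standing in for customer segments (the data and parameters are made up):

```python
# Sketch: K-means on synthetic 2-D "customer" data.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Three roughly spherical blobs standing in for customer segments.
X = np.vstack([
    rng.normal(loc=[20, 30], scale=3, size=(50, 2)),
    rng.normal(loc=[60, 70], scale=3, size=(50, 2)),
    rng.normal(loc=[90, 20], scale=3, size=(50, 2)),
])

# The number of clusters must be chosen up front; n_init repeats the
# random centroid initialisation to reduce sensitivity to the starting point.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)
print(kmeans.cluster_centers_)
```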

DBSCAN (Density-Based Spatial Clustering of Applications with Noise): This algorithm doesn’t require specifying the number of clusters. DBSCAN can discover clusters of different shapes and sizes easily. It is also robust to outliers, as it can identify noise points. It doesn’t assume that clusters are spherical, so it works well with complex shapes.

Disadvantages:

  • It is sensitive to density variations. If clusters have varying densities, the performance will degrade.
  • It struggles with high-dimensional data.
  • The choice of parameters (epsilon and minPoints) can influence results.

Example: Identifying clusters in a geographical dataset where the density of points may vary.

Hierarchical Clustering: It produces a hierarchy of clusters, which provides insights at different granularities.
We don’t need to specify the number of clusters here either. It gives you a dendrogram that can be studied for various insights. It can handle irregularly shaped clusters well.

Disadvantages:

  • It is computationally intensive for large datasets.
  • It’s sensitive to noise.
  • Dendrograms may be difficult to interpret.

Example: Biological taxonomy, where species are grouped based on similarities.

So the choice between these clustering algorithms depends on the characteristics of your data and the goals of your analysis. I am eager to use each of these at least once in my project and compare the results for a better understanding.

Hierarchical Clustering (Week 7 – Friday)

In Hierarchical Clustering, you get clusters very similar to K-means; in fact, sometimes the result can be exactly the same as K-means clustering. The whole process is a bit different, though. There are two types, Agglomerative and Divisive: Agglomerative is the bottom-up approach, and Divisive is the opposite (top-down). I focused mainly on the Agglomerative approach today.

Step 1: Make each data point a single point cluster, forming N clusters.

Step 2: Take the two closest data points and make them one cluster. That forms N-1 clusters.

Step 3: Then take the two closest clusters and make them one cluster. That forms N-2 clusters.

Step 4: Repeat 3 until there is only one huge cluster.

Step 5: Finish.

Closeness of clusters is different from closeness of data points, which you can measure using techniques like the Euclidean distance between the points. For clusters, you need a linkage criterion, such as the distance between the clusters’ closest points, farthest points, or centroids.

I learnt about dendrograms, where the vertical axis tells you the Euclidean distance between the clusters being merged and the horizontal axis shows the data points. So the higher the lines, the more dissimilar the clusters are. We can set a dissimilarity threshold, and the clusters we need are the largest ones below that threshold; we find them by counting the number of vertical lines the threshold cuts in our dendrogram.

In the dendrogram, intuitively, the largest vertical distance you can take without touching the horizontal lines is usually where the threshold lies.

I used the Ward method to calculate the distance. For each candidate merge, this method combines two clusters, estimates the centroid of the merged cluster, and looks at the sum of the squared deviations of all the points from that new centroid. Different merges will have different deviations, and it picks the merge with the smallest increase in this deviation.
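Here is a minimal sketch of this process with scipy on synthetic data: Ward linkage builds the merge tree, the dendrogram shows the merge distances, and cutting at a threshold gives the flat clusters (the data and the threshold are made up).

```python
# Sketch: agglomerative clustering with Ward linkage and a dendrogram.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster

rng = np.random.default_rng(0)
# Three small synthetic groups of 2-D points.
X = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(20, 2)),
    rng.normal(loc=[5, 5], scale=0.5, size=(20, 2)),
    rng.normal(loc=[0, 5], scale=0.5, size=(20, 2)),
])

# Ward linkage: each merge minimises the increase in within-cluster variance.
Z = linkage(X, method="ward")

# Vertical axis = merge distance; long vertical gaps suggest where to cut.
dendrogram(Z)
plt.ylabel("Merge distance (Ward)")
plt.show()

# Cut the tree at a chosen distance threshold to obtain flat clusters.
labels = fcluster(Z, t=10, criterion="distance")
print(labels)
```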

Monte Carlo Simulation (Week 7: Wednesday)

I learnt about something called a Monte Carlo Simulation today. A Monte Carlo Simulation is a mathematical approach used to approximate the potential outcomes of events characterized by uncertainty, thereby enhancing decision-making processes.

How do they work? You construct a model of a system whose outcomes are uncertain owing to the intervention of random variables. Leveraging random sampling, the technique generates numerous potential outcomes and computes the average result.

How to run one? To initiate a Monte Carlo Simulation, a three-step process is followed:

  1. Establishment of Predictive Model: Define the dependent variable to be predicted and identify independent variables.
  2. Probability Distribution of Independent Variables: Utilize historical data to delineate a range of plausible values for independent variables and allocate weights accordingly.
  3. Iterative Simulation Runs: Conduct simulations iteratively by generating random values for independent variables until a representative sample is obtained, encompassing a substantial number of potential combinations.

The precision of the sampling range and the accuracy of the estimates are directly proportional to the number of samples. In essence, a higher number of samples yields a more refined sampling range, consequently enhancing the accuracy of the estimates.
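A tiny sketch of the three steps on a made-up model (the profit formula, the distributions, and all the numbers are assumptions for illustration only): the predictive model is profit = units_sold * margin - fixed_costs, the independent variables get plausible distributions, and the simulation is run with an increasing number of samples to show the estimates settling down.

```python
# Sketch: a tiny Monte Carlo simulation following the three steps above.
# The model, distributions, and numbers are all hypothetical.
import numpy as np

rng = np.random.default_rng(1)

def simulate(n_runs: int) -> np.ndarray:
    # Step 2: plausible distributions for the independent variables.
    units_sold = rng.normal(loc=1000, scale=100, size=n_runs)
    margin = rng.uniform(low=2.0, high=4.0, size=n_runs)
    fixed_costs = 1500.0
    # Steps 1 and 3: evaluate the predictive model for every random draw.
    return units_sold * margin - fixed_costs

# More runs give a more stable estimate of the average outcome.
for n in (100, 10_000, 1_000_000):
    profits = simulate(n)
    print(f"n={n:>9}: mean profit = {profits.mean():.1f}, "
          f"P(loss) = {(profits < 0).mean():.4f}")
```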

DBSCAN (Week 7 : Monday)

I learnt how and when to implement DBSCAN (Density-Based Spatial Clustering of Applications with Noise) today. It is a clustering algorithm used to identify clusters of data points in a space based on their density. It doesn’t really need us to specify the number of clusters beforehand like in k-means, and it can discover clusters of arbitrary shapes.

A cluster, according to DBSCAN, is a dense region of data points separated by sparser regions. It classifies points into three categories: core, border, and noise points.

A core point has at least a minimum number of neighboring points (min_samples) within a specified distance (epsilon). A border point has fewer neighbors than min_samples but falls within the neighborhood of a core point. The rest are noise points and don’t belong to any cluster. DBSCAN randomly selects a data point first, and if it’s a core point, it starts a new cluster and puts all the neighbors reachable from it into this cluster. These neighbors can be core or border points. It repeats the same process for the neighbors, adding their reachable neighbors to the cluster, and continues until no more points can be added. Then it moves to an unvisited point and repeats the whole thing. Being robust to outliers is one of its main advantages.
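A minimal sketch of DBSCAN with scikit-learn on synthetic data (the eps and min_samples values are arbitrary choices for illustration):

```python
# Sketch: DBSCAN on synthetic data with noise.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(3)
# Two dense blobs plus a scattering of sparse "noise" points.
X = np.vstack([
    rng.normal(loc=[0, 0], scale=0.3, size=(100, 2)),
    rng.normal(loc=[4, 4], scale=0.3, size=(100, 2)),
    rng.uniform(low=-2, high=6, size=(20, 2)),
])

# eps is the neighbourhood radius (epsilon); min_samples is minPoints.
db = DBSCAN(eps=0.5, min_samples=5)
labels = db.fit_predict(X)

# Points labelled -1 are the noise points DBSCAN could not assign.
print("clusters found:", len(set(labels)) - (1 if -1 in labels else 0))
print("noise points:", int((labels == -1).sum()))
```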

The Elbow Method (Week 6 Friday)

Today, I learned a little more about clustering, specifically a concept called the “Elbow Method.” To figure out how many clusters we need, I plotted a graph of something called WCSS, which is a way to see how close the points in a group are to each other. It helped me figure out the best number of groups, the point where adding another group doesn’t help much; we call this point the “elbow.”

Essentially, WCSS (within-cluster sum of squares) is a measure of how tidy our clusters are: it adds up the squares of the distances between each point and the center of its cluster. A lower WCSS gives us neater clusters. As we increase the number of clusters, WCSS usually goes down, so we use the elbow method to find just the right number of clusters.

I tried experimenting with a dataset of mall customers which, among other things, has their annual income and spending score.
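Here is a minimal sketch of that experiment, assuming the mall customers CSV has “Annual Income (k$)” and “Spending Score (1-100)” columns (the file and column names are my assumptions):

```python
# Sketch: the elbow method on the mall customers data (names assumed).
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

df = pd.read_csv("Mall_Customers.csv")  # hypothetical file name
X = df[["Annual Income (k$)", "Spending Score (1-100)"]].values

# WCSS for k = 1..10; scikit-learn exposes it as `inertia_` after fitting.
wcss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    km.fit(X)
    wcss.append(km.inertia_)

# The "elbow" is where adding another cluster stops reducing WCSS much.
plt.plot(range(1, 11), wcss, marker="o")
plt.xlabel("Number of clusters k")
plt.ylabel("WCSS (inertia)")
plt.title("Elbow method")
plt.show()
```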

Exploring Analytical Avenues (Week 6 – Wednesday)

To explore several facets of police shootings, I statistically examined the Washington Post data. I considered demographic patterns, where I could look at distributions of age and race to find trends. With the help of the armed status, I could investigate the percentage of incidents where people were armed and perhaps the kinds of weapons used. Is there a link between the use of body cameras and fewer incidents? The role of the police force must be taken seriously.

During the statistical exploration, I need to keep in mind all the ethical concerns and awareness of potential biases in the dataset, aiming for a fair and impartial interpretation of the results.

Week 6 – Monday

I read a shocking article about police shootings in the US from The Washington Post. Since 2015, they have compiled information about the race, the circumstances, and the mental health of every person shot and killed by an on-duty police officer. For improved clarity, the database now includes police agency names and codes. It is routinely updated. The Post’s count is more than double that of the FBI and CDC, highlighting the inadequacy of official statistics. The database focuses on cases comparable to the 2014 killing of Michael Brown in an effort to shine light on police accountability. I am eager to do analysis on the data to obtain more insights.

Report – Towards An Analysis Of Factors Affecting Diabetes Across The U.S.

I’m absolutely delighted to present our project report, which examines in depth data from the Centers for Disease Control and Prevention for 2018. Our attention to the prevalence of diabetes, obesity, and inactivity in US counties has revealed insightful findings. I’m eager to share what we’ve learned after our journey of exploration!

MTH522 Diabetes Report