
Using A Big Data Lens to Understand Gun Violence


Editor’s Note: Looking for a simple, easy-to-understand example of how to use machine learning? Looking to understand how to explore data? In his current blog, Pawan Nandakishore walks us through building a story around a dataset and training a decision tree to predict gun deaths in America. Read on to see some of the interesting conclusions that come to light! – Anshu Malhotra

Decision trees are amazing tools: because they split the data at explicit thresholds on each column of the data matrix, the resulting model is easy to interpret, and they are easy to train, with few hyper-parameters. They do have a major drawback in that they tend to overfit the data. Despite this, as a first example of a machine learning algorithm, decision trees are intuitive and can give valuable insight into the data, which can then be used to build better models. The goal of this analysis is to build a decision tree model on a real dataset.

 

Introduction to the dataset

 

I will use scikit-learn to train the decision tree. As you will see, many more lines are spent preparing the data for training than on the actual training (which really is just one line). The dataset we are going to use can be found at https://data.world/azel/gun-deaths-in-america. It is part of FiveThirtyEight’s “Gun Deaths in America” project and contains a range of information about victims of gun violence. Each row of the dataset records the year and month of the shooting, the intent of the shooter, whether or not police were at the scene, the gender, age, race, and level of education of the victim, and finally, the place where the shooting took place. There is also specific information about whether the victim was Hispanic. We take this dataset and boil it down to predicting just one of two classes: were the victims of the shooting white or African-American? Why ignore the other victim classes (there are 5 in total)? As you will see, all the remaining classes together make up less than 11% of the dataset, and our goal is to build a simple binary classification model. For those who are interested, I would love to work with people to build a model with a more detailed classification, for example on the CDC dataset dealing with multiple causes of death.

The plan for the analysis is as follows:

– Read and display the dataset to see what the relevant columns are.

– Encode certain categorical variables so the decision tree can be run on them.

– Plot some of the categorical variables to see how skewed they are.

– Drop rows containing non-white and non-African-American victims.

– Create test and train sets.

– Train the decision tree.

– Interpret the results of the tree: I will leave this as a set of questions to be explored by an interested reader, who can get further involved with understanding what the model represents.

To start off, here are the first few rows of the dataset:
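A minimal sketch of loading the data and displaying those first rows, assuming the CSV downloaded from data.world is saved locally as guns.csv (the filename is a placeholder):

    import pandas as pd

    # Load the gun-deaths CSV; "guns.csv" is a placeholder name for the
    # file downloaded from data.world.
    df = pd.read_csv("guns.csv")

    print(df.head())    # the first few rows shown above
    print(df.shape)     # number of rows and columns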

 

Next, I will go one by one looking at each feature to better understand the data. I will start with the race column.

To assess the race column, we retrieve the value counts for each of its class labels; each count is the number of victims of that race. Doing so, we observe that a majority of the victims were either African-American or white. We can also convert these counts into percentages and plot them for visualization.
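A sketch of those two steps, assuming the column is named race as in the dataset description:

    import matplotlib.pyplot as plt

    # Count victims per race, then convert the counts to percentages.
    race_counts = df["race"].value_counts()
    race_pct = 100 * race_counts / race_counts.sum()
    print(race_pct)

    race_pct.plot(kind="bar")
    plt.ylabel("Percentage of victims")
    plt.tight_layout()
    plt.show()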

From the bar plot above, it is obvious that a majority of the data is for black or white victims. Here, I would like to point out that we could also set this up as a multiclass problem by sampling the same number of rows as there are Hispanic victims, creating a 3-class classification problem. I chose to stick with a binary classification problem, since this is more for demonstration purposes than a hardcore analysis of the dataset.

Personally, I believe that the complete version of this dataset deserves a more detailed analysis, so that we can spot trends in what kinds of gun deaths or crimes are prevalent. If we can build a good prediction model on that basis, it can hopefully help law enforcement understand the nature of the problem better. This really is the whole point of the data-driven approach. But I digress.

I ended up visualizing most of the categorical columns in the dataset, since there really aren’t too many of them. One could put them all in a subplot, but I wanted the x- and y-axes to be legible.

We find that suicide and homicide are the most commonly observed causes of death; accidents are rare in comparison. This tells us that we must find better ways of addressing both of these. Next, we look at the police column.

We find that in almost all cases there is no police presence. As a general observation, we know that crimes are committed more frequently in places where police are not already present. Looking at the previous feature, “intent”, we can also infer that neither suicide nor homicide is easily preventable by increasing police presence. This may explain why the police are not present during most shootings.

Looking at shootings by gender, we find that the majority of victims are men rather than women. The gender of the perpetrators would also add value to this analysis.

Next, we look at the education level of the victims. We find that most victims tend to possess lower levels of education. We could perhaps better understand the intent behind shootings by examining correlations between education level and intent.

Another important feature is the place where these shootings occur.

Here we find that most of the shootings occur at home, and this perhaps explains the imbalance in the presence of police at the site of the shooting.

Next, we look at month and year columns and find that there is no real spike in shootings at any specific time of the year. It could just be that the dataset does not have the resolution to capture such a phenomenon.

Given the above, we can drop the police column, since it will not contribute to the classification. One could keep it as well; I do not expect it to make much of a difference, but since it is insignificant to the current analysis, it is better to leave it out. Other columns, such as ‘sex’, are also fairly skewed, but we shall keep them.

Just to summarize the above points:

– Most victims are male.

– Most gun deaths occur at home; about 10% occur on the streets, and these could involve either lone-wolf or gang-based shooters.

– The number of shootings is fairly uniform across the year.

– There are fewer victims with higher education levels. For example, levels 4 and 5 together make up less than 15% of all the victims, while education levels 1 and 2 account for over 60%. However, we need more data to understand the reasons behind this.

Now that we have some understanding of the data, we make the following modifications to the dataset (a code sketch follows below):

  1. Convert the categorical variables to integer labels (0, 1, 2, etc.)
  2. Drop rows for victims who are neither white nor African-American, since we are setting this up as a binary classification problem
  3. Drop the police column, since its large class imbalance means it adds no value to the analysis

We then do a 70%/30% split into training and test sets and train a decision tree using scikit-learn.
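Here is a minimal sketch of these preprocessing steps, assuming the column names described above; the exact label strings and encoding choices in the notebook may differ:

    from sklearn.model_selection import train_test_split

    # Keep only white and African-American victims. The label strings are
    # assumptions; check df["race"].unique() for the exact spellings.
    df2 = df[df["race"].isin(["White", "Black"])].copy()

    # Drop the heavily skewed police column and rows with missing values.
    df2 = df2.drop(columns=["police"]).dropna()

    # Encode the categorical columns as integer labels (0, 1, 2, ...).
    for col in ["intent", "sex", "place", "race"]:
        df2[col] = df2[col].astype("category").cat.codes

    # 70/30 train/test split.
    X = df2.drop(columns=["race"])
    y = df2["race"]
    train_X, test_X, train_y, test_y = train_test_split(
        X, y, test_size=0.3, random_state=42)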

Training a decision tree

This is where I use the decision tree classifier. As you can see, it is, oh so conveniently, a one-liner. The real work happened above, in building the training and test sets; once we have those, we can throw almost any kind of classifier at them. We make predictions using the model’s predict function. Accuracy is easy to calculate: compare y_preds to test_y element-wise and sum up the matches. Recall and precision are also straightforward; I will refer you to the Wikipedia article on precision and recall (https://en.wikipedia.org/wiki/Precision_and_recall), which explains them nicely. We get an accuracy of about 87 percent, which is not too shabby for a single decision tree. One thing I did notice is that a shallow tree (in my case, 8 levels deep) works better than a deep one, since deeper trees have a greater chance of overfitting the data.

Since this is a binary classification problem, we can use recall and precision scores as well. I get a recall of 92% and a precision of 91%.
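Putting the training and the metrics together, a sketch of what this looks like (the notebook is the authoritative version; accuracy_score, recall_score, and precision_score are scikit-learn’s standard helpers):

    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score, precision_score, recall_score

    # The training itself: a shallow tree to limit overfitting.
    clf = DecisionTreeClassifier(max_depth=8, random_state=42)
    clf.fit(train_X, train_y)

    # Predict on the held-out set and score the model.
    y_preds = clf.predict(test_X)
    print("accuracy :", accuracy_score(test_y, y_preds))
    print("recall   :", recall_score(test_y, y_preds))
    print("precision:", precision_score(test_y, y_preds))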

We can also visualize the decision tree itself; please refer to the link to the Jupyter notebook at the bottom of the article, which takes you to the original code for generating a PNG of the tree. Such a visualization is helpful in understanding what kinds of splits the decision tree makes in order to arrive at the final result.
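As one possible sketch, scikit-learn’s built-in plot_tree can render the trained tree directly (the notebook exports a PNG via graphviz instead):

    import matplotlib.pyplot as plt
    from sklearn.tree import plot_tree

    # Draw only the top levels; the full depth-8 tree is hard to read.
    plt.figure(figsize=(20, 10))
    plot_tree(clf, feature_names=list(X.columns),
              class_names=["Black", "White"],  # assumes Black -> 0, White -> 1
              filled=True, max_depth=3)
    plt.show()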

So now we have a decision-tree-based model that can use a few features to predict whether the victim of a gun death is black or white. What next? We can use the tree to figure out which features matter most in deciding the race of the victim. The classifier’s feature_importances_ attribute gives us an array that scores each of the features; the array sums to 1. Using this, we can decide which features we should pay attention to. Let us also extract the column names of the test or training dataset so we can make a bar plot of the feature importances.
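A minimal sketch of that bar plot, reusing clf and X from the sketches above:

    import pandas as pd
    import matplotlib.pyplot as plt

    # feature_importances_ sums to 1; pair each score with its column name.
    importances = pd.Series(clf.feature_importances_, index=X.columns)
    importances.sort_values(ascending=False).plot(kind="bar")
    plt.ylabel("Feature importance")
    plt.tight_layout()
    plt.show()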

It is interesting that for black and white victims, the deciding factor seems to be the intent, with the other three important factors being age, gender, and place. Education level, surprisingly, seems to have very little effect on deciding the race of the victim. One could extend the analysis further by studying which class values in the intent column are relevant to the race of the victims. For example, FiveThirtyEight has articles on how suicides are prevalent among middle-aged men (https://fivethirtyeight.com/features/suicide-in-wyoming/) and homicides are prevalent among black men (https://fivethirtyeight.com/features/homicide-in-new-orleans/). The latter is a well-documented phenomenon: the prevalence of gang violence, drug-related killings, and aggression-driven gun deaths has caused homicide rates in cities like New Orleans and Baltimore to be very high.

For technical information –

The details of the analysis are in the Jupyter notebook linked below. Please do check it out. Further references are also listed here.

Jupyter Notebook – https://github.com/pawan-nandakishore/Playing_around/blob/master/ATF_analysis/Final_538_dataset_analysis-stat-1.ipynb

Gini criterion – Elements of Statistical Learning

Decision tree classifier – http://scikit-learn.org/stable/modules/tree.html

The scikit-learn documentation is actually pretty decent.

CDC deaths due to multiple causes – https://www.cdc.gov/nchs/data_access/VitalStatsOnline.htm#Mortality_Multiple                            

                           

 

About the Author:

Pawan Nandakishore is an Insight Data Science Fellow. He holds a PhD from the Max Planck Institute for Dynamics and Self-Organization. His background is in experimental physics, with a specialization in soft matter. Do feel free to write to him at pawan.nandakishore@gmail.com. He is an avid gamer and enjoys sci-fi movies.

 

Reviewed By:

Abhik Seal is a Senior Data Scientist at AbbVie and has extensive experience in data science and cheminformatics. Abhik’s goal is to impact therapeutic development by designing scalable computational methodologies that integrate heterogeneous data types from multiple biological scales to generate a global picture of the mode of action of small molecules in biological systems. He is particularly interested in linking chemical structure information to molecular, bibliographic, genomic, and clinical covariates to enable the study of polypharmacology and repurposing. He provides mentoring to CSGians interested in the field of data science.

Edited by:

 

Anshu Malhotra is an assistant scientist at Emory University, and she is strongly involved in coordinating the activities of CSG’s flagship mentor-mentee program (Gurukool). She is actively involved in bench-based research in pediatric oncology and is strongly interested in developing skills in data science. CSG’s current venture, the Data Science Club, is Anshu’s latest passion, and she hopes that this platform will bring more life scientists together to train themselves and network in this budding new profession. In her spare time, she dabbles in 3D mural artwork.

 

Cover Image: Wikimedia By Camelia.boban – Own work, CC BY-SA 3.0, and Pixabay

 

The contents of Club SciWri are the copyright of PhD Career Support Group for STEM PhDs (A US Non-Profit 501(c)3, PhDCSG is an initiative of the alumni of the Indian Institute of Science, Bangalore. The primary aim of this group is to build a NETWORK among scientists, engineers and entrepreneurs).

This work by Club SciWri is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License

                          
