San Francisco Crimes Classification

Project overview

I built a machine learning model in R that classifies the type of crime from the time and location where it was committed. The project is part of a Kaggle competition and is based on a publicly available dataset: https://www.kaggle.com/competitions/sf-crime.

Code: https://github.com/Nik-Kras/Kaggle-San-Francisco-Crime

Business Value

A model that classifies the type of crime from time and location can be used to recover damaged or missing records. It is expected to do so more accurately than filling the fields with mean or median values.

Moreover, the model could be developed into a much more valuable product: an AI that estimates when and where a specific type of crime is likely to happen. Such a model could help police reduce the number of crimes and, more importantly, prevent many of them by being in the right place at the right time.

Technical details

I applied KNN for clustering and Random Forest for classification to the Kaggle dataset. The dataset consists of 878,049 training samples and 884,262 test samples, so its size calls for memory-efficient processing. It contains 39 types of crime, which makes the task a 39-class classification problem, and the class distribution is highly skewed: around 20% of crime types account for 80% of all cases.
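The skew described above can be checked by counting how many of the most frequent classes are needed to cover a given share of the samples. The following is a minimal Python sketch for illustration only (the project itself is written in R); the function name `classes_covering` and the toy labels are assumptions, not taken from the project or the dataset.

```python
from collections import Counter

def classes_covering(labels, share=0.8):
    """Return (number of top classes needed to cover `share` of samples, total classes)."""
    counts = Counter(labels)
    total = sum(counts.values())
    covered = 0
    # Walk through classes from most to least frequent, accumulating their counts.
    for rank, (_, n) in enumerate(counts.most_common(), start=1):
        covered += n
        if covered >= share * total:
            return rank, len(counts)
    return len(counts), len(counts)

# Toy labels, made up for illustration (not real crime categories or counts).
toy = ["A"] * 70 + ["B"] * 15 + ["C"] * 10 + ["D"] * 5
top, n_classes = classes_covering(toy)
print(f"{top} of {n_classes} classes cover 80% of the samples")
```

Applied to the real crime-type column, the same count would show whether roughly a fifth of the 39 classes covers 80% of the cases.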

Number of crimes by crime type

The Figure to the left shows the distribution of crime types, that is, of the target variable. There are 39 classes, or unique types of crime, but some appear far more frequently than others. For example, larceny/theft is recorded far more often than assault.

I checked how different feature engineering techniques affect the results. I applied KNN clustering to the coordinates and used the resulting clusters as additional features. My experiments showed that around 5 clusters works best for model performance. I also examined feature importance, as shown in the Figure below.
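To illustrate the idea of turning coordinates into a cluster feature, here is a minimal Python sketch (the project itself is written in R, and this is not its code). The text names KNN; grouping points into a fixed number of clusters is commonly done with k-means, so the sketch uses k-means as an assumption. The function names and the toy coordinates are hypothetical.

```python
import random

def nearest(point, centers):
    """Index of the center closest to `point` (squared Euclidean distance)."""
    return min(
        range(len(centers)),
        key=lambda c: (point[0] - centers[c][0]) ** 2 + (point[1] - centers[c][1]) ** 2,
    )

def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means on 2D points; returns (centers, cluster label per point)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k randomly chosen points
    for _ in range(iterations):
        labels = [nearest(p, centers) for p in points]
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return centers, [nearest(p, centers) for p in points]

# Toy (longitude, latitude) pairs, made up for illustration.
coords = [
    (-122.420, 37.770), (-122.421, 37.771), (-122.419, 37.769),
    (-122.390, 37.730), (-122.391, 37.731), (-122.389, 37.729),
]
centers, cluster_id = kmeans(coords, k=2)
print(cluster_id)  # one cluster id per record, usable as an extra feature
```

The resulting cluster id can be appended to each record as a categorical feature; the number of clusters (around 5 in the experiments above) is a tuning parameter.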

Feature importance Figure

At first glance it might not seem informative, but it shows that Minutes is the most important time feature for classifying the type of crime. Further investigation showed that this is not due to special crime patterns but to how crimes are recorded. When a minor crime is committed, police usually round the time to 0, 15, 30 or 45 minutes, but when the crime is serious, like a murder, they record the minutes precisely. Therefore, if the data is corrupted, the most important feature to look at is Minutes. The second most important feature is the place, meaning that different places have different distributions of crime types.
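The rounding effect can be exposed to a model directly by extracting the minute and a flag for quarter-hour values. Below is a minimal Python sketch for illustration (the project itself is written in R); the timestamp format `YYYY-MM-DD HH:MM:SS`, the function name and the feature names are assumptions.

```python
from datetime import datetime

def minute_features(timestamp):
    """Extract the minute and a flag telling whether it falls on 0, 15, 30 or 45."""
    dt = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")  # assumed format
    return {"minute": dt.minute, "is_quarter_hour": dt.minute % 15 == 0}

# Hypothetical timestamps: one rounded to the half hour, one recorded precisely.
print(minute_features("2015-05-13 23:30:00"))
print(minute_features("2015-05-13 23:53:00"))
```

A tree-based model can learn this split from the raw minute on its own, so the explicit flag mainly makes the recording pattern easier to inspect.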

Results

I built a Random Forest model that predicts the type of crime from a given time and location. I applied feature engineering, including positional clustering with KNN, to improve the results. I then used Random Forest to obtain feature importance and performed crime classification with SVM and Random Forest models. The model achieved a log loss of 2.907, while the best-known solution on Kaggle has a log loss of 1.959.
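For context on these scores, multi-class log loss is the average negative log of the probability assigned to the true class, so lower is better. The following Python sketch is for illustration only (the project itself is written in R); the function name and the clipping constant `eps` are assumptions. It computes the score of a uniform guess over 39 classes as a reference point.

```python
import math

def multiclass_log_loss(true_labels, predicted_probs, eps=1e-15):
    """Average negative log-probability assigned to the true class of each sample."""
    total = 0.0
    for label, probs in zip(true_labels, predicted_probs):
        p = min(max(probs[label], eps), 1 - eps)  # clip to avoid log(0)
        total -= math.log(p)
    return total / len(true_labels)

# A model that assigns 1/39 to every class, regardless of input.
uniform = {c: 1 / 39 for c in range(39)}
baseline = multiclass_log_loss([0, 5, 38], [uniform] * 3)
print(round(baseline, 3))  # ln(39), about 3.664
```

Against this uniform baseline of about 3.664, a log loss of 2.907 shows that the model extracts real signal from time and location, while 1.959 indicates how much room for improvement remains.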

Future Work

Many more feature engineering techniques could be applied to improve performance. One of the most promising ideas is to replace positional clustering with space-time clustering. This makes sense because certain types of crime happen in a specific area only within a certain time range, like robbery, which usually happens at night in dangerous areas. It also makes sense to try different ML models, such as neural networks. Overall, the project has many possible directions for further development.
