Capstone project

My Capstone Project — Car Accident Severity


Introduction: Business Problem 

This project investigates car accident severity using the provided car accident records from Seattle.

Many factors influence an accident, but this project cannot account for all of them because the records are limited and computation is costly. The study therefore considers two groups of factors: objective conditions (e.g. weather and light) and subjective conditions (e.g. speeding and driving under the influence of drugs or alcohol). Following the severity categories in the records, two outcomes are considered: property damage and injury.

This project analyzes the correlation between those factors and car accident severity, and then provides a model that predicts severity from the given objective and subjective conditions.

Data 

Following the objective stated in the introduction, the data covers both the objective and the subjective conditions that contribute to car accident severity.

Objective condition:

  • WEATHER: A description of the weather conditions during the time of the collision.
  • LIGHTCOND: The light conditions during the collision.
  • ADDRTYPE: Description of the general location of the collision.
  • VEHCOUNT: The number of vehicles involved in the collision.

Subjective condition:

  • SPEEDING: Whether or not speeding was a factor in the collision.
  • UNDERINFL: Whether or not a driver involved was under the influence of drugs or alcohol.

Target:

  • SEVERITYCODE: A code that corresponds to the severity of the collision (1 –> Property damage; 2 –> Injury).

All records for these features are used to build a classification model that predicts accident severity from the given conditions.

Methodology 

This project focuses on predicting car accident severity: given the objective and subjective conditions, the model predicts the severity class (property damage or injury).

In the first step, the required data, including all those conditions and the associated accident severity, is collected from the project website. The data is then inspected and cleaned to keep only the useful information.

The second step uses the cleaned data to examine the correlation between those factors and accident severity. The classification model is also built in this step: it is trained and tested on the recorded data, and its prediction accuracy is reported.

Results and Discussion 

The initial data set is very large. As shown in Fig. 1, the records contain a total of 38 features. Because of limited computational resources, only a few features are selected to predict car accident severity.

image-2.png

Fig. 1: Head of the records

As discussed in the previous section, six features are selected: WEATHER, LIGHTCOND, ADDRTYPE, VEHCOUNT, SPEEDING, and UNDERINFL, as shown in Fig. 2. The first four describe objective conditions, which the driver cannot control. The last two (SPEEDING and UNDERINFL) are under the driver's control and are therefore treated as subjective conditions.

image-3.png

Fig. 2: Selected features

Data cleaning consists of selecting those features and replacing the null values in the data frame. In the current study, the “Other” and “Unknown” values of the WEATHER feature are combined as “UnknowWeather”. A similar process is applied to the LIGHTCOND and ADDRTYPE features. Furthermore, the values of the VEHCOUNT, SPEEDING, and UNDERINFL features are converted to integers for simplicity.
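The cleaning steps described above can be sketched with pandas. This is a minimal illustration, not the project's actual code: the toy rows, the `clean` helper, the placeholder labels for LIGHTCOND and ADDRTYPE, and the rule that a flag counts as set when it holds "Y" or "1" are all assumptions made for the example.

```python
import pandas as pd

# Toy rows standing in for the collision records (values are illustrative only).
raw = pd.DataFrame({
    "WEATHER": ["Clear", "Other", "Unknown", None, "Raining"],
    "LIGHTCOND": ["Daylight", "Unknown", None, "Dusk", "Other"],
    "ADDRTYPE": ["Block", "Intersection", None, "Block", "Alley"],
    "VEHCOUNT": [2, 2, 3, 1, 2],
    "SPEEDING": ["Y", None, None, "Y", None],
    "UNDERINFL": ["N", "0", "1", "Y", None],
    "SEVERITYCODE": [1, 2, 1, 2, 1],
})

def clean(df):
    """Hypothetical helper: merge unknown categories and encode flags as integers."""
    out = df.copy()
    # "UnknowWeather" is the label named in the text; the other two are assumed.
    placeholders = {
        "WEATHER": "UnknowWeather",
        "LIGHTCOND": "UnknowLight",
        "ADDRTYPE": "UnknowAddr",
    }
    for col, label in placeholders.items():
        # Nulls, "Other" and "Unknown" all collapse into one placeholder label.
        out[col] = out[col].fillna(label).replace({"Other": label, "Unknown": label})
    # Assumed encoding: a flag is 1 when the raw value is "Y" or "1", else 0.
    for col in ["SPEEDING", "UNDERINFL"]:
        out[col] = out[col].astype(str).isin(["Y", "1"]).astype(int)
    out["VEHCOUNT"] = out["VEHCOUNT"].astype(int)
    return out

cleaned = clean(raw)
print(cleaned)
```

Treating nulls the same way as "Other" and "Unknown" is one simple choice; dropping those rows instead would be an equally plausible reading of the text.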

Before classification, the influence of each feature is examined with bar plots. Fig. 3 shows the number of vehicles involved in a collision under different weather conditions. The majority of car accidents occur on a clear day and involve two cars. Accidents under severe weather are rare by comparison. Moreover, when the number of cars involved rises to three, most accidents result only in property damage, which indicates that accidents involving three cars tend to be less severe than those involving two.

image-4.png

Fig. 3: Impact of VEHCOUNT in different weather conditions

Fig. 4 shows the number of vehicles involved in a collision under different light conditions. The majority of accidents occur in daylight. At dusk or dawn, however, the severity shifts toward injury.

image-5.png

Fig. 4: Impact of VEHCOUNT in different light conditions

Fig. 5 shows the number of vehicles involved in a collision for different address types. Accidents occur most often at block locations, where the share of injury accidents is relatively low. By comparison, accidents at intersections have a relatively high probability of injury.

image-6.png

Fig. 5: Impact of VEHCOUNT in different address type conditions

Fig. 6 and Fig. 7 show car accident severity with respect to speeding and driving under the influence (DUI). Although the majority of cars involved in accidents were neither speeding nor under the influence, the probability of injury increases when speeding or DUI is involved.

image-7.png

Fig. 6: Impact of SPEEDING

image-8.png

Fig. 7: Impact of UNDERINFL
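The comparison behind Fig. 6 and Fig. 7 comes down to the share of injury collisions within each value of a feature. The sketch below shows one way to compute it with pandas; the `injury_rate` helper and the eight toy rows are hypothetical and do not reproduce the project's numbers.

```python
import pandas as pd

# Made-up records: SPEEDING flag and severity (1 = property damage, 2 = injury).
records = pd.DataFrame({
    "SPEEDING":     [0, 0, 0, 0, 1, 1, 0, 1],
    "SEVERITYCODE": [1, 1, 2, 1, 2, 2, 1, 1],
})

def injury_rate(df, feature):
    """Share of collisions with SEVERITYCODE == 2 for each value of `feature`."""
    is_injury = df["SEVERITYCODE"].eq(2)
    # The mean of a boolean series is the fraction of True values in each group.
    return is_injury.groupby(df[feature]).mean()

rates = injury_rate(records, "SPEEDING")
print(rates)
```

The same helper applies unchanged to UNDERINFL, ADDRTYPE, or any other categorical feature, which makes it easy to compare injury shares rather than raw counts.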

Classification models are employed to predict car accident severity under different combinations of features. Because the records contain a large number of samples, the decision tree is the first choice, since it classifies very quickly. For comparison, a logistic regression model is also used.
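A minimal scikit-learn sketch of fitting and timing the two model types is given here for orientation. Everything specific in it is an assumption: the synthetic feature matrix, the tree depth, the iteration limit, the random seeds, and the 30% hold-out, which is only one reading of a 0.3 split rate. It does not reproduce the project's results.

```python
import time

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for six encoded features; not the Seattle records.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1000, 6))
# Severity 2 (injury) is made more likely when the fifth feature is set.
y = np.where(rng.random(1000) < 0.2 + 0.4 * X[:, 4], 2, 1)

# Assumption: hold out 30% of the samples for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

models = {
    "decision tree": DecisionTreeClassifier(max_depth=4, random_state=0),
    "logistic regression": LogisticRegression(max_iter=1000),
}

results = {}
for name, model in models.items():
    start = time.perf_counter()
    model.fit(X_train, y_train)
    fit_seconds = time.perf_counter() - start
    # score() returns mean accuracy on the held-out samples.
    results[name] = {
        "fit_seconds": fit_seconds,
        "accuracy": model.score(X_test, y_test),
    }

for name, stats in results.items():
    print(name, stats)
```

Timing each `fit` call separately is what allows training cost to be weighed against the scores when choosing between the two models.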

Both models need training and testing samples, so the entire sample set is divided into a training set and a testing set with a split ratio of 0.3. The results are shown in Fig. 8. The decision tree has the higher Jaccard score, whereas the logistic regression model has the higher F1-score. Taking computational time into account, the decision tree model is recommended because it needs less time for training.

image-9.png

Fig. 8: Training and prediction results
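For reference, the two scores compared in Fig. 8 can be written in terms of true positives (TP), false positives (FP), and false negatives (FN) for one class: Jaccard = TP / (TP + FP + FN) and F1 = 2TP / (2TP + FP + FN). The pure-Python sketch below computes both for the injury class on made-up labels; the function names and the example labels are illustrative only.

```python
def confusion_counts(y_true, y_pred, positive):
    """Count true positives, false positives and false negatives for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return tp, fp, fn

def jaccard(y_true, y_pred, positive):
    tp, fp, fn = confusion_counts(y_true, y_pred, positive)
    denom = tp + fp + fn
    return tp / denom if denom else 0.0

def f1(y_true, y_pred, positive):
    tp, fp, fn = confusion_counts(y_true, y_pred, positive)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Made-up labels: 1 = property damage, 2 = injury.
y_true = [1, 1, 1, 2, 2, 1, 2, 1]
y_pred = [1, 1, 2, 2, 1, 1, 2, 1]

j = jaccard(y_true, y_pred, positive=2)
f = f1(y_true, y_pred, positive=2)
print(j, f)
```

For a single class the two scores are tied by F1 = 2J / (1 + J), so they rank models identically. A split ranking such as the one reported therefore suggests that the two scores were computed over different classes or averaged differently; the report does not state which.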

Conclusion 

In this Capstone Project, a decision tree model and a logistic regression model are used to classify car accident severity with the provided data. The results show that speeding and driving under the influence (DUI) increase car accident severity: the injury rate is higher for accidents involving speeding or DUI than for those without.

The classification is carried out with two models, the decision tree and logistic regression. Both can predict car accident severity under different feature combinations. Based on the Jaccard score, the decision tree model is slightly better than the logistic regression model, although logistic regression has the higher F1-score. Furthermore, the decision tree needs much less computation time than logistic regression, so it is the model recommended for this project.