Let’s dive into the statistical nuances of the random forest algorithm. Essentially, a random forest is an ensemble learning technique that works by building a large number of decision trees during training and producing the mean prediction (regression) or the mode of the classes (classification) of the individual trees. To break it down statistically:
- Decision Trees:
- In essence, every tree in the forest is a sequence of binary decisions made on the input data. At each node, the split that most reduces impurity is chosen: entropy or Gini impurity is typically used for classification, and mean squared error for regression. Individual trees are highly sensitive to the training data, which can result in high variance.
- Bootstrapping:
- By training each tree on a bootstrapped sample of the original data, random forests introduce randomness into the model. Bootstrapping is the process of creating several datasets by sampling with replacement from the original data, so any one of them may contain duplicates.
- Feature Randomness:
- The approach takes into account only a random selection of features at each split in a tree, not all of them. By doing this, the trees become more diverse and overfitting is less likely to occur.
- Voting or Averaging:
- For classification, a majority vote among the trees determines the final prediction. For regression, it is the mean of all the trees’ predictions.
- Correlated Trees:
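- Because all trees are built from the same underlying data, their predictions are correlated; how strongly depends on the randomness introduced during tree construction (bootstrapping and feature subsampling). Lower correlation is what helps: for B trees that each have variance σ² and pairwise correlation ρ, the variance of the averaged prediction is ρσ² + (1 − ρ)σ²/B, so adding trees shrinks only the second term, while decorrelating the trees lowers the first.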
- Out-of-Bag Error:
- Because every tree is trained on a bootstrapped sample, some data points are left out of that tree’s training set (roughly 37% of the rows, on average, for a large dataset). Each point can therefore be predicted using only the trees that never saw it, and these out-of-bag predictions can be used to estimate the performance of the random forest without a separate validation set.
- Tuning Parameters:
- The number of trees, the maximum depth of each tree, and the number of features considered at each split are some of the hyperparameters that affect a random forest. Tuning them, for example with cross-validation or the out-of-bag error, can improve the random forest’s performance.
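To make the split criterion from the Decision Trees point concrete, here is a minimal sketch in plain Python of Gini impurity and of choosing the best threshold on a single numeric feature. The function names (`gini`, `best_threshold`) and the toy values are made up for illustration; they are not from my dataset, and a real library does this far more efficiently.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, weighted impurity) of the best 'x <= threshold' split.

    Returns (None, parent impurity) if no split improves on the parent node.
    """
    best = (None, gini(labels))
    # Skip the largest value so that both sides of the split are non-empty.
    for t in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best

# Hypothetical toy data: one numeric feature and two classes.
feature = [18, 22, 25, 40, 47, 60]
target = ["A", "A", "A", "B", "B", "B"]

print(gini(target))                     # 0.5 (two equally frequent classes)
print(best_threshold(feature, target))  # (25, 0.0): a perfectly pure split
```

A tree repeats this search at every node; in a random forest the search is restricted to a random subset of the features each time.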
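The bootstrapping, feature randomness, voting, and out-of-bag points can also be sketched in a few lines of standard-library Python. This is only an illustration of the sampling mechanics, not a full random forest: no trees are actually fitted, and the helper names and the feature names `f1` to `f4` are hypothetical.

```python
import random
from collections import Counter

def bootstrap_indices(n, rng):
    """Draw n row indices with replacement; duplicates are expected."""
    return [rng.randrange(n) for _ in range(n)]

def out_of_bag(n, in_bag):
    """Rows that were never drawn into this tree's bootstrap sample."""
    drawn = set(in_bag)
    return [i for i in range(n) if i not in drawn]

def random_feature_subset(features, k, rng):
    """Features a single split may consider (sampled without replacement)."""
    return rng.sample(features, k)

def majority_vote(predictions):
    """Most common class label among the trees' predictions."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)  # fixed seed so the run is repeatable
n_rows = 1000

in_bag = bootstrap_indices(n_rows, rng)
oob = out_of_bag(n_rows, in_bag)
oob_fraction = len(oob) / n_rows  # tends towards 1/e, about 0.368, for large n

print(len(in_bag), len(set(in_bag)), len(oob))
print(random_feature_subset(["f1", "f2", "f3", "f4"], 2, rng))
print(majority_vote(["A", "B", "A"]))  # A
```

The out-of-bag rows are the ones each tree can be scored on, which is why no separate validation set is strictly needed.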
I used the random forest algorithm to see whether I could predict the ‘armed’ status in incidents from age and race. The Random Forest model achieved an overall accuracy of 57% in predicting the ‘armed’ status. It excelled at identifying instances of ‘gun’ (97% recall, 73% F1-Score) but struggled with the other classes, often scoring 0% precision, recall, and F1-Score. This pattern suggests that the model predicts ‘gun’ for almost every incident, which is what a strong class imbalance would produce, so the classification report points to real limitations in recognizing the other ‘armed’ categories. Improvements may involve hyperparameter tuning, addressing class imbalance, and exploring alternative models. Alternatively, if I combine the many kinds of weapons into a few broader categories before encoding them, the model might perform better.
The Metrics:
- Precision: It represents the ratio of true positive predictions to the total predicted positives. For example, for the class ‘gun,’ the precision is 59%, meaning that 59% of instances predicted as ‘gun’ by the model were actually ‘gun’ incidents.
- Recall (Sensitivity): It represents the ratio of true positive predictions to the total actual positives. For ‘gun,’ the recall is 97%, indicating that the model captured 97% of the actual ‘gun’ incidents.
- F1-Score: It is the harmonic mean of precision and recall. It provides a balanced measure between precision and recall. For ‘gun,’ the F1-Score is 73%.
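As a quick check on these definitions, here is a small sketch that computes the three metrics by hand. The function names and the four-row toy labels are made up for illustration; only the 59% precision and 97% recall figures come from my classification report.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Per-class precision, recall, and F1, treating `positive` as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_score(precision, recall)

def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# The reported 'gun' figures are consistent with each other:
print(round(f1_score(0.59, 0.97), 2))  # 0.73

# Hypothetical toy labels: a class that is never predicted correctly scores 0 everywhere.
y_true = ["gun", "gun", "knife", "gun"]
y_pred = ["gun", "gun", "gun", "knife"]
print(precision_recall_f1(y_true, y_pred, "gun"))    # about (0.667, 0.667, 0.667)
print(precision_recall_f1(y_true, y_pred, "knife"))  # (0.0, 0.0, 0.0)
```

The second toy result mirrors what happened with the minority classes in my report: if a class is never predicted correctly, its precision, recall, and F1-Score are all 0%.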