A basic random forest classification example
In this section we work through a brief example of solving a classification task using a random forest. We'll load the Titanic survivor data from OpenML and attempt to predict whether each passenger survived based on the available features. This is a basic example showing how to use a random forest model with tensorflow-decision-forests, so we won't go deep into the details or apply much pre-processing. If you want to learn more about decision trees, see the relevant topic here.
This is a basic workflow to load, set up, and evaluate a random forest classifier. The steps involved are:
- Data input and test/train split creation
- Instantiate a RandomForest model
- Train the model
- Evaluate the model on the test dataset
First, import the necessary libraries.
Data input and test/train split creation
We'll load a toy dataset using the OpenML connector in scikit-learn, which provides access to a variety of datasets. We'll use the Titanic survivor dataset, which has 14 features and 1309 samples. Note that some features have null values for some samples, but for the sake of simplicity we'll just drop those rows. More information can be found here. We'll also use the train_test_split function from scikit-learn to create our train and test datasets.
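A sketch of the loading and splitting step (the variable names `df`, `train_df`, and `test_df` and the 80/20 split ratio are assumptions for illustration, not from the original):

```python
import pandas as pd
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split

# Fetch the Titanic survivor dataset (1309 samples, 14 features) from OpenML.
titanic = fetch_openml("titanic", version=1, as_frame=True)
df = titanic.frame

# For simplicity, drop rows that contain null values.
df = df.dropna()

# Hold out 20% of the data for testing; the fixed random_state makes the
# split reproducible across re-runs of the script.
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
```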
Here we have split the data into train and test datasets using the train_test_split function from sklearn.model_selection. We also use a fixed random state so that the train and test splits are consistent when we re-run the script.
Instantiate a RandomForest Model and train
Tensorflow-decision-forests has a number of prebuilt models that can be easily instantiated; here we use a RandomForestModel. The model will attempt to map the input features to the target labels; we'll leave the details of how it does this to the relevant page here. We'll set the number of trees to 50, as the default of 300 is quite large for the simple dataset we are using. Similarly, the default maximum depth of 16 is rather deep for this simple problem, so we'll limit it to one less than the number of features.
Evaluate the model on the test dataset
We'll evaluate the model on the test dataset and retrieve the true/false positive/negative counts that form a confusion matrix (although here we'll just print the values to the screen).
```
loss: 0.0
accuracy: 0.7709923386573792
true_positives: 52.0
true_negatives: 150.0
false_positives: 12.0
false_negatives: 48.0
```
Summary
In this topic we've gone through a simple example of building a random forest classifier. Next, have a look at the details behind how decision trees and random forests work:
Introduction to Decision Trees
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.