A basic random forest classification example
In this section we work through a brief example of solving a classification task using a random forest. We'll load the Titanic survivor data from OpenML and attempt to predict whether each passenger survived based on the available features. This is a basic example showing how to use a random forest model with tensorflow-decision-forests, so we won't go deep into the details or apply much pre-processing. If you want to learn more about decision trees, see the relevant topic here.
This is a basic workflow to load, set up, and evaluate a random forest classifier. The steps involved are:
- Data input and test/train split creation
- Instantiate a RandomForest model
- Train the model
- Evaluate the model on the test dataset
First, import the necessary libraries.
Data input and test/train split creation
We'll load a toy dataset using the OpenML connector in scikit-learn, which provides access to a variety of datasets. We'll use the Titanic survivor dataset, which has 14 features and 1309 samples. Note that some features have null values for some samples, but for the sake of simplicity we'll just drop those rows. More information can be found here. We'll also use the train_test_split function from scikit-learn to create our train and test datasets.
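A sketch of the loading and splitting step (the variable names `df`, `train_df`, and `test_df` and the 80/20 split ratio are assumptions for illustration, not from the original):

```python
import pandas as pd
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split

# Fetch the Titanic survivor dataset (1309 samples, 14 features) from OpenML.
titanic = fetch_openml("titanic", version=1, as_frame=True)
df = titanic.frame

# For simplicity, drop rows that contain null values.
df = df.dropna()

# Hold out 20% of the data for testing; the fixed random_state makes the
# split reproducible across re-runs of the script.
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
```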
Here we have split the data into train and test datasets using the train_test_split function from sklearn.model_selection. We also use a fixed random state so that the train and test splits are consistent when we re-run the script.
Instantiate a RandomForest Model and train
Tensorflow-decision-forests has a number of prebuilt models that can be easily instantiated; here we use a RandomForestModel. The model will attempt to map the input features to the target labels; we'll leave the details of how it does this to the relevant page here. We'll set the number of trees to 50, as the default of 300 is quite large for the simple dataset we are using. Similarly, the default maximum depth of 16 is rather deep for this simple problem, so we'll limit it to one less than the number of features.
Evaluate the model on the test dataset
We'll evaluate the model on the test dataset and retrieve the true/false positive/negative counts that form a confusion matrix (although here we'll just print the values to the screen).
```
loss: 0.0
accuracy: 0.7709923386573792
true_positives: 52.0
true_negatives: 150.0
false_positives: 12.0
false_negatives: 48.0
```
Summary
In this topic we've gone through a simple example of building a random forest classifier. Next, have a look at the details behind how decision trees and random forests work:
Introduction to Decision Trees
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.