General Information

Random forests are machine learning models that use an ensemble of classification trees (with categorical response variables) or regression trees (with continuous response variables) to provide predictions. The term random is used because two forms of randomness are introduced when a tree is fit:

  1. Each tree in the ensemble is trained using an independent random bootstrap sample from the training data.
  2. When a variable is being chosen for a split in a tree, only a randomly selected subset of the predictor variables is considered. For example, when the WhoseEgg models were trained, the number of predictor variables considered at a split was equal to the square root of the total number of predictor variables.
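
The two sources of randomness above can be sketched in a few lines of Python. This is only an illustration, not the code used to train the WhoseEgg models: the function names, the 10 rows, and the 16 predictors are hypothetical, and rounding the square root down to a whole number is an assumption.

```python
import math
import random


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement: one bootstrap sample per tree."""
    return [rng.choice(rows) for _ in rows]


def candidate_predictors(predictors, rng):
    """Pick the random subset of predictors considered at a single split."""
    # Assumption: the square root is rounded down to a whole number.
    k = max(1, math.floor(math.sqrt(len(predictors))))
    return rng.sample(predictors, k)


rng = random.Random(0)                         # fixed seed for repeatability
rows = list(range(10))                         # hypothetical training row indices
predictors = [f"x{i}" for i in range(1, 17)]   # 16 hypothetical predictors

sample = bootstrap_sample(rows, rng)           # may contain repeated rows
subset = candidate_predictors(predictors, rng) # 4 of the 16 predictors
```

Because the bootstrap sample is drawn with replacement, some rows appear more than once and others not at all, so each tree sees a slightly different version of the training data.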

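Typically, many trees (such as 500) are trained to make up the forest. To get predictions, the random forest obtains a prediction from each tree and either takes the most common predicted class across the trees (classification) or averages the tree predictions (regression).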

The diagram below shows a very simple example of a random forest for classification. The model has four predictor variables and a categorical response variable with three levels (species). The random forest is made up of three trees. The circles in the trees represent the predictor variables chosen for the splits, and the rectangles represent the classification at the end of a path. The bold lines represent the paths corresponding to an observation of interest. In a classification example such as this, the random forest returns two quantities:

  1. A probability for each response variable level.
    • In the example below, the probability for species 1 is 2/3 since two of the three trees returned a prediction of species 1.
  2. A prediction.
    • In the example below, the prediction is species 1 since it is the species with the highest random forest probability.
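
The two quantities can be computed from the individual tree predictions as in the following minimal Python sketch. It assumes the probability is the share of trees voting for each level, as in the example; the function name forest_vote is hypothetical, and the third tree's vote (species 2) is made up for illustration, since the text only states that two of the three trees predicted species 1.

```python
from collections import Counter


def forest_vote(tree_predictions):
    """Return (probabilities, prediction) from a list of per-tree class labels."""
    counts = Counter(tree_predictions)
    n_trees = len(tree_predictions)
    # Probability of a level = fraction of trees that predicted it.
    probabilities = {label: count / n_trees for label, count in counts.items()}
    # Prediction = level with the highest probability
    # (ties go to the level that was seen first).
    prediction = max(probabilities, key=probabilities.get)
    return probabilities, prediction


# Two trees vote for species 1; the third vote is a hypothetical placeholder.
votes = ["species 1", "species 1", "species 2"]
probs, pred = forest_vote(votes)
```

Levels that receive no votes (species 3 here) are simply absent from the dictionary, which corresponds to a probability of zero.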

For more information on random forests, see the following resource: Cutler et al. (2007)

Random Forests in WhoseEgg

WhoseEgg uses three random forest models (one for each taxonomic level). The models are similar to the augmented models described in Goode et al. (2021) and are based on the models developed in Camacho et al. (2019). The models, the code for training them, and the training data are available on the GitHub repository for WhoseEgg:

Model structures:

Response variables of random forest models (all three group Bighead, Grass, and Silver Carp as one category called invasive carp):

Predictor variables: