Information on Random Forests
Random forests are machine learning models that use an ensemble of classification trees (for categorical response variables) or regression trees (for continuous response variables) to make predictions. The term random refers to the two forms of randomness introduced when each tree is fit: each tree is trained on a bootstrap sample of the observations, and at each split within a tree, only a random subset of the predictor variables is considered.
Typically, many trees (such as 500) are trained and make up the forest. To get a prediction, the random forest obtains a prediction from each tree and either takes the majority vote across trees (for classification) or averages the tree predictions (for regression).
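The aggregation step described above can be sketched in a few lines. This is a minimal illustration of majority voting and averaging, not the WhoseEgg code itself (which is written in R); the votes and values are made up:

```python
from collections import Counter

def aggregate_classification(tree_votes):
    """Majority vote across trees for a categorical response."""
    return Counter(tree_votes).most_common(1)[0][0]

def aggregate_regression(tree_predictions):
    """Average across trees for a continuous response."""
    return sum(tree_predictions) / len(tree_predictions)

# Hypothetical votes from a small forest of five classification trees
votes = ["Silver", "Silver", "Grass", "Silver", "Bighead"]
print(aggregate_classification(votes))  # "Silver" (3 of 5 votes)

# Hypothetical predictions from five regression trees
print(aggregate_regression([2.1, 1.9, 2.0, 2.2, 1.8]))  # 2.0
```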
The diagram below shows a very simple example of a random forest for classification. The model has four predictor variables and a categorical response variable with three levels (species). The random forest is made up of three trees. The circles in the trees represent the features chosen by the tree, and the rectangles represent the classification at the end of a path. The bold lines represent the paths corresponding to an observation of interest. In a classification example such as this, the random forest returns two quantities: the predicted class (the level receiving the most votes across trees) and the predicted probabilities (the proportion of trees voting for each level).
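Both quantities can be seen in a short sketch using scikit-learn in Python (WhoseEgg itself uses R, so this only illustrates the idea); the data here are randomly generated stand-ins, with four predictors and three species levels as in the diagram:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Hypothetical data: 150 eggs, four predictor variables, three species
X = rng.normal(size=(150, 4))
y = rng.choice(["Bighead", "Grass", "Silver"], size=150)

# Forest of 500 classification trees
forest = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)

new_egg = X[:1]
print(forest.predict(new_egg))        # predicted class: majority vote across trees
print(forest.predict_proba(new_egg))  # proportion of trees voting for each class
```

The probabilities returned by `predict_proba` sum to one across the three species levels, and the predicted class is the level with the largest proportion of votes.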
For more information on random forests, see the following resource: Cutler et al. (2007)
WhoseEgg uses three random forest models (one for each taxonomic level). The models are similar to the augmented models described in Goode et al. (2021) and are based on the models developed in Camacho et al. (2019). The models, the code for training them, and the training data are available on the GitHub repository for WhoseEgg:
Model structures:
Response variables of the random forest models (all three models group Bighead, Grass, and Silver Carp into one category called invasive carp):
Predictor variables:
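The grouping of the three carp species into one response category can be illustrated with a small sketch. The species names come from the text above, but the mapping function and the non-invasive example label are hypothetical (the actual WhoseEgg preprocessing is done in R):

```python
def group_response(species):
    """Collapse Bighead, Grass, and Silver Carp into a single
    'invasive carp' category, as described for the response variables."""
    invasive = {"Bighead Carp", "Grass Carp", "Silver Carp"}
    return "invasive carp" if species in invasive else species

print(group_response("Silver Carp"))  # "invasive carp"
print(group_response("Common Carp"))  # unchanged (hypothetical non-invasive label)
```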