1. Installation

We will begin by installing devtools, an R package that supports the development and installation of other R packages. Devtools can install a package directly from a GitHub repository, which is how we will install CAMI.

## install devtools with the line below if you do not already have it
install.packages("devtools")
library(devtools)

## install CAMI from github using devtools with the line below
devtools::install_github("ruffleymr/CAMI")
library("CAMI")

To see the functions available in CAMI, or any R package for that matter, type the name of the package followed by :: and use tab completion. If we do this for CAMI (CAMI::), we can see that there are three functions. Use ? before the name of each function to see its help documentation.
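If tab completion is inconvenient, standard base R helpers can list a package's contents (nothing CAMI-specific here):

%%R
## list the functions exported by CAMI (requires library("CAMI") above)
ls("package:CAMI")

## open the help index for the whole package
help(package = "CAMI")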

For a full tutorial using CAMI on an empirical dataset, see the CAMI tutorial.

2. Simulate Data

To simulate community assembly data we will use the SimCommunityAssembly() function, which simulates phylogenetic and phenotypic community assembly data under a species-based model. The function takes several parameters, which can be seen in the help documentation using ?SimCommunityAssembly, and it can simulate anywhere from one to many community assembly datasets, set using the sims argument.

For a single simulation, first, a regional community phylogeny is simulated under a constant birth-death process. CAMI uses the function sim.bd.taxa in the R package TreeSim (Stadler 2011).

  • N determines the size of the regional species pool
  • lambda is the speciation rate in species/million years
  • eps is the extinction fraction, which is used to calculate the extinction rate from the speciation rate
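CAMI performs this step internally, but for intuition, here is a minimal sketch of the regional phylogeny simulation using TreeSim directly; the parameter values are arbitrary examples:

%%R
## illustration only: CAMI calls TreeSim::sim.bd.taxa() internally
library(TreeSim)
N <- 100            ## regional pool size
lambda <- 1.0       ## speciation rate (species/million years)
eps <- 0.5          ## extinction fraction
mu <- eps * lambda  ## extinction rate derived from the extinction fraction
regional.tree <- sim.bd.taxa(n = N, numbsim = 1,
                             lambda = lambda, mu = mu)[[1]]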

Trait evolution is then modeled along this phylogeny under one of two models of trait evolution, Brownian Motion (BM; Felsenstein 1985) and Ornstein-Uhlenbeck (OU; Hansen 1997, Butler & King 2004), determined by the traitsim argument.

  • traitsim is either “BM” or “OU”
  • sig2 is the instantaneous rate of character change for either model
  • alpha controls the strength of pull to the trait optimum and is for OU models only
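Trait evolution is likewise handled inside CAMI; as a sketch of what the two models do, ape's rTraitCont() can simulate both on a toy tree (arbitrary parameter values):

%%R
## illustration only: BM and OU trait simulation on a toy tree with ape
library(ape)
toy.tree <- rcoal(20)                      ## random 20-tip tree
bm.traits <- rTraitCont(toy.tree, model = "BM", sigma = 1)
ou.traits <- rTraitCont(toy.tree, model = "OU", sigma = 1,
                        alpha = 0.5, theta = 0)  ## alpha = pull toward theta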

Once the regional community pool exists, species are assembled into local communities under one of three assembly models: neutral, habitat filtering, or competitive exclusion, set using the comsim argument. The non-neutral assembly models use the phenotypic variation in the regional community, together with phenotypic matching and repulsion equations (for the filtering and competition models, respectively), to determine which species survive in the local community.

  • comsim is either “neutral”, “filtering”, “competition”, or “all”
  • tau controls the strength of the non-neutral assembly processes

For filtering models, the lower tau is, the stronger the environmental filter. For competition models, the higher tau is, the stronger the effects of competition.
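The exact matching and repulsion equations live inside SimCommunityAssembly(); the toy weights below are only meant to convey the direction of tau's effect and are not CAMI's actual equations:

%%R
## toy sketch of tau's direction of effect; NOT CAMI's actual equations
traits <- rnorm(10)  ## hypothetical regional trait values
env.opt <- 0         ## hypothetical environmental optimum
tau <- 5
## filtering-style weight: species near the optimum are favored;
## smaller tau -> sharper filter
filter.w <- exp(-(traits - env.opt)^2 / tau)
## competition-style weight: species far from the community mean are favored;
## larger tau -> stronger repulsion
comp.w <- exp(tau * abs(traits - mean(traits)))
round(filter.w / sum(filter.w), 3)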

For each of these parameters, either a uniform prior distribution over a range of values or a single fixed value can be designated. Many of the parameters have default uniform prior distributions; see the help page for details on each parameter. The sketch below shows both options.
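For example, sig2 can be fixed at a single value or drawn from a designated range (a sketch; the values here are arbitrary):

%%R
## fixed value: every simulation uses sig2 = 1.0
fixed.sig2 <- SimCommunityAssembly(sims = 2, N = 100, local = 50,
                                   traitsim = "BM", comsim = "neutral",
                                   sig2 = 1.0)
## uniform prior: each simulation draws sig2 from U(0.5, 2.0)
prior.sig2 <- SimCommunityAssembly(sims = 2, N = 100, local = 50,
                                   traitsim = "BM", comsim = "neutral",
                                   sig2 = c(0.5, 2.0))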

We will start with one simulation under a neutral model of assembly and BM trait evolution. The regional community pool will consist of 100 species and the local community will have only 50. All other parameters will be kept at their default uniform prior ranges.

## simulate 1 community dataset under neutral assembly and BM trait evolution
## leave lambda, eps, sig2, and tau at their default prior ranges
%%R
Assembly.Data <- SimCommunityAssembly(sims = 1, 
                                     N = 100, 
                                     local = 50,
                                     traitsim = "BM",
                                     comsim = "neutral",
                                     ## do you want summary statistics calculated?
                                     output.sum.stats = TRUE,
                                     ## do you want dispersion stats calculated?
                                     output.phydisp.stats = FALSE)

Note the two additional flags: output.phydisp.stats can be turned on to calculate phylogenetic dispersion metrics for each simulated community, and output.sum.stats can be turned off if summary statistics are not needed.

The output is a list of two data frames. The first contains all parameters and the second contains all summary statistics. In both data frames, each row is a unique simulation, and the rows of the two data frames correspond. The $ operator can be used to look at each data frame within the list.
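A quick way to confirm the structure before printing everything:

%%R
## the list holds two named data frames with one row per simulation
names(Assembly.Data)        ## "params" and "summary.stats"
nrow(Assembly.Data$params)  ## 1 simulation so far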

%%R
Assembly.Data
## $params
##   sim  comsim traitsim   N local lambda     mu   sig2 alpha    tau
## 1   1 neutral       BM 100    50 0.8339 0.2859 3.8814 0.183 2.3082
## 
## $summary.stats
##       Mean.BL   Var.BL   Mean.Tr   Var.Tr    Moran.I      Age   Colless
## [1,] 1.197862 1.464759 -2.751333 16.31271 0.03465375 8.388496 0.1471088
##      Sackin       nLTT     Msig      Cvar       Svar       Shgt      Dcdf
## [1,]    381 0.07205209 3.771697 -5.587071 0.05332621 0.04046575 0.1493878
##             Skw      Krts   MnRegTr  VrRegTr   MnTrDif  VrTrDif   MnRegBl
## [1,] -0.2018474 -0.201342 -2.478311 17.78307 0.2730226 1.470365 0.8811163
##       VarRegBl    MnBlDif   VarBlDif      Amp1       Amp2    BiCoef
## [1,] 0.8899038 -0.3167458 -0.5748553 -3.445615 0.09997727 0.3479025
##      BiRatio Mode1 Mode2
## [1,]       0     0    11

We will look at this output more in depth momentarily. First, let’s simulate much more community data.

Now simulate 500 community datasets under all assembly models. This time we will put prior ranges on the sizes of both the regional species pool and the local community. The remaining parameters are given explicit prior ranges below.

## simulate 500 community datasets under all models of assembly
## set explicit prior ranges on lambda, eps, sig2, alpha, and tau
%%R
Assembly.Data <- SimCommunityAssembly(sims = 500, 
                                     N = c(100,200), 
                                     local = c(20, 70),
                                     traitsim = "BM",
                                     comsim = "all",
                                     lambda = c(0.05, 2.0),
                                     eps = c(0.2, 0.8),
                                     sig2 = c(.05, 3.0),
                                     alpha = c(0.01, 0.2),
                                     tau = c(1, 30))

a) parameters

The parameters are important to record for each simulation because, much like in approximate Bayesian computation (ABC), randomForest relies on the known parameter values of the simulations to estimate those parameters for empirical data.

%%R
## the first matrix is the parameters for each simulation
head(Assembly.Data$params)
##   sim      comsim traitsim   N local lambda     mu   sig2  alpha     tau
## 1   1   filtering       BM 194    62 0.5569 0.1383 0.1220 0.1858  3.5202
## 2   2 competition       BM 170    51 1.9369 1.1178 1.6025 0.1597 18.0324
## 3   3     neutral       BM 175    61 0.8894 0.2923 2.7841 0.1952 13.1444
## 4   4   filtering       BM 101    70 1.0888 0.4571 0.9149 0.0569 10.2444
## 5   5     neutral       BM 141    23 1.1190 0.4585 0.8373 0.1299 29.4758
## 6   6     neutral       BM 130    23 1.4573 0.5589 1.5378 0.0918 14.8794

Note that we record the extinction rate as mu rather than the extinction fraction eps.
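Because mu is calculated from the extinction fraction as eps * lambda, eps can be recovered from the two recorded rates:

%%R
## recover eps for the first few simulations: eps = mu / lambda
head(Assembly.Data$params$mu / Assembly.Data$params$lambda)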

As a sanity check, let’s make sure that we are actually simulating under each model and verify that the parameters drawn match the prior distributions that were designated.

%%R
## this line lists the models used for the first 50 simulations
Assembly.Data$params$comsim[1:50]

## see how many of each model were simulated; change "neutral" to "filtering" and "competition"
sum(Assembly.Data$params$comsim=="neutral")

The number of simulations per model will differ for everyone, but each model should appear roughly 1/3 of the time. It is important that each model is equally represented in the training dataset in order to avoid a biased training dataset and, ultimately, a biased forest.
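A more compact version of the same check counts all three models at once:

%%R
## counts of simulations per assembly model in one call
table(Assembly.Data$params$comsim)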

To verify the prior distributions, we can plot a histogram of all of the values used for the simulations for a given parameter and check that it matches the prior distribution designated. Let’s look at the sig2 parameter, which we put a uniform prior on between 0.05 and 3.0.

%%R
library(ggplot2)
ggplot(data=Assembly.Data$params, aes(x=sig2)) + 
  geom_histogram(binwidth = .1) +
  xlab("sig2")

We should see a uniform distribution between the lower and upper bounds that were set. Have a look at some of the other parameters and verify that the simulated values match what was set; an example for tau follows.
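The same check works for any other parameter; for example, tau, which we gave a uniform prior between 1 and 30:

%%R
## tau should look roughly uniform between its prior bounds of 1 and 30
ggplot(data=Assembly.Data$params, aes(x=tau)) +
  geom_histogram(binwidth = 1) +
  xlab("tau")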

b) summary statistics

The summary statistics capture information about the phylogenetic and phenotypic variation across the regional and local community, as well as the interactions between the phylogenetic and phenotypic information in the local community. Go here for a short description of each summary statistic.

%%R
head(Assembly.Data$summary.stats)
##        Mean.BL    Var.BL    Mean.Tr     Var.Tr      Moran.I       Age
## [1,] 1.7331353 2.0745147  0.2834313  0.6583034 -0.005739081 10.257033
## [2,] 0.8455456 0.6341746 -3.0352360 16.8924959 -0.044112280  6.389019
## [3,] 1.0298415 0.8618600  1.7225224 16.5199804 -0.008571531  7.135549
## [4,] 0.6095664 0.3068363  0.3591960  2.0204512  0.051407299  3.447975
## [5,] 1.3714519 1.9069726 -0.1636043  2.7190371 -0.011219058 10.093176
## [6,] 1.0371354 0.6052696 -2.1180320  3.2361401 -0.044787980  4.065084
##         Colless Sackin       nLTT       Msig       Cvar          Svar
## [1,] 0.14426230    512 0.13863865 0.09121485  -2.993423 -0.0059372453
## [2,] 0.09387755    341 0.13431388 1.83391320   2.167033  0.0402161184
## [3,] 0.13389831    481 0.11551049 2.99609208 -12.593101 -0.0355546907
## [4,] 0.08439898    514 0.06431839 0.91496507  -5.842454  0.0518765725
## [5,] 0.27705628    134 0.15123036 0.52339499 103.411366 -0.0052479954
## [6,] 0.15584416    118 0.27363532 1.20099738  -3.987768 -0.0004501872
##             Shgt      Dcdf         Skw       Krts    MnRegTr   VrRegTr
## [1,]  0.04031444 0.2239556 -0.56985264 -0.4702116  0.4152372  1.181878
## [2,] -0.13073793 0.1552941  0.02349498 -1.4039576 -2.9648682  8.454210
## [3,]  0.03065279 0.1071038 -0.20652959 -0.8059628  1.6814880 15.395220
## [4,]  0.15454010 0.1660455  0.28801398 -0.4908889  0.3216289  2.136825
## [5,]  0.01390623 0.2470356 -1.37503920  1.5481585 -0.8928234  5.331753
## [6,]  0.10475357 0.2075099  0.42908383 -0.6048740 -1.1930720  3.649839
##          MnTrDif    VrTrDif   MnRegBl  VarRegBl    MnBlDif    VarBlDif
## [1,]  0.13180583  0.5235742 0.9993313 1.0510301 -0.7338041 -1.02348464
## [2,]  0.07036777 -8.4382858 0.4023239 0.3034804 -0.4432217 -0.33069420
## [3,] -0.04103435 -1.1247601 0.6042966 0.3967216 -0.4255449 -0.46513834
## [4,] -0.03756702  0.1163738 0.4948740 0.2396588 -0.1146924 -0.06717754
## [5,] -0.72921908  2.6127161 0.5525451 0.5222590 -0.8189069 -1.38471365
## [6,]  0.92495992  0.4136993 0.4392461 0.1753194 -0.5978893 -0.42995018
##            Amp1        Amp2    BiCoef   BiRatio Mode1 Mode2
## [1,] -0.5009519  0.61109816 0.4937175 1.9620974     0    46
## [2,] -5.2640524  0.08181265 0.5605905 0.0000000    -5     7
## [3,]  1.0182652  0.08696686 0.4436665 0.0000000     0    13
## [4,] -0.1646709  0.39856483 0.4095733 0.9973487     0    38
## [5,] -4.4015307 -0.22690035 0.5775342 5.1807061     0    13
## [6,] -2.0593978  1.53009020 0.4151477 0.2081092    -2     6

We can look at the distribution of values for each summary statistic and partition them by model.

%%R
## have a look at all of the summary statistics
colnames(Assembly.Data$summary.stats) 
##  [1] "Mean.BL"  "Var.BL"   "Mean.Tr"  "Var.Tr"   "Moran.I"  "Age"     
##  [7] "Colless"  "Sackin"   "nLTT"     "Msig"     "Cvar"     "Svar"    
## [13] "Shgt"     "Dcdf"     "Skw"      "Krts"     "MnRegTr"  "VrRegTr" 
## [19] "MnTrDif"  "VrTrDif"  "MnRegBl"  "VarRegBl" "MnBlDif"  "VarBlDif"
## [25] "Amp1"     "Amp2"     "BiCoef"   "BiRatio"  "Mode1"    "Mode2"

## replace "Shgt" with the name of any other summary statistic to explore it
stats.by.model <- data.frame(model = Assembly.Data$params$comsim, 
                             stat = Assembly.Data$summary.stats[,"Shgt"])

ggplot(stats.by.model, aes(x=stat, fill=model)) +
  geom_density(alpha=0.8) +
  scale_fill_manual(values=c("#005679", "#00B7C4", "#FF8800")) 

3. Random Forest

The randomForest package implements Breiman’s ensemble random forest algorithm for classification and regression (Breiman 2001). Ensembling is a “type of supervised learning technique where multiple models are trained on a training dataset and their individual outputs are combined by some rule to derive the final output.” Essentially, training data are used to build many classification trees and the predictions from each tree are combined for a final inference. The “rule” that governs how outputs are combined is most commonly averaging (for numerical data) or voting (for categorical data). In our case, the predictions are model identities.

For more information on the randomForest R package and its uses, go here.

First, load the randomForest package.

%%R
library(randomForest)

To construct a classifier, the summary statistics are the predictor variables of the model and are used to construct the decision trees. The response variables are the model identities that the summary statistics correspond to, and every decision tree ends in a classification of one of those models.

Let’s now build a random forest classifier with the community assembly data we have simulated, and then dive into more properties of the classifier.

%%R
## The response variables
modelIndex <- Assembly.Data$params[,"comsim"]

## Organize the predictor variables with the response
ReferenceTable <- data.frame(Assembly.Data$summary.stats, modelIndex)

## build the classifier using all predictors
RF.model <- randomForest(modelIndex ~., data=ReferenceTable, ntree=1000, importance=T)
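As a quick aside, because the classifier was built with importance=T, we can also ask which summary statistics the forest leaned on most:

%%R
## per-statistic importance scores (available because importance=T above)
head(importance(RF.model))
varImpPlot(RF.model)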

a) error rates

As mentioned before, RF is a supervised machine learning algorithm, meaning that we rely on training data to define our model in order to make accurate predictions. A major concern when building these classifiers is that the data may not be sufficient to distinguish between the models. RF internally validates the accuracy of any given classifier by withholding some of the provided data in order to cross-validate the decision trees while they are being built. Essentially, we can see how often each model is incorrectly classified, known as the Out-of-Bag (OOB) error rate.

%%R
RF.model
## 
## Call:
##  randomForest(formula = modelIndex ~ ., data = ReferenceTable,      ntree = 1000, importance = T) 
##                Type of random forest: classification
##                      Number of trees: 1000
## No. of variables tried at each split: 5
## 
##         OOB estimate of  error rate: 23%
## Confusion matrix:
##             competition filtering neutral class.error
## competition          27         0       4   0.1290323
## filtering             0        38       4   0.0952381
## neutral               6         9      12   0.5555556
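The printed OOB estimate can also be pulled directly from the fitted object; err.rate stores the per-tree error trajectories that the plot below will draw:

%%R
## final OOB error rate = last row of the "OOB" column of err.rate
tail(RF.model$err.rate[, "OOB"], 1)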

The error rates for a classifier are not always final. Classifiers can be tuned to improve classification accuracy; we will learn more about this tomorrow. For now, though, we can plot the forest object and see how the error rates change as trees are added to the forest. One of the biggest advantages of the ensemble approach is that it adds predictive power to the forest, because alone, decision trees are weak predictors.

%%R
plot(RF.model, col=c( "gray50", "#005679","#00B7C4", "#FF8800"), lwd=2)
legend("topright", legend=c("filtering", "competition", "neutral", "average"), fill=c("#00B7C4", "#005679", "#FF8800", "gray50"), bty="n", cex=1.2)

4. References

Breiman, L. 2001. Random Forests. Machine Learning, 45, 5-32.

Butler, M.A. & King, A.A. 2004. Phylogenetic Comparative Analysis: A Modeling Approach for Adaptive Evolution. Am. Nat., 164, 683-695.

Felsenstein, J. 1985. Phylogenies and the Comparative Method. Am. Nat., 125, 1-15.

Hansen, T.F. 1997. Stabilizing Selection and the Comparative Analysis of Adaptation. Evolution, 51, 1341-1351.

Stadler, T. 2011. Simulating trees with a fixed number of extant species. Syst. Biol., 60, 676-684.