We will begin by installing devtools. Devtools is an R package that enables the development of other R packages. You can use devtools to install a package that is managed in a github repository. We will install CAMI this way.
## install devtools with the line below if you do not already have it
install.packages("devtools")
library(devtools)
## install CAMI from github using devtools with the line below
devtools::install_github("ruffleymr/CAMI")
library("CAMI")
To see the functions available in CAMI, or and R package for that matter, you can use name of the funcion followed by ::
and tab complete. If we do this for CAMI, CAMI::
, we can see that there are three functions. Use ?
before the name of each function to see the help documentation.
For full tutorial using CAMI on an empircal dataset, see CAMI tutorial.
To simulate community assembly data we will be using the SimCommunityAssembly()
function. This function simulates phylogenetic and phenotypic community assembly data in a species-based model. The function takes several parameters, which can be seen in the help documentation using ?SimCommunityAssembly
. This function can be used to simulate 1 to many community assembly datasets, set using the sims
argument.
For a single simulation, first, a regional community phylogeny is simulated under a constant birth-death process. CAMI
uses the function sim.bd.taxa
in the R package TreeSim
(Stadler 2011).
N
determines the size of the regional species poolslambda
is the speciation rate in species/million yearseps
is the extinction fraction, which is used to calculate the extinction rate from the speciation rateTrait evolution is then modeled along this phylogeny under one of two models of trait evolution, Brownian Motion (BM; Felsenstein 1985) and Ornstein-Uhlenbeck (OU; Hansen 1997, Butler & King 2004), determined by the traitsim
argument.
traitsim
either “BM” or “OU”sig2
is the instantaneous rate of character change for either modelalpha
controls the strength of pull to the trait optimum and is for OU models onlyOnce the regional community pool exists, the species are assembled into local communities under one of three assembly models, neutral, habitat filtering, and competitive exclusion, set using the comsim
argument. The non-neutral assembly models use the phenotypic variation in the regional community and phenotypic matching and repulsion equations, for filtering and competition models respectively, to determine which species will survive in the local community.
comsim
either “neutral”, “filtering”, “competition”, or “all”tau
controls the strength of the non-neutral assembly processesFor filtering models, the lower tau
is, the stronger the environmental filter. For competition models, the higher tau
, the stronger the effects of competition.
For all of these parameters, a uniform prior distribution of parameter values or a fixed value can be designated. Many of the parameters have default uniform prior distributions, see the help page for details on each parameter.
We will start with one simulation under a neutral model of assembly and BM trait evolution. The regional community pool will consist of 100 species and the local community pool will only have 50. All other parameters will be kept under their default uniform prior ranges.
## simulate 500 community datasets under all models of assembly
## leave lambda, eps, sig2, and tau with their default prior ranges
%%R
Assembly.Data <- SimCommunityAssembly(sims = 1,
N = 100,
local = 50,
traitsim = "BM",
comsim = "neutral",
## do you want summmary stats calculated?
output.sum.stats = TRUE,
## do you want dispersion stats calculated?
output.phydisp.stats = FALSE)
Note the additional flag options. If dispersion metrics should be calculated for each simulated community, that flag can be turned on. In addition to that, the summary statistics flag can be turned off.
The output is a list of two data frames. The first is of all parameters and the second contains all summary statistics. In both cases, each row in the matrix is a unique simulation and the rows between matrices correspond. The $
operator can be used to look at each matrix within the list.
%%R
Assembly.Data
## $params
## sim comsim traitsim N local lambda mu sig2 alpha tau
## 1 1 neutral BM 100 50 0.8339 0.2859 3.8814 0.183 2.3082
##
## $summary.stats
## Mean.BL Var.BL Mean.Tr Var.Tr Moran.I Age Colless
## [1,] 1.197862 1.464759 -2.751333 16.31271 0.03465375 8.388496 0.1471088
## Sackin nLTT Msig Cvar Svar Shgt Dcdf
## [1,] 381 0.07205209 3.771697 -5.587071 0.05332621 0.04046575 0.1493878
## Skw Krts MnRegTr VrRegTr MnTrDif VrTrDif MnRegBl
## [1,] -0.2018474 -0.201342 -2.478311 17.78307 0.2730226 1.470365 0.8811163
## VarRegBl MnBlDif VarBlDif Amp1 Amp2 BiCoef
## [1,] 0.8899038 -0.3167458 -0.5748553 -3.445615 0.09997727 0.3479025
## BiRatio Mode1 Mode2
## [1,] 0 0 11
We will look at this output more in depth momentarily. First, let’s simulate much more community data.
Simulate 500 community datasets under all assembly models. We will fix the size of the regional pool and put a prior range on the size of the local species pool. All other parameters are still set with their default prior ranges.
## simulate 500 community datasets under all models of assembly
## leave lambda, eps, sig2, and tau with their default prior ranges
%%R
Assembly.Data <- SimCommunityAssembly(sims = 500,
N = c(100,200),
local = c(20, 70),
traitsim = "BM",
comsim = "all",
lambda = c(0.05, 2.0),
eps = c(0.2, 0.8),
sig2 = c(.05, 3.0),
alpha = c(0.01, 0.2),
tau = c(1, 30))
The parameters are important to record for each simulation because, much like in ABC, randomForest
relies on using the known paramter values for simulations to make estimate these parameters for empirical data.
%%R
## the first matrix is the parameters for each simulation
head(Assembly.Data$params)
## sim comsim traitsim N local lambda mu sig2 alpha tau
## 1 1 filtering BM 194 62 0.5569 0.1383 0.1220 0.1858 3.5202
## 2 2 competition BM 170 51 1.9369 1.1178 1.6025 0.1597 18.0324
## 3 3 neutral BM 175 61 0.8894 0.2923 2.7841 0.1952 13.1444
## 4 4 filtering BM 101 70 1.0888 0.4571 0.9149 0.0569 10.2444
## 5 5 neutral BM 141 23 1.1190 0.4585 0.8373 0.1299 29.4758
## 6 6 neutral BM 130 23 1.4573 0.5589 1.5378 0.0918 14.8794
Note that we record the extinction rate as
mu
rather than the extinction fractioneps
.
As a sanity check, let’s make sure that we are actually simulating under each model and verify that the parameters drawn match the prior distributions that were designated.
%%R
## this line lists the first 50 models simulated under
Assembly.Data$params$comsim[1:50]
## see how many of each model were simulated; change "neutral" to "filtering" and "competition"
sum(Assembly.Data$params$comsim=="neutral")
The number of simulations per model will be different for everyone, but hopefully each model has been simulated roughly 1/3 of the time. It is important that each model is equally represented in the training dataset in order to avoid a bias training dataset and ultimately, forest.
To verify the prior distributions, we can plot a histogram of all of the values used for the simulations for a given parameter and verify that it matches the prior distribution designated. Let’s look at the sig2
parameter, which we put a uniform prior on between 1.0 and 5.0.
%%R
library(ggplot2)
ggplot(data=Assembly.Data$params, aes(Assembly.Data$params$sig2)) +
geom_histogram(binwidth = .1) +
xlab("sig2")
We should see a uniform distribution between the upper and lower bound set. Have a look at some of the other parameters and verify the values simulated under match what was set.
The summary statistics capture information about the phylogenetic and phenotypic variation across the regional and local community, as well as the interactions between the phylogenetic and phenotypic information in the local community. Go here for a short description of each summary statistic.
%%R
head(Assembly.Data$summary.stats)
## Mean.BL Var.BL Mean.Tr Var.Tr Moran.I Age
## [1,] 1.7331353 2.0745147 0.2834313 0.6583034 -0.005739081 10.257033
## [2,] 0.8455456 0.6341746 -3.0352360 16.8924959 -0.044112280 6.389019
## [3,] 1.0298415 0.8618600 1.7225224 16.5199804 -0.008571531 7.135549
## [4,] 0.6095664 0.3068363 0.3591960 2.0204512 0.051407299 3.447975
## [5,] 1.3714519 1.9069726 -0.1636043 2.7190371 -0.011219058 10.093176
## [6,] 1.0371354 0.6052696 -2.1180320 3.2361401 -0.044787980 4.065084
## Colless Sackin nLTT Msig Cvar Svar
## [1,] 0.14426230 512 0.13863865 0.09121485 -2.993423 -0.0059372453
## [2,] 0.09387755 341 0.13431388 1.83391320 2.167033 0.0402161184
## [3,] 0.13389831 481 0.11551049 2.99609208 -12.593101 -0.0355546907
## [4,] 0.08439898 514 0.06431839 0.91496507 -5.842454 0.0518765725
## [5,] 0.27705628 134 0.15123036 0.52339499 103.411366 -0.0052479954
## [6,] 0.15584416 118 0.27363532 1.20099738 -3.987768 -0.0004501872
## Shgt Dcdf Skw Krts MnRegTr VrRegTr
## [1,] 0.04031444 0.2239556 -0.56985264 -0.4702116 0.4152372 1.181878
## [2,] -0.13073793 0.1552941 0.02349498 -1.4039576 -2.9648682 8.454210
## [3,] 0.03065279 0.1071038 -0.20652959 -0.8059628 1.6814880 15.395220
## [4,] 0.15454010 0.1660455 0.28801398 -0.4908889 0.3216289 2.136825
## [5,] 0.01390623 0.2470356 -1.37503920 1.5481585 -0.8928234 5.331753
## [6,] 0.10475357 0.2075099 0.42908383 -0.6048740 -1.1930720 3.649839
## MnTrDif VrTrDif MnRegBl VarRegBl MnBlDif VarBlDif
## [1,] 0.13180583 0.5235742 0.9993313 1.0510301 -0.7338041 -1.02348464
## [2,] 0.07036777 -8.4382858 0.4023239 0.3034804 -0.4432217 -0.33069420
## [3,] -0.04103435 -1.1247601 0.6042966 0.3967216 -0.4255449 -0.46513834
## [4,] -0.03756702 0.1163738 0.4948740 0.2396588 -0.1146924 -0.06717754
## [5,] -0.72921908 2.6127161 0.5525451 0.5222590 -0.8189069 -1.38471365
## [6,] 0.92495992 0.4136993 0.4392461 0.1753194 -0.5978893 -0.42995018
## Amp1 Amp2 BiCoef BiRatio Mode1 Mode2
## [1,] -0.5009519 0.61109816 0.4937175 1.9620974 0 46
## [2,] -5.2640524 0.08181265 0.5605905 0.0000000 -5 7
## [3,] 1.0182652 0.08696686 0.4436665 0.0000000 0 13
## [4,] -0.1646709 0.39856483 0.4095733 0.9973487 0 38
## [5,] -4.4015307 -0.22690035 0.5775342 5.1807061 0 13
## [6,] -2.0593978 1.53009020 0.4151477 0.2081092 -2 6
We can look at the distribution of values for each summary statistic and partition them by model.
%%R
## have a look at all of the summary statistics
colnames(Assembly.Data$summary.stats)
## replace HERE with a name of one of the summary statistics
stats.by.model <- data.frame(model=Assembly.Data$params$comsim,
stat = Assembly.Data$summary.stats[,"Shgt"])
ggplot(stats.by.model, aes(x=stat, fill=model)) +
geom_density(alpha=0.8) +
scale_fill_manual(values=c("#005679", "#00B7C4", "#FF8800"))
## [1] "Mean.BL" "Var.BL" "Mean.Tr" "Var.Tr" "Moran.I" "Age"
## [7] "Colless" "Sackin" "nLTT" "Msig" "Cvar" "Svar"
## [13] "Shgt" "Dcdf" "Skw" "Krts" "MnRegTr" "VrRegTr"
## [19] "MnTrDif" "VrTrDif" "MnRegBl" "VarRegBl" "MnBlDif" "VarBlDif"
## [25] "Amp1" "Amp2" "BiCoef" "BiRatio" "Mode1" "Mode2"
The randomForest
package implements Breiman’s ensemble random forest algorithm for classification and regression (Breiman 2001). Ensemble-ing is a “type of supervised learning technique where multiple models are trained on a training dataset and their individual outputs are combined by some rule to derive the final output.” Essentially, training data are used to build many classification trees and the predictions from each tree are combined for a final inference. The “rule” that governs how outputs are combined is most commonly averaging (for numerical data) or votes (for categorical). In our case, the predictions are model identities.
For more information on the randomForest
R package and uses go here.
For an i
%%R
library(randomForest)
To construct a classifier, the summary statistics are the predictor variables of the model, and what are used to construct the decision trees. The response variables are the model identities the summary statistics correspond to, and all decision trees end
Let’s now build a random forest classifier with the community assembly data we have simulated, and then dive into more properties of the classifier.
%%R
## The response variables
modelIndex <- Assembly.Data$params[,"comsim"]
## Organize the predictor variables with the response
ReferenceTable <- data.frame(Assembly.Data$summary.stats, modelIndex)
## build the classifier using all predictors,
RF.model <- randomForest(modelIndex ~., data=ReferenceTable, ntree=1000, importance=T)
As mentioned before, RF is a supervised machine learning algorithm meaning that we are relying on training data to define our model in order to make accurate prediction. A major concern when making these classifiers is that the data are not sufficient to distinguish between the models. RF internally validates that accuracy of any given classifier by withholding some of the provided data in order to cross-validate the decision trees while they’re being built. Essentially, we can see how often each model is being incorrectly classified, known as the Out-of-Bag (OOB) error rate.
%%R
RF.model
##
## Call:
## randomForest(formula = modelIndex ~ ., data = ReferenceTable, ntree = 1000, importance = T)
## Type of random forest: classification
## Number of trees: 1000
## No. of variables tried at each split: 5
##
## OOB estimate of error rate: 23%
## Confusion matrix:
## competition filtering neutral class.error
## competition 27 0 4 0.1290323
## filtering 0 38 4 0.0952381
## neutral 6 9 12 0.5555556
The error rates for a classifier are not always final. Classifiers can be tuned to improve accuracy in classification; we will learn more about this tomorrow. For now, though, we can plot the forest object and see how the error rates change as trees are added to the forest. One of the biggest advantages of the enemble approach is that it adds predictive power to the forest, because alone, decision trees are weak predictors.
%%R
plot(RF.model, col=c( "gray50", "#005679","#00B7C4", "#FF8800"), lwd=2)
legend("topright", legend=c("filtering", "competition", "neutral", "average"), fill=c("#00B7C4", "#005679", "#FF8800", "gray50"), bty="n", cex=1.2)
Breimen L. Random Forests. 2001 Machine Learning, 45, 5-32. link
Butler, M.A. & King, A.A. 2004. Phylogenetic Comparative Analysis: A Modeling Approach for Adaptive Evolution. Am. Nat.,164, 683-695. link
Felsenstein, J. 1985. Phylogenies and the Comparative Method. Am. Nat. 125, 1-15. link
Hansen, T.F. 1997. Stabilizing Selection and the Comparative Analysis of Adaptation. Evolution. 51, 1341-1351. link
Stadler, T. 2011. Simulating trees with a fixed number of extant species. Syst. Biol., 60, 676–684. link