Bayesian Estimation of Species Trees


DEB 743330 & 743616

Here is a brief manual for BEST. Users should also be familiar with the manual for MrBayes.

Frequently Asked Questions:

1. How do I specify the alleles (sequences) for each species?

Answer: use the command taxset to specify the alleles for each species.

For example -
taxset s1=1 2; [the first two sequences belong to species s1]
taxset s2=6-9; [the sequences 6,7,8,and 9 belong to species s2]
taxset s3=3 5; [the sequences 3 and 5 belong to the species s3]
taxset s4=4; [the sequece 4 belongs to the species s4]

Note that the taxset command must be used to define which species go with which sequences even when there is only one sequence per species.

2. How do I define a haploid sequence?

Answer: lset applyto=(2,3) ploidy=haploid; [the gene (or partition) 2 and 3 are haploid]

Note the importance of specifying which sequences are haploid and which are diploid (for example, when you have a mixture of mitochondrial and nuclear genes in your data set) to allow the program to adjust for the four-fold difference in population sizes.

3. How do I run BEST from within the modified MrBayes?

Answer: use the prset command in MrBayes/BEST to define the parameters of BEST.

For example -
prset thetapr=invgamma(3,0.003) genemupr=uniform(0.5,1.5) BEST=1;
unlink topology=(all) brlens=(all) genemu=(all);

Note that the parameter "thetapr" specifies the prior distribution of the population size. We recommend using the inverse gamma distribution because it is a conjugate prior and can thus reduce the computational time.
Also, the BEST 2.0 beta uses a uniform prior for the species tree. By contrast, previous versions of BEST used a birth-and-death prior for the species tree. Topology, brach lengths, and gene muation rates should all be unlinked. If necessary, other parameters such as tratio, or shape in the substitution model should also be unlinked.

4. Where do I find the information to summarize the posterior distributions of species trees, divergence times and population sizes?

Answer: BEST can estimate the following parameters:

The gene trees: saved in the .t files
The species tree with divergence times and population sizes: saved in the sptree.t file where divergence times are the numbers after ":", while population sizes are the numbers after "#".
The parameters in the substitution model: saved in the .p file.

5. How do I summarize the posterior distribution when I have one allele per species?

Answer: Note that the file .sptree.t created by BEST is in the format ready to summarize the posterior distribution of species trees using the command "sumt" in BEST with the input file as .sptree and setting best=1 using the "prset" command.

For example -
prset best=1;
sumt nrun=2 filename=test.nex.sptree burnin=5;
BEST will produce a .con file, a .trprobs, and a .parts file.
The file .con contains the consensus tree constructed from the estimated posterior distribution of the species tree, divergence times (after ":"), and population sizes (after "#"). The divergence times and population sizes are estimated by posterior means.
The file .trprobs contains the estimated posterior probabilities for species trees.
The posterior means and standard deviations for divergence times and population sizes are in the file .parts. These can then be used to construct approximate 95% credible regions for divergence times and population sizes.

6. How do I summarize the posterior distribution when I have multiple alleles for some species? (version 2.3)

Answer: In version 2.3 the dummy file described in answer 6a above is created automatically and is named yourfilename.sumt. You simply need to execute this file to produce the desired consensus tree summary.

Important notes:

  • Do not include a "sumt" command in the original data file.
  • In this dummy file the default is set to discard the first 50% of species trees as burn-in. This can be easily changed by editing the file.
  • As usual, you may execute the file either from within BEST using "execute yourfilename.sumt" or directly using "mbbest -i yourfilename.sumt."
  • 7. How does the species tree notation work?

    Answer: It is illustrated below -

    8. How should I set the prior for theta?

    Answer: The inverse gamma prior distribution has two parameters, α and β. When α > 1, the mean of this distribution is β/(α-1) and, for α > 2, the variance = mean/(α-1)(α-2). For smaller α the mean and variance would be infinite. The figure below, from Wikipedia, ilustrates the density function for various α and β.

    One way to determine an appropriate value for theta is then to find out from the literature or from your own data what an approximate average value for theta is for extant species in the clade you are interested in, or in closely related species. Theta is approximately the number of variable sites per base pair (expressed on a per site basis); it is specific to each locus, so an average across loci might be appropriate. In calculating this average be sure to account for the two-fold differences between mitochondrial and nuclear DNA if you have both in your data. For further advice on this issue see Edwards and Beerli (Evolution, 2000).

    9. How do I specify the size of the neighborhood around the MT used in specifying proposal trees?

    Answer: BEST takes a set of proposed gene trees from MrBayes and matches it with a tree in the neighborhood of the Maximum Tree that satifies the constraint that all divergences of species pairs must occur after the respective gene divergences occur. In particular, the Maximum Tree is modified at a random (Poisson) number of nodes, where the mean of this Poisson distribution is specified using the prset command in MrBayes for the variable poissonmean (default value = 5).

    10. How do I specify the cooling rate in the annealing steps?

    Answer: The annealing steps downweight the influence of the prior at the beginning of the burn-in period in order to move more rapidly into areas of high likelihood. The temperature in the annealing step is cooled at a rate controlled by the parameter propTemp which is set using the prset command in MrBayes. For example propTemp=0.05 means that the algorithm will use annealing for the first 5% of generations (default value = 0.1).

    11. Is it possible to run BEST without having it on my own computer?

    Answer: Yes. BEST is available for use on the BioHPC compute cluster at the Computational Biology Service Unit at Cornell University. Go to to run BEST using this resource.

    12. How do I change the maximum number of species currently allowed in BEST?

    Answer: For very large problems, you can change the current maxima in the mb.h file

    #define MAX_NUM_DIVS 1010
    #define MAX_NUM_CHARSETS 1010
    #define MAX_NUM_PARTITIONS 1010
    #define MAX_NUM_TREES 1010
    #define MAX_NUM_TAXASETS 50

    and also in the best.h file

    #define NSPECIES 200
    #define NGENE 1010

    and then recompile.