Sample Complexity of Species Tree Estimation

ASTRAL (Mirarab et al., 2014)1 is a method of reconstructing species trees from a given set of gene trees that have been reconstructed from sequence data. While it was shown to be statistically consistent in (Mirarab et al., 2014) 1 , not much was known about its sample complexity, i.e., the number of gene trees ($m$) required to reconstruct the true species tree with high probability. In (Shekhar et al., 2018)2, we provided a tight characterization of the data requirement (i.e., $m$) of ASTRAL in terms of the number of leaves ($n$) and the length of the shortest branch ($f$).

More formally, under the Multispecies Coalescent (MSC) model, for the class of species trees with $n$ leaves and the shortest branch length ($f$) sufficiently small, we obtained the following results:

  • We first showed that $\mathcal{O} \left( f^{-2} \log n \right)$ gene trees are sufficient for ASTRAL to output the true species tree with high probability. This result follows from the standard argument of applying Hoeffding’s inequality followed by a union bound.

  • Proving that $m= \Omega\left ( f^{-2} \log n \right)$ is also necessary was more involved. We proceeded in the following steps:

    • We began by proving a weaker result. We showed that with $m \leq \mathcal{O} \left (f^{-2}\right)$ gene trees, a quartet species tree is wrongly reconstructed by ASTRAL with probability close to 0.5. For obtaining this result, we reduced the error event to the study the deviation of a binomial random variable and then used the Berry-Esseen theorem to approximate this binomial. Having obtained the result for the quartet ($n=4$) case, we extended this result to the general case by constructing a species tree consisting of a triplet joined to the rest of the tree by a long branch.

    • Using the insights from the previous result, we then strengthened it to match the sufficient conditions on $m$. In particular, by allowing an extra $\log(n)$ factor and using stronger deviation inequalities for binomial random variables, we showed that the error in reconstructing the quartet species tree is at least $1/n^{a}$ for some $a>0$ . Then by considering a tree with $n/3$ triplets joined by long branches, we showed that the reconstruction error can be made arbitrarily close to $1$, for $m \leq (a/5)f^{-2} \log n$.

These results imply that for ASTRAL to guarantee correct reconstruction with high probability uniformly over the space of all species trees, $\Theta\left(f^{-2}\log n\right)$ gene trees are both necessary and sufficient.

References


  1. Mirarab, S., Reaz, R., Bayzid, M. S., Zimmermann, T., Swenson, M. S., & Warnow, T. (2014). ASTRAL: genome-scale coalescent-based species tree estimation. Bioinformatics. ↩︎

  2. Shekhar, S., Roch, S., & Mirarab, S. (2017). Species tree estimation using ASTRAL: how many genes are enough?. IEEE/ACM transactions on computational biology and bioinformatics. ↩︎

Shubhanshu Shekhar
Shubhanshu Shekhar
Ph.D. Candidate

Interested in sequential learning and optimization.