Spurlingia forsteriana
publication ID |
https://doi.org/10.1111/j.1096-3642.2010.00644.x |
persistent identifier |
https://treatment.plazi.org/id/038987F7-FF99-0257-E5E6-FF358685FD77 |
treatment provided by |
Valdenar |
scientific name |
Spurlingia forsteriana |
status |
|
[Figure labels recovered from extraction: Austrochloritis porteri (Eastern Australia); Amplirhagada (Western Australia); Acusta despecta (Asia); Nipponchloritis; Theba pisana (Europe); Monacha cantiana; Central America.]
%branches: percentage of the number of branches. %divergence: percentage of sum total branch length.
distinguish these from the east Australian groups (see Discussion). The characteristics of the nested matrix are outlined in Table 3, which gives a breakdown of the number of branches (and the proportion of overall tree branch length) based upon one, two, and three genes. As all data are from one locus (mtDNA), the gene tree/species tree issue (Maddison, 1997) is not explored here, and we present the estimated mtDNA phylogeny as a good proxy of the organismal tree.
MtDNA alignment and sequence characteristics
The aligned mtDNA sequence data matrix for this analysis comprises 327 tips containing 176 COII sequences of 480 sites, 274 16S rDNA sequences of 450 sites, and 86 12S rDNA sequences of 522 sites, for a total of 1452 sites. Sixty-two taxa have all three genes, 85 taxa have two genes, and 180 taxa have only one gene, so that about a quarter of a million characters are defined in a matrix of about half a million, the remainder being missing data. The total matrix of 474 804 cells comprises 235 208 ACGT states, 218 IUPAC ambiguity states, 222 132 missing states, and 17 246 gap states. Sequences were aligned with ClustalX (Thompson et al., 1997) using standard parameters, visually inspected for major anomalies, and, where available, compared to 16S and 12S secondary structure models.
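The counts quoted above are mutually consistent, which can be checked with a few lines of arithmetic. The sketch below uses only the figures given in the text; the variable names are illustrative and not taken from the original analysis.

```python
# Figures quoted in the text; no new data are introduced here.
TIPS = 327
GENES = {  # gene: (number of sequences, aligned sites)
    "COII": (176, 480),
    "16S": (274, 450),
    "12S": (86, 522),
}
STATES = {"ACGT": 235_208, "ambiguity": 218, "missing": 222_132, "gap": 17_246}
TAXA_BY_GENE_COUNT = {3: 62, 2: 85, 1: 180}  # genes per taxon: number of taxa

# Total aligned sites across the three genes.
total_sites = sum(sites for _, sites in GENES.values())

# Every tip occupies every site, sequenced or not.
total_cells = TIPS * total_sites

# Cells covered by an actual sequence (bases, ambiguity codes, or gaps).
sequenced_cells = sum(n * sites for n, sites in GENES.values())

# Number of individual gene sequences, counted two independent ways.
sequences_by_gene = sum(n for n, _ in GENES.values())
sequences_by_taxon = sum(g * t for g, t in TAXA_BY_GENE_COUNT.items())

print(total_sites, total_cells, sequenced_cells)
print(f"defined fraction: {sequenced_cells / total_cells:.3f}")
```

The non-missing cells (ACGT, ambiguity, and gap states together) equal the cells covered by the sequenced genes, which is the "quarter of a million in half a million" proportion described in the text.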
All sites were used, including gap regions, because: (1) amongst closely related sequences there is much less alignment ambiguity, so these regions contain valuable information; and (2) including a site rate heterogeneity parameter can effectively weight the degree of uncertainty in homology by the level of overall divergence. Together, these give better resolution near the tips and better branch length estimates across the depth of the tree. Nevertheless, we tested the effect of removing complex alignment sections in the 62-taxon all-gene data matrix (a total of 7.5% of sites excluded).
Three levels of taxon sampling
The full 327-tip supermatrix contains three levels of data completeness, corresponding to datasets in which taxa had (1) all genes, (2) at least two genes, and (3) any gene. Analyses were performed on all three: the 62t analysis had 62 taxa with all three genes and no missing data; the 147t analysis had 147 taxa with at least two genes (62 with all three genes and 85 with two of the three genes); and the 327t analysis had 327 taxa (62 with all three genes, 85 with two of the three genes, and 180 with only one gene). Table 2 lists the genes available for each taxon.
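The nesting of the three datasets can be sketched as a simple filter on the number of genes available per taxon. The function names and the toy taxon table below are hypothetical; the real gene availability is given in Table 2, which is not reproduced here.

```python
ALL_GENES = frozenset({"COII", "16S", "12S"})

def nested_datasets(genes_by_taxon):
    """Assign taxa to the three nested completeness levels.

    genes_by_taxon maps a taxon label to the genes sequenced for it.
    Each level contains the level above it (all genes within two or more
    within any gene).
    """
    levels = {"all_genes": set(), "two_or_more": set(), "any_gene": set()}
    for taxon, genes in genes_by_taxon.items():
        n = len(ALL_GENES & set(genes))
        if n >= 1:
            levels["any_gene"].add(taxon)
        if n >= 2:
            levels["two_or_more"].add(taxon)
        if n == 3:
            levels["all_genes"].add(taxon)
    return levels

def nested_sizes(three_genes, two_genes, one_gene):
    """Dataset sizes implied by counts of taxa with 3, 2, and 1 genes."""
    return (three_genes,
            three_genes + two_genes,
            three_genes + two_genes + one_gene)

# Hypothetical toy table, for illustration only.
toy = {
    "taxon_1": ["COII", "16S", "12S"],
    "taxon_2": ["COII", "16S"],
    "taxon_3": ["16S"],
}
toy_levels = nested_datasets(toy)

# Counts from the text reproduce the 62t, 147t, and 327t dataset sizes.
sizes = nested_sizes(62, 85, 180)
```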
Model based estimation of tree support space
Phylogenetic inference using stochastic models evaluated by likelihood, where missing states are handled by integrating over all possible states, is ideally suited to this nested design of molecular sequence data, for both topology and branch length estimation (Swofford, 2000; Ronquist & Huelsenbeck, 2003; Driskell et al., 2004; Felsenstein, 2004; de Queiroz & Gatesy, 2007; Wiens & Moen, 2008). Dense taxon sampling, including many sequences spanning a wide range of divergence, can improve within-group alignment, characterization of site-by-site sequence evolution, and phylogenetic accuracy (Sullivan, Swofford & Naylor, 1999; Zwickl & Hillis, 2002; Blouin, Butt & Roger, 2005; Hedke et al., 2006). Large numbers of taxa bring an exponential increase in computational burden, which made the analysis of such a data matrix difficult until the development of Bayesian Markov chain Monte Carlo (MCMC) and fast ML methods, whereas missing data make measures of support such as bootstrapping and Bremer support problematic (Lee & Hugall, 2003; Wiens, 2003). Bayesian estimation of tree space support may be the best practical method for such a large nested data supermatrix, providing a way of rationalizing the amount of sequence that needs to be gathered for tree shape (both topology and branch length) and measures of support. Therefore our analysis was based on MCMC using MrBayes v. 3.1.2 (Ronquist & Huelsenbeck, 2003), backed up with ML using RAxML 7.0.4 (Stamatakis, 2006) with fast bootstrapping.
Preliminary MCMC runs were undertaken to determine optimal MCMC run conditions (chain mixing, burnin, parameter effective sample size; posterior probability stability, e.g. Wilgenbusch, Warren & Swofford, 2004; Beiko et al., 2006; Drummond & Rambaut, 2006). Standard (default) MrBayes settings were adequate for the 62t analyses. Additional chains (above the standard four) made little difference to the 147t analysis but lifted mixing in the 327t runs to levels considered suitable. For the final 327t analyses, an approximate starting tree (derived by parsimony) was used to speed up convergence and therefore more efficiently use runtime.
Final run conditions used were:
62 taxa dataset: 5 million steps sampling every 50 steps with a burnin of 20%, four chains with heating temperature = 0.2 (i.e. standard), random starting tree.
147 taxa dataset: 10 million steps sampling every 100 steps, 20% burnin, six chains with heating temperature = 0.15, random starting tree.
327 taxa dataset: 20 million steps sampling every 100 steps, 20% burnin, eight chains with heating temperature = 0.075, approximate MP starting tree.
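The three run configurations above can be compared by computing the number of post-burnin samples and the heat of each chain. This is a minimal sketch: it assumes the incremental heating scheme documented for MrBayes, in which chain i (with i = 0 for the cold chain) has heat 1 / (1 + T·i), and it ignores the initial generation-0 sample. The class and method names are illustrative.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Run:
    label: str
    steps: int
    sample_every: int
    burnin_percent: int
    chains: int
    temperature: float

    def retained_samples(self) -> int:
        """Samples left after discarding the burnin portion."""
        total = self.steps // self.sample_every
        return total - total * self.burnin_percent // 100

    def chain_heats(self) -> list:
        """Heat of each chain, assuming heat_i = 1 / (1 + T * i)."""
        return [1.0 / (1.0 + self.temperature * i) for i in range(self.chains)]

# Settings as listed in the text.
RUNS = [
    Run("62t", 5_000_000, 50, 20, 4, 0.2),
    Run("147t", 10_000_000, 100, 20, 6, 0.15),
    Run("327t", 20_000_000, 100, 20, 8, 0.075),
]

for run in RUNS:
    hottest = run.chain_heats()[-1]
    print(f"{run.label}: {run.retained_samples()} samples kept, "
          f"hottest chain heat {hottest:.3f}")
```

Under this assumption, lowering the temperature while adding chains keeps the hottest chain at a similar heat in all three configurations (roughly 0.57 to 0.66) while spacing the chains more closely, which is consistent with the aim of improving mixing in the larger datasets.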
Analyses were conducted on an e-linux parallel cluster running one chain per processor: the 327t 20 million × 8 chain runs took 250–300 h each. Each analysis was carried out twice and then combined for the final Bayesian estimates of posterior probability (PP).
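One simple way to combine two runs is to pool their post-burnin tree samples and take the frequency of each clade as its posterior probability. The sketch below illustrates that idea only; the text does not state the exact combining procedure, so this is an assumption. Trees are represented as sets of clades, each clade a frozenset of tip labels, and the toy samples are hypothetical.

```python
from collections import Counter

def clade_support(trees):
    """Fraction of sampled trees that contain each clade."""
    counts = Counter(clade for tree in trees for clade in tree)
    return {clade: n / len(trees) for clade, n in counts.items()}

def pooled_support(run_a, run_b):
    """Clade support from the pooled samples of two runs."""
    return clade_support(list(run_a) + list(run_b))

def max_support_difference(run_a, run_b):
    """Largest between-run difference in clade support (a rough convergence check)."""
    a, b = clade_support(run_a), clade_support(run_b)
    return max(abs(a.get(c, 0.0) - b.get(c, 0.0)) for c in set(a) | set(b))

# Hypothetical post-burnin samples from two runs of four trees each.
AB = frozenset({"A", "B"})
CD = frozenset({"C", "D"})
AC = frozenset({"A", "C"})
run_1 = [{AB, CD}, {AB, CD}, {AB}, {AC}]
run_2 = [{AB, CD}, {AB}, {AB}, {AB, CD}]

pooled = pooled_support(run_1, run_2)
difference = max_support_difference(run_1, run_2)
```

A small maximum difference between runs suggests that both have sampled the same posterior, which is the usual justification for pooling them.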
Sequence evolution model design
Model choice and partition strategies were evaluated with the second-order Akaike information criterion (AICc: Burnham & Anderson, 2003; Lee & Hugall, 2006). For assessing partition strategies, AICc was calculated from the MCMC equilibrium average lnL (Hugall et al., 2008). To get the best estimate of phylogeny with branch lengths, the combined total data were used with branch length estimates linked. However, each gene may have a different overall rate. Despite missing data, for each branch the union of genes can contribute to branch length estimation, and by linking branch lengths all sites can contribute information, with a relative rate parameter (m) for each gene. Given the hierarchical dataset and hence blocks of missing data, partition by gene was considered desirable, as the rate parameter may account for different rates amongst genes and hence provide some correction amongst branches estimated from different genes (Ren, Tanaka & Yang, 2008; Lemmon et al., 2009). A further two strategies were compared. As a result of limited secondary structure information, rDNA was split into length-variable regions (‘loops’) vs. length-stable regions (‘stems’). Therefore the three models investigated were: (1) a single partition model; (2) a three partition model (3p: COII, 12S, 16S); and (3) a four partition model (4p: COII 1st + 2nd, 3rd, rDNA ‘loops’, rDNA ‘stems’). AICc indicated that the most general model (general time-reversible with gamma rate heterogeneity and proportion of invariant sites parameters: GTR-G-inv) was appropriate for each partition type, and that the three and four partition strategies, but not more complex arrangements, were improvements over the single partition strategy.
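For reference, the second-order criterion is AICc = -2 lnL + 2k + 2k(k + 1)/(n - k - 1), where k is the number of free parameters and n the sample size. The sketch below computes it and counts parameters for the three partition strategies. The counting rules are assumptions made for illustration, not taken from the text: 10 substitution parameters per GTR-G-inv partition (5 exchangeabilities, 3 base frequencies, gamma shape, proportion invariant), one fewer relative rate than partitions, 2t - 3 linked branch lengths for t taxa, and n taken as the number of aligned sites.

```python
def aicc(ln_likelihood, n_params, sample_size):
    """Second-order Akaike information criterion."""
    k, n = n_params, sample_size
    if n - k - 1 <= 0:
        raise ValueError("AICc is undefined unless sample_size > n_params + 1")
    return -2.0 * ln_likelihood + 2.0 * k + 2.0 * k * (k + 1) / (n - k - 1)

def free_params(partitions, taxa, per_partition=10):
    """Free parameter count under the assumptions stated in the lead-in."""
    branches = 2 * taxa - 3        # unrooted bifurcating tree, linked lengths
    relative_rates = partitions - 1
    return partitions * per_partition + relative_rates + branches

# Parameter counts for the 62-taxon dataset under each strategy.
k_by_strategy = {p: free_params(p, taxa=62) for p in (1, 3, 4)}
print(k_by_strategy)
```

To compare strategies, the average lnL of each analysis would be passed to `aicc` with its parameter count and `sample_size=1452`; the strategy with the lowest AICc is preferred.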
No known copyright restrictions apply. See Agosti, D., Egloff, W., 2009. Taxonomic information exchange and copyright: the Plazi approach. BMC Research Notes 2009, 2:53 for further explanation.