Skip Navigation


MBE Advance Access originally published online on December 5, 2003
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (PDF) Freely available
Right arrow All Versions of this Article:
21/3/468    most recent
msh039v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (83)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Siepel, A.
Right arrow Articles by Haussler, D.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Siepel, A.
Right arrow Articles by Haussler, D.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

Mol. Biol. Evol. 21(3):468-488. 2004
DOI: 10.1093/molbev/msh039
© 2004 by the Society for Molecular Biology and Evolution. ISSN: 0737-4038

Phylogenetic Estimation of Context-Dependent Substitution Rates by Maximum Likelihood

Adam Siepel* and David Haussler*,{dagger}

* Center for Biomolecular Science and Engineering, University of California, Santa Cruz
{dagger} Howard Hughes Medical Institute, University of California, Santa Cruz

E-mail: acs{at}soe.ucsc.edu.


    Abstract
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results
 Discussion
 Supplementary Material

 Acknowledgements
 Literature Cited
 
Nucleotide substitution in both coding and noncoding regions is context-dependent, in the sense that substitution rates depend on the identity of neighboring bases. Context-dependent substitution has been modeled in the case of two sequences and an unrooted phylogenetic tree, but it has only been accommodated in limited ways with more general phylogenies. In this article, extensions are presented to standard phylogenetic models that allow for better handling of context-dependent substitution, yet still permit exact inference at reasonable computational cost. The new models improve goodness of fit substantially for both coding and noncoding data. Considering context dependence leads to much larger improvements than does using a richer substitution model or allowing for rate variation across sites, under the assumption of site independence. The observed improvements appear to derive from three separate properties of the models: their explicit characterization of context-dependent substitution within N-tuples of adjacent sites, their ability to accommodate overlapping N-tuples, and their rich parameterization of the substitution process. Parameter estimation is accomplished using an expectation maximization algorithm, with a quasi-Newton algorithm for the maximization step; this approach is shown to be preferable to ordinary Newton methods for parameter-rich models. Overlapping tuples are efficiently handled by assuming Markov dependence of the observed bases at each site on those at the N - 1 preceding sites, and the required conditional probabilities are computed with an extension of Felsenstein's algorithm. Estimated substitution rates based on a data set of about 160,000 noncoding sites in mammalian genomes indicate a pronounced CpG effect, but they also suggest a complex overall pattern of context-dependent substitution, comprising a variety of subtle effects. Estimates based on about 3 million sites in coding regions demonstrate that amino acid substitution rates can be learned at the nucleotide level, and suggest that context effects across codon boundaries are significant.

Key Words: neighbor-dependent substitution • CpG effect • codon model • expectation maximization • substitution rate matrix


    Introduction
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results
 Discussion
 Supplementary Material

 Acknowledgements
 Literature Cited
 
Many of the simplifying assumptions originally introduced for phylogenetic inference have been relaxed, resulting in models of improved realism and power. For example, the assumptions that each nucleotide or amino acid substitutes for the other at the same average rate, and that this average rate is constant across sites (Neyman 1971; Felsenstein 1981), have both given way to more realistic alternatives (Whelan, Liò, and Goldman 2001). One assumption that has largely been maintained, however—despite general agreement that it is biologically unrealistic—is that the sites at which evolution occurs can be considered independent.

Various methods have been proposed for relaxing site independence in limited ways, without sacrificing the essential strengths of maximum-likelihood methods for phylogenetic inference. For example, models have been introduced that allow for dependence between sites of the same codon (Goldman and Yang 1994; Muse and Gaut 1994; Pedersen, Wiuf, and Christiansen 1998; Yang et al. 2000; Schadt and Lange 2002), correlation of the overall rates of substitution at adjacent sites (Felsenstein and Churchill 1996; Yang 1995), and dependence between paired bases in ribosomal RNA genes (Schöniger and von Haeseler 1994; Muse 1995; Rzhetsky 1995; Tillier and Collins 1995). In addition, models have been developed for the special case of two sequences and a reversible substitution process that allow for general context-dependent substitution, with substitution rates for each base depending on the identity of flanking bases (Jensen and Pedersen 2000; Pedersen and Jensen 2001). With these models, the likelihood computation can no longer be expressed as a product over the sites of an alignment, and exact parameter estimation becomes intractable (a Markov random field arises from the bidirectional dependencies between adjacent sites); Markov chain Monte Carlo (MCMC) or similar methods must be used instead. (Another, simpler model of context-dependent substitution for pairs of sequences has been developed by Arndt, Burge and Hwa (2002), along with an approximation procedure, adapted from nonlinear dynamics, that allows for efficient parameter estimation.) These models accurately reflect an assumed process of context-dependent substitution, and therefore, will be described here as "process-based," in contrast to more empirical models (see below). To our knowledge, process-based models for context-dependent substitution have not yet been extended for use with a general phylogenetic tree.

It is now widely agreed that site independence is an unacceptable assumption in coding regions, where selection acts primarily at the level of amino acids, which are coded for by triplets of adjacent nucleotides. Indeed, the inadequacy of this assumption is the premise underlying codon models, which do appear to improve significantly on the goodness of fit of independent-site models in coding regions (Goldman and Yang 1994). Substitution rates in noncoding DNA, however, are also highly dependent on neighboring bases, and this phenomenon has mostly been ignored in phylogenetic analysis (except that sites subject to the strongest dependencies are often discarded).

Studies of aligned gene/pseudogene pairs in primates have revealed strong evidence of context-dependent substitution in (presumably) neutrally evolving regions, with the rate at which one nucleotide substitutes for another varying as much as 10-fold for different sets of flanking bases, and context-dependent rates overall spanning a more than 50-fold range (Blake, Hess, and Nicholson-Tuell 1992; Hess, Blake, and Blake 1994). Similar effects have been observed recently in connection with single-nucleotide polymorphisms (SNPs) in the human genome (Zhao and Boerwinkle 2002). The strongest context effects in these studies have been seen with C->T transitions in CpG dinucleotides (or G->A transitions on the reverse strand), and are presumably due to methylation and spontaneous deamination of cytosines (Bulmer 1986; Ehrlich, Zhang, and Inamdar 1990). Other effects have been noted as well, however, such as a tendency for higher rates of substitution where purines and pyrimidines alternate, an increased rate of T->C transitions in TpA dinucleotides, and a higher than expected rate of C->A and C->G (G->T and G->A) transversions in CpG dinucleotides (Blake, Hess, and Nicholson-Tuell 1992; Hess, Blake, and Blake 1994). A somewhat weak dependence of the transition/transversion ratio on the A + T content of flanking bases has also been observed in primates (Zhao and Boerwinkle 2002; Yang, Chen, and Li 2002), echoing the stronger dependence observed in the choloroplast and mitochondrial genomes of plants (Morton and Clegg 1995; Morton, Oberholzer, and Clegg 1997; Yang, Chen, and Li 2002). Context effects are strongest for the two immediate neighbors of a given site, but under certain conditions, they can be detected as well for the second and third pairs of flanking bases (Morton 1997; Morton, Oberholzer, and Clegg 1997). Subtle effects may extend as far as 200 bp (Zhao and Boerwinkle 2002), and longer-range dependencies of substitution rates on base composition and other regional properties have also been observed (Bernardi 2000; Hardison, Roskin, and Yang 2003). Context dependence of substitution rates is thought to be the result of neighbor effects on various phenomena, including polymerase fidelity and proofreading, mismatch repair, and mismatch stability (briefly reviewed by Blake, Hess, and Nicholson-Tuell 1992; Morton, Oberholzer, and Clegg 1997).

Most studies of context-dependent substitution rates in noncoding DNA have used simple counting methods and pairwise alignments of highly similar sequences. These methods have some clear deficiencies: for example, they do not allow for multiple substitutions per site or differences in evolutionary distance between aligned pairs of sequences (e.g., in different gene/pseudogene pairs), and they do not benefit from consideration of multiple homologous sequences and a phylogeny. If context-dependent substitution could be incorporated into probabilistic phylogenetic models, better estimates of context-dependent substitution rates might be obtained, as well as improved estimates of branch lengths and (possibly) tree topologies.

In this article, methods are introduced for incorporating context-dependent substitution into phylogenetic models that are more closely related to methods used in computational gene-finding (Durbin et al. 1998; Siepel and Haussler 2004) than to those used with process-based models. The basic strategy here is to extend and generalize codon models for improved handling of context dependence, without sacrificing too much in the way of computational efficiency. We stop short of a process-based description of context-dependent substitution, and as a result are able, for the most part, to retain the standard framework for likelihood computation and parameter estimation (the need for sampling or approximation algorithms is avoided). We begin with the introduction of arbitrary "N-th order" models, defined in terms of N-tuples of characters (for small N), and then a discussion of how these models can be fitted to large data sets using an expectation maximization (EM) algorithm. Next, a method is introduced, based on an assumption of Markov dependence of each site on its N - 1 predecessors, that allows N-tuples to overlap, so that context effects can be captured between all adjacent pairs of sites. Results are then presented for two large data sets of mammalian sequence—one of about 160,000 sites in noncoding regions and one of about 3 million sites in coding regions—which demonstrate that our context-dependent models produce significant improvements in goodness of fit compared with both codon models and independent-site models. Allowing for overlapping tuples of sites is shown to be important, as is using a sufficiently rich parameterization of the substitution process. Parameter estimates are presented for context-dependent substitution in both non-coding and coding regions, and discussed in some detail.


    Materials and Methods
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results
 Discussion
 Supplementary Material

 Acknowledgements
 Literature Cited
 
Background and Notation
Let a phylogenetic model {psi} = (Q, {pi}, {tau}, ß) be a four-tuple consisting of a substitution rate matrix Q, a vector of equilibrium frequencies {pi}, a (rooted) binary tree {tau} = (, ) (where and represent the nodes [vertices] and branches [edges] of the tree, respectively), and a set of branch lengths ß = {ßu|u - {r}} (where ßu corresponds to the branch between node u and its parent, denoted {sigma}(u), and r is the root of the tree). The model is defined with respect to an alphabet {Sigma} of size d, e.g., {Sigma} = (A, C, G, T). In standard phylogenetic models, Q has dimension d x d, and {pi} has dimension d. The tree has n leaves, or external nodes, corresponding to n present-day taxa; hence, in total there are 2n - 1 nodes and 2n - 2 branches. It is sometimes useful to distinguish between the set of leaves, {subseteq} , and the set of internal nodes, = - . A rooted tree is assumed here for simplicity, but in practice an unrooted tree is often used; only minor changes to our methods are required for unrooted trees.

The parameters of a phylogenetic model are estimated with respect to a multiple alignment of n sequences, whose length (number of columns) we denote L. It is assumed that there has existed a sequence of length L, with characters drawn from {Sigma}, corresponding to every node u ; the alignment consists of the subset of these sequences that correspond to leaves of the tree. Let {Xu,i} be a set of random variables, such that Xu,i represents the ith character (1 <= i <= L) in the sequence corresponding to node u . The random variables corresponding to leaves of the tree are observed, with values given by the alignment, and the random variables corresponding to ancestral nodes are unobserved, or latent. Let X denote the set of observed variables, X = {Xu,i|u }, and let X{circ} denote the set of latent variables, X{circ} = {Xu,i|u }. Let x and x{circ} denote instances of X and X{circ}, respectively, so that x is completely defined by the given alignment, and x{circ} represents a set of possible ancestral sequences. Let X•,i and X{circ},i be the sets of observed and latent variables, respectively, corresponding to column i in the alignment, X•,i = {Xu,i|u } and X{circ},i = {Xu,i|u }, and let x•,i and x{circ},i be sets of instances of these variables.

The likelihood of a phylogenetic model {psi} with respect to an alignment x is obtained by summing over all possible values of the latent variables. Assuming site independence,


The probability of an individual column, P(x•,i|{psi}) = {sum} P(x•,i, x{circ},i|{psi}), can be computed with Felsenstein's "pruning" algorithm (Felsenstein 1981; Durbin et al. 1998) (see Appendix A), assuming that the probability is available of the substitution of any b {Sigma} for any a {Sigma} over a branch of length t (t >= 0). These probabilities, denoted P(b|a, t), are based on a continuous-time Markov model of substitution, defined by the rate matrix Q. (The substitution process is assumed to be stationary and homogeneous.) Let the elements of Q be denoted {qa,b} (for a, b {Sigma}), with qa,b (a != b) representing the instantaneous rate at which a is replaced by b, and elements on the main diagonal defined such that each row sums to zero; i.e., qa,a = -{sum}b{Sigma}-{a} qa,b. The probabilities P(b|a, t) are given by the elements of the matrix P(t) = exp(Qt), where exp(Qt) = (Karlin and Taylor 1975; Liò and Goldman 1998). (We will sometimes denote P(b|a, t) as [exp(Qt)]a,b, to be clear that it is a function of Q.) Q can be parameterized in various ways (Yang 1994a; Whelan, Liò, and Goldman 2001); in this article, we consider three of the standard parameterizations, corresponding to the HKY, REV (general reversible), and UNR (unrestricted) models (see table 1). By convention, Q is scaled such that t can be interpreted as the expected number of substitutions per site (the scaling constraint is {sum}a,b{Sigma}:a!=b {pi}aqa,b = 1).


View this table:
[in this window]
[in a new window]
 
Table 1 Summary of Nucleotide Substitution Models Discussed in This Article.

 
A phylogenetic model can be extended easily to describe N-tuples of sites, for small N: the alphabet {Sigma} is simply replaced with {Sigma}N, and the dimensions of Q and {pi} are adjusted accordingly. As long as the N-tuples can be assumed independent, the procedures for computing the likelihood of a model and estimating its parameters remain essentially unchanged, although they become more computationally intensive. (The matrix P(t) = exp(Qt) now describes the joint probabilities of substitution of N-tuples of bases.) We refer to such a model as having order N, and when N > 1, we call the model context-dependent. In this article, four kinds of Nth order models are considered, corresponding to four parameterizations of the rate matrix Q: an unrestricted model (UN), a reversible model (RN), a strand-symmetric unrestricted model (UNS), and a strand-symmetric reversible model (RNS) (see table 1). In all cases, the number of parameters is kept manageable by assuming an instantaneous rate of zero for substitutions of multiple nucleotides, as in most codon models. A scaling constraint of {pi}aqa,b = N allows for branch-length units of substitutions per site. See Appendix B for additional details.

For context-dependent models, the letter Z will be used in place of X to denote both N-tuples of sites and N-tuples of random variables. N-tuples will be assumed to be composed of adjacent sites, so that Zu, j = (Xu, N( j-1)+1, ... , Xu, Nj), Z•, j = (X•, N( j -1)+1, ... , X•, Nj), and Z{circ}, j = (X{circ}, N( j-1)+1, ... , X{circ}, Nj) (where 1 <= j <= L', L' = L/N). As with xi and x{circ}i, z•, j and z{circ}, j are instances of Z•, j and Z{circ}, j, respectively. To complete the parallel, the variables Z, Z{circ}, z, and z{circ} are defined, but are equivalent to X, X{circ}, x, and x{circ}, respectively. When N-tuples are independent (and nonoverlapping, as above), the likelihood of the model is simply P(z|{psi}) = P(z•, j|{psi}). The probability of each tuple of columns, P(z•, j|{psi}) = {sum} P(z•, j, z{circ}, j|{psi}), can be computed with Felsenstein's algorithm, as before, but with substitution probabilities based on N-tuples of nucleotides. Overlapping N-tuples are discussed below.

Variation in the rate of evolution between different sites can be accommodated, with context-dependent as well as independent-site models, using the discrete gamma method of Yang (1993; 1994b). This method is based on the introduction of a random scaling parameter for the branch lengths ß, which is assumed to have a gamma distribution, whose shape parameter {alpha} is estimated from the data. The distribution is partitioned it into k "rate categories" of equal probability, each with a rate constant rl (1 <= l <= k), and the probability P(z•, j|{psi}) is approximated as


a quantity that can be computed with k invocations of Felsenstein's algorithm. Even quite small values of k appear to provide adequate approximations of the full continuous distribution; k = 4 was recommended by Yang (1994b).

Estimating the Parameters of a Context-Dependent Model
A maximum-likelihood estimate (m.l.e.) of the parameters of a phylogenetic model, = argmax{psi} P(z|{psi}), can be obtained by searching over tree topologies, and numerically optimizing (Q, {pi}, ß) conditional on each possible {tau}. In this article, {tau} is assumed to be known, and the problem is simplified considerably. In addition, {pi} is estimated directly from z, as is the standard practice (each frequency {pi}a is estimated as the observed frequency of tuple a in z), so that the critical problem is to obtain m.l.e.'s of Q and ß conditional on {pi} and {tau}.

Many software packages, including PAML (Yang 1997), MOLPHY (Adachi and Hasegawa 1996), fastDNAml (Olsen et al. 1994), PHYLIP (Felsenstein 1993), and PAUP (Swofford 2002), use Newton-Raphson or quasi-Newton algorithms for parameter estimation (Press et al. 1992), computing partial derivatives of the likelihood function numerically or with a combination of numerical and analytical methods. Newton-based optimization algorithms are not well suited for models with richly parameterized rate matrices, however, because of the computational cost of computing partial derivatives of the likelihood function with respect to rate-matrix parameters (the same is true of any method that makes use of function derivatives, such as conjugate-gradient methods). Moreover, they tend to require large numbers of evaluations of the likelihood function, whose computational cost increases exponentially with the order of a context-dependent model, and linearly with the number of sites considered. We will show in this section that an EM algorithm (Dempster, Laird, and Rubin 1977) is generally preferable to quasi-Newton methods for context-dependent models, although quasi-Newton methods can be useful for solving the optimization problem that arises on each iteration of the EM algorithm.

Suppose we seek the m.l.e. of (Q, ß) given {tau} and {pi}: (, ) = argmaxQ,ß P(z|Q, {pi}, {tau}, ß). The theorem behind the EM algorithm says that a (local) maximum of P(z|Q, {pi}, {tau}, ß) can be obtained by iterative maximization of the expected value of log P(z, z{circ}|Q, {pi}, {tau}, ß), computed with respect to the (posterior) distribution of Z{circ} (Dempster, Laird, and Rubin 1977). It can be shown (Appendix C) that the general EM update rule, by which estimates for iteration t + 1 are obtained from estimates for iteration t, here becomes


where E [S(b, a, u)|z, t, {pi}, {tau}, t] is the expected number of substitutions of b {Sigma}N for a {Sigma}N on the branch above u, given the observations at the leaves of the tree and the previously estimated parameters. Obtaining these values (the "E" step of the EM algorithm) can be accomplished with a procedure that is closely related to Felsenstein's algorithm (Appendix C). These expected values are then treated as constants when solving the remaining maximization problem (the "M" step).

This "inner" maximization problem remains nontrivial, but it can be solved numerically with the same kind of algorithm we have rejected for the larger problem of estimating (, ). Quasi-Newton methods (for example) are better suited for the inner problem than the outer one because the computational complexity of the function to be maximized is greatly reduced; in particular, the new function has no dependency on the length of the alignment (it requires O(nd 2N) time, versus O(nL'd2N) time for the complete likelihood function). The efficiency of the optimization procedure can be further improved by using analytical methods to obtain partial derivatives, approximating the derivatives with respect to rate-matrix parameters, and initializing the parameter values on each iteration with the solution of the previous iteration. The partial derivatives with respect to rate-matrix parameters can be derived using the Cauchy integral formula, as shown recently by Schadt and Lange (2002). See Appendix C for details.

Rate variation can also be accommodated with EM, using the discrete gamma model. In this case, the update rule becomes


where E[S(b, a, u, l)|z, t, {pi}, {tau}, t, t] is the expected number of substitutions of b {Sigma}N for base a {Sigma}N on the branch above u for rate category l. This quantity reflects the posterior probabilities of the rate categories at each site, as well as the posterior probabilities of substitutions. Note that the shape parameter {alpha} of the gamma distribution must be estimated as part of the inner maximization problem, and that the values r1, ... , rk are implicitly functions of this parameter. With rate variation, the cost of the E step is increased by a factor of k, and the EM algorithm converges more slowly. See Appendix C for further details.

Markov Dependence Between Sites
It is possible to allow for overlapping tuples of columns, and hence to capture context effects between all adjacent columns of the alignment x, by assuming Markov dependence of each column x•,i on its N - 1 predecessors. In this case, the likelihood function can be written as


and the probability P(x•,i|x•,i-N+1, ... , x•,i-1, {psi}) can be computed as


where {sum}•,i represents a sum over all possible columns of observed characters at site i (all possible values at the leaves of the tree). The joint probability in the numerator of equation 5 is simply the probability of an observed N-tuple of sites, and hence can be computed using an Nth order phylogenetic model. The sum in the denominator can be computed with the same model, using a slight adaptation of Felsenstein's algorithm that sums over all possible ancestral states at sites i - N + 1, ... , i, as usual, but also over all possible values of x•,i (see details in Appendix B). The new algorithm uses the same principle that is used in the version of Felsenstein's algorithm that allows for missing data (Appendix A), and as a result, it differs from Felsenstein's algorithm only in its initialization step. Thus, the conditional probability P(x•,i|x•,i-N+1, ... , x•,i-1, {psi}) can be computed with two passes through Felsenstein's algorithm, one for the numerator and one for the denominator of Equation 5, each with a different initialization strategy.

When Markov dependence between sites is combined with context-dependent substitution models, it is particularly useful to distinguish between sites of different types. Various functional categories of sites might be considered, for example, corresponding to first, second, and third codon positions, and noncoding regions. Alternatively, categories might correspond to different evolutionary rates, or to combinations of functional role and evolutionary rate (Siepel and Haussler 2004). In general, k categories are described by k phylogenetic models, {psi} = ({psi}1, ... , {psi}k), reflecting the different rates and/or patterns of substitution observed in the different categories. If each site can be assigned a category a priori, with c(i) being the category of the ith site (1 <= i <= L; 1 <= c(i) <= k), then in the 1st order case (N = 1) we have P(x|{psi}) = LP(x•,i|{psi}c(i)). In the general Nth order case, different context effects for different categories, as well as different rates and patterns of substitution, can be accommodated similarly, by altering equation 4 so that the probability of each x•,i is conditioned on {psi}c(i). For example, with N = 2 and k functional categories, equation 4 would be replaced by:


When the categories are not known a priori, it is possible to consider all possible assignments of sites to categories by treating the model as a hidden Markov model, with categories corresponding to states, and emissions corresponding to alignment columns. Such an approach also allows the category of each site to be predicted (Felsenstein and Churchill 1996; Yang 1995; Goldman, Thorne, and Jones 1996; Pedersen and Hein 2003; Siepel and Haussler 2003).

The Markov model presented here is clearly more limited in several respects than the process-based models of Jensen and Pedersen (2000; Pedersen and Jensen 2001). In particular, context effects cannot "cascade" in both directions along a single branch of the tree because inferences are restricted to N-tuples of bases along each branch. Furthermore, because the interdependencies are ignored between the ancestral states associated with overlapping column tuples, there is no globally consistent probabilistic treatment of the latent variables and their interdependencies. Nevertheless, the conditional distributions defined for this (N - 1)st order Markov process are valid probability distributions, and they legitimately combine to produce a valid likelihood: the sum of the probabilities of all alignments is 1 for a given L and n. Most importantly, the model allows efficient computation of likelihoods, and (as will be seen below) fits the data well.

Instead of estimating parameters separately for each functional category under the assumption of independent tuples, parameters can be estimated directly with respect to the entire Markov model (equation 4), using a quasi-Newton algorithm (the EM algorithm described above can no longer be used). It is reasonable to expect, however, that parameter estimates will be similar to those obtained under an assumption of independent tuples. This conjecture is examined experimentally later in this article and will be found to hold reasonably well.

Data
Our noncoding data set was drawn from a 1.8-megabase region of human chromosome 7 containing the CFTR gene and homologous sequences from eight other eutherian mammals (Thomas et al. 2003) (see species in fig. 4). The sequences were aligned with a new program called TBA (for "Threaded Blockset Aligner"; Blanchette et al. 2004), which was designed specifically for alignment of megabase-sized regions of multiple mammalian genomes. TBA builds a multiple alignment from local pairwise alignments, using a progressive alignment strategy and a predefined phylogenetic tree. It can "thread" a set of locally aligned blocks with respect to one sequence, such that every base of that sequence is represented exactly once, and all bases appear in order. In this case, the tree topology of Figure 4 was used, and the alignment was threaded with respect to the human sequence.



View larger version (19K):
[in this window]
[in a new window]
 
FIG. 4. The assumed tree topology with branch lengths in substitutions/site as estimated for the AR alignment (a) and the mRNA alignment (b), under the U3S and R3 models, respectively (with rate variation). b. The root of the tree has been arbitrarily placed at the midpoint of the branch between the primates/rodents and the other species (the estimated tree was unrooted). Branch lengths are drawn to scale, separately for each tree. Labels for branches appear to the left of each tree and are used in table 2

 
Sites were selected from this alignment corresponding to ancestral repeats (ARs)—transposons that appear to have been dispersed, and then become inactive, prior to the divergence of the species in question, and that are believed to have been evolving more or less neutrally since that time. Ancestral repeats were identified in the human sequence, using the program RepeatMasker (http://ftp.genome.washington.edu/cgi-bin/RepeatMasker/), and were mapped to the coordinate system of the multiple alignment. The corresponding columns were then extracted and combined into a subalignment, referred to below as the "AR alignment." Essentially the same set of repeat subfamilies was considered as in recent papers by the Mouse Genome Sequencing Consortium (2002) and Hardison et al. (2003).

The AR alignment was post-processed in various ways to eliminate artifacts from both the alignment procedure and the extraction of columns. To diminish the effects of alignment errors adjacent to indels (due to the placement of gaps to optimize an alignment score), a heuristic was used by which the 3 bases adjacent to each indel were discarded, as were maximal ungapped subsequences shorter than 15 bases in length (discarded bases were replaced by missing data characters—‘N's). Subsequently, columns having actual bases (non-gaps and non-‘N's) present in fewer than four species were also removed, so that sites at which two-thirds or more of the species either had gaps or bases bordering gaps were excluded from our analysis. Whenever one or more columns were removed, they were replaced with N - 1 columns of missing data (where N is the order of the substitution model to be applied), so that no artificial context was introduced (this "padding" procedure was also used when the AR sites were originally extracted from the complete alignment). The final, cleaned alignment consisted of 162,743 sites, with an average of 5.7 species represented at each site, and a set of padding columns about once every 36 sites. In the end, our results appeared not to be very sensitive to the alignment and post-processing methods (see Discussion).

For the coding data set, we used mRNA sequences from various mammalian species that match known genes in the human genome. (It is not yet possible to assemble a data set of coding sites from mammalian genomic DNA with comparable size and species diversity to the noncoding data set described above; the CFTR data set, for example, contains only about 20,000 sites in coding regions.) Using the alignments of nonhuman mRNAs to the human genome that are available from the UCSC Human Genome Browser (Kent et al. 2002), we selected those genes from a nonredundant set of RefSeq genes (Pruitt and Maglott 2001) having aligned mRNAs from at least two of eight (nonhuman) mammalian species. The same species were used as in the CFTR data set, and only autosomal genes were considered. The best matching mRNA from each species was selected for each gene, with a requirement that this mRNA matched the human genome significantly in no other place. (This rule acted approximately like a reciprocal-best criterion for orthology, but it was somewhat more conservative; it was a more natural choice here because of the asymmetry between the human genome and the nonhuman mRNAs, and because of the redundant nature of the mRNA data.) The coding exons of each RefSeq gene were extracted from the human genome, and re-aligned to the putatively orthologous mRNAs using the T-Coffee program (Notredame, Higgins, and Heringa 2000). The resulting multiple alignments were then cleaned by removing all sites upstream of the start codon and downstream of (and including) the stop codon in the human sequence, all sites with gaps, and (maximal) gapless segments of the alignment shorter than 30 sites in length. In addition, alignments were discarded in which frame-shift indels or premature stop codons occurred in any species (if the anomaly occurred in the last 20% of sites in the gene, only the downstream portion was discarded; the criterion for frame-shift indels was based on the total number of gap characters in each sequence between retained gapless blocks of the alignment, relative to the number in the human sequence). After the cleaning procedure, 2,441 alignments (genes) remained, with an average of 3.4 species per alignment (one of which was always human). Aside from the human genes themselves, this data set is dominated by rodent sequences (2,396 alignments include mouse or rat), and in particular by three-way human/mouse/rat alignments (1,375 alignments); however, cow is fairly well represented (695 alignments), as is pig (433 alignments). Chimp and baboon have the poorest representation, each being present in fewer than 30 alignments. For the purposes of fitting phylogenetic models, all 2,441 alignments were concatenated into a single large alignment consisting of 3,081,993 sites; the species that were absent for each gene were represented in the corresponding columns of this alignment by missing data characters (‘N's). This alignment has no gaps or stop codons and maintains reading frame completely, with every consecutive triplet of columns representing a column of codons. It is referred to below as the "mRNA alignment."


    Results
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results
 Discussion
 Supplementary Material

 Acknowledgements
 Literature Cited
 
This section begins with results for the noncoding data set (the AR alignment), where all sites are treated equally, and then turns to results for the coding data set (the mRNA alignment), where it is important to account for the codon position of each site. In both cases, the likelihoods of various models will first be compared, then the estimates obtained for model parameters will be discussed.

Results for Noncoding Data
Various models, with orders from N = 1 to N = 3, were fitted to the AR alignment, both assuming constant rates across sites and allowing for rate variation (discrete gamma method, k = 4 rate categories). Parameters were estimated with the EM algorithm described above, assuming independence of column tuples (with context-dependent models, the sites of the alignment were arbitrarily partitioned into adjacent N-tuples; see below). In all cases, the tree was assumed to have the topology shown later in figure 4, which has been confirmed to be correct by the presence and absence of particular instances of interspersed repeats in the sequences of the CFTR data set (Thomas et al. 2003). Alignment gaps were treated as missing data. Recall that the alignment consisted of L = 162, 743 sites, and that alignment gaps and ‘N's accounted for roughly one-third of the characters in each alignment column (on average).

Model Likelihoods
The log likelihoods of all models, under the assumption of constant rates, are shown in Figure 1a. Values are plotted relative to the simplest model, HKY. A striking improvement can be seen as the order of the models is increased from N = 1 to N = 2, and again from N = 2 to N = 3, indicating strong context effects in the substitution process. Indeed, these improvements far exceed what is achieved by increasing parameter complexity in models of the same order. Nevertheless, most of the improvements among models of the same order do appear to be significant. In several cases, such models can be compared using the standard likelihood ratio test (Huelsenbeck and Rannala 1997), which is applicable between "nested" models, M1 {subseteq} M2; here, HKY {subseteq} REV {subseteq} UNR, and RNS {subseteq} RN {subseteq} UN ⊇ UNS, for N = 2, 3. With all comparable pairs of models, the improvement of the more complex one over the less complex one is statistically significant by this test (P < 0.0001), except in the case of U3 versus U3S (P = 0.2).



View larger version (13K):
[in this window]
[in a new window]
 
FIG. 1. a. Log likelihoods of various models with respect to the AR alignment (noncoding data), relative to the log likelihood of the HKY model, without rate variation (see table 1). Models with N = 2 and N = 3 are context dependent. Independent N-tuples of sites were assumed for parameter estimation and likelihood evaluation. b. Similar plot showing the Bayesian information criterion (BIC) in place of the log likelihood (BIC = -2·loglik + m log L, where m is the total number of parameters and L is the number of sites in the alignment; here, L = 162,743)

 
The likelihood ratio test is not applicable for models of different order, and it can be misleading with large data sets (given enough data, it is possible to obtain small P values for complex models that fit the data only slightly better than simpler counterparts). The Bayesian information criterion (BIC) (Schwartz 1979) is an alternative way of evaluating improvements in likelihood vis-à-vis increases in model complexity, which can be used for non-nested models, and which considers the size of the data set. The BIC strongly supports the use of second-order models over single-nucleotide models, and it strongly supports third-order models over second-order models (fig. 1b). Among second-order models, it finds strand symmetry (compare R2S and R2, U2S, and U2) to be a preferable constraint on the rate matrix to reversibility (compare R2S and U2S, R2, and U2), with U2S achieving the best (lowest) score overall. This relationship does not hold for the third-order models, where the cost of the additional parameters of unrestricted models outweighs the resulting improvements in likelihood, and the assumption of reversibility is supported. We note, however, that the BIC is known to tend to overpenalize complex models (Hastie, Tibshirani, and Friedman 2001).

Figure 2 shows, for selected models, the improvements in likelihood obtained by allowing rate variation and overlapping column tuples. Allowing rate variation with this data set makes only a small difference in likelihood (for most models, allowing rate variation is supported by the likelihood ratio test but not by the BIC), and leaves the relative scores of the models essentially unchanged (fig. 2a). Apparently, the rate of substitution is not highly variable in these AR sites, consistent with the assumption of neutral evolution. In contrast, a very large increase in likelihood is observed when Markov dependence between columns is introduced—indeed, the improvement over the single-nucleotide models is nearly doubled for dinucleotide models and increases by about half for nucleotide-triplet models, despite no change in the number of parameters of the model (fig. 2b). Notice that, if tuples are assumed independent, then context is ignored at every Nth boundary between adjacent sites; if Markov dependence is subsequently introduced, then N boundaries are considered in place of each N - 1, for a relative gain of (2 for dinucleotides and 1.5 for nucleotide triplets). This effect appears in large part to explain the observed improvements, and also explains why the gap between the U2S and U3S models closes substantially when Markov dependence is introduced (fewer boundaries are ignored for U3S when independent tuples are assumed). Note that the parameters of all models were estimated under an assumption of independent tuples, so that the likelihoods in figure 2b represent a lower bound on what can be obtained with the Markov-dependent models; direct optimization of the parameters of these models could improve likelihoods further (see below).



View larger version (12K):
[in this window]
[in a new window]
 
FIG. 2. The effect on log likelihoods of allowing rate variation (a) and Markov dependence between columns (b), for selected models (AR alignment). In both plots, the light bars correspond to bars of figure 1a, and the dark bars indicate the improvements obtained. All values are relative to the log likelihood of the HKY model without rate variation

 
Two additional experiments were performed to test whether the increased likelihoods of context-dependent models, with and without Markov dependence of sites, might still somehow be an artifact of the construction of the models. The first experiment was a twofold cross-validation test based on the same data set (the AR alignment). The alignment was divided in half, selected models were trained on approximately the first L/2 sites, and likelihoods were evaluated on the second L/2 sites (L = 162,743); then the procedure was reversed, with training performed on the second half and likelihoods evaluated on the first. The second experiment involved randomizing the columns of an alignment, so that context was effectively erased, then fitting various models to the randomized data, and comparing likelihoods with those obtained from the original alignment. This experiment was performed with the first half of the AR data. The results of these two experiments, shown in figure 3, confirm that the improved likelihoods of the context-dependent models are due to their ability to capture real properties of the biological data, and not to statistical artifacts.



View larger version (14K):
[in this window]
[in a new window]
 
FIG. 3. Results of the cross-validation (a) and column-randomization (b) experiments. a. The log likelihoods of selected models are shown with respect to a test data set (the second half of the AR alignment), after model parameters had been estimated with respect to a separate training data set (the first half). The relative improvements shown in the previous figures remain essentially unchanged. The other part of the two-fold cross-validation test produced similar results (not shown). b. Training log likelihoods are plotted for the first half of the AR alignment ("Original") and for a version of the same data set in which the columns had been randomly permuted ("Random"). The advantage of the context-dependent models is eliminated with the randomized data. Note the slight likelihood increase for the randomized data under the assumption of independence, and the decrease under the assumption of Markov dependence, which reflect parameter overfitting (models were trained under the independence assumption). This effect should be diminished with larger training sets. All values in both panels are plotted relative to the HKY model

 
Finally, we tested the conjecture that parameter estimates can reasonably be obtained under an assumption of independent tuples, and then applied in a model that assumes Markov dependence of sites. At the same time, we examined whether it made any difference how a data set was partitioned into independent N-tuples, and whether those tuples were nonoverlapping. We fitted the U2S model to the AR alignment in four different ways: by assuming independent tuples (pairs) and training on (1) the "odd" pairs (all adjacent pairs (x•,i, x•,i+1) such that i is odd) (2) the "even" pairs, (3) both the odd and even pairs (so that each site was actually considered twice); and (4) by optimizing the full Markov model directly (computationally expensive but possible with N = 2). The likelihood of each model was then evaluated on the same data under the opposite assumption—that is, Markov dependence in the first three cases and independent tuples in the fourth—and all likelihoods (training and testing) were compared under each assumption. The likelihoods were very similar in all cases (within about 75 units of log likelihood), as were the actual parameter estimates (results not shown).

Parameter Estimates
All parameter estimates discussed in this section were obtained with the EM algorithm, under the assumption of independent tuples. The estimates of branch-length parameters were very similar under all models (those for U3S with rate variation are shown in fig. 4a). Indeed, estimates for 2nd and 3rd order models differed from the values shown in figure 4a by an average of only 1% to 2%, and estimates for the HKY model without rate variation (the simplest model considered) differed from them by an average of only 5.2%. In general, branch-length estimates increased slightly with the number of parameters in the substitution model, and in particular with the introduction of rate variation, as has been noted in studies of independent-site models (Yang, Goldman, and Friday 1994). Estimates of the shape parameter {alpha} of the gamma distribution were very similar for models of the same order, but they increased slightly with model order (from about 8 for N = 1 to about 11 for N = 3). The ratio of the expected transition rate to the expected transversion rate, ts/tv (Appendix B), was very similar for models of the same order, but it increased slightly with model order (2.03 for N = 1, 2.12–2.14 for N = 2, and 2.16–2.17 for N = 3).

More interesting are the estimates of specific rate-matrix parameters, which provide insight about context-dependent substitution in noncoding DNA. For context-dependent models, the elements of the estimated rate matrix Q (which, in the case of reversible models, also incorporate the equilibrium frequencies) spanned a wide range of values (e.g., 0.05–10.9 for the U3S model) and tended to cluster into three main groups, corresponding to transversions, transitions, and CpG transitions. CpG transversions (CG->AG and CG->GG, and their reverse complements) occurred at rates comparable to those of non-CpG transitions. Comparisons of context-dependent models with independent-site models (fig. 5) confirmed that the most pronounced context effects corresponded to CpG transitions, although CpG transversions also occurred at a significantly higher rate than expected (note that the mechanism behind CpG transversions is believed to be different from that behind CpG transitions; see, e.g., Blake, Hess, and Nicholson-Tuell 1992). CpG effects also appear to explain, at least in part, the somewhat poor fit of reversible, context-dependent models (as described above). When estimates of substitution rates under the U3 and U3S models are compared (fig. 6a), they are seen to be very closely correlated, indicating that the constraint of strand symmetry does not prohibit a near-optimal version of the matrix from being obtained. A similar comparison of the U3 and R3 models, however (fig. 6b), shows substantially poorer agreement for certain classes of substitutions, particularly CpG transversions (r = 0.55 for CpG transversions under U3 vs. U3S and r = 0.36 for CpG transversions under U3 vs. R3). The rates of these transversions appear to be systematically overestimated under R3, and the rates in the reverse direction appear to be underestimated.



View larger version (13K):
[in this window]
[in a new window]
 
FIG. 5. Substitution rates for the AR alignment as estimated under the U2 model versus rates estimated under the UNR model (unrestricted with independent sites). Each point represents a non-zero, non-diagonal element in the 16 x 16 dinucleotide rate matrix in the vertical dimension, and the corresponding element in the 4 x 4 single-nucleotide rate matrix in the horizontal dimension. The line y = x is shown for reference. Points corresponding to CpG substitutions on either strand are circled (substitutions of the form CG->bG or CG->Cd, where b,d {A,C,G,T}, b != C, d != G). Similar plots were obtained for other models, with both N = 2 and N = 3. The large differences in estimated transition rates for the UNR model (two rates are ~0.5 and two are ~0.9) appears to be mostly a consequence of G + C content (about 39% here)—transitions to G and C are estimated to occur at a lower rate than transitions to A and T

 


View larger version (22K):
[in this window]
[in a new window]
 
FIG. 6. Substitution rates estimated for the AR alignment under the U3 and U3S models (a), and under the U3 and R3 models (b). a. CpG substitutions (aCG->abG or CGa->bGa, where a, b {A, C, G, T} and b != C) are indicated by circled points and their strand-symmetric counterparts by points enclosed in diamonds. b. CpG substitutions on either strand (aCG->abG, CGa->bGa, aCG->aCd, CGa->Cda, where b != C, d != G) are indicated by circles, and the reverse substitutions are indicated by diamonds. In both a and b the tendency of the rates to group into (non-CpG) transitions, (non-CpG) transversions, and CpG transitions is evident, with CpG transversions clustering with non-CpG transitions

 
It is worth noting that our results do not reflect the transcription-associated mutational asymmetry recently reported by Green et al. (2003), based on an analysis of the same sequence data, although a substantial portion of the sites we have considered are believed to be transcribed. Three of the nine genes in the region in question, however, are located on the opposite strand from the other six, and so our data set contains a mixture of bases not only that are and are not transcribed, but that correspond to the transcribed and nontranscribed strands. This is presumably why strand asymmetries are not apparent from a comparison of the UN and UNS models.

The estimated rates for the U3S model were compared to rates of context-dependent substitution in human pseudogenes estimated by Hess, Blake, and Blake (1994), in an expansion of the study by Blake, Hess, and Nicholson-Tuell (1992). The reported "corrected relative rates" were used, which are ratios with respect to the slowest estimated rate, with an upward correction applied to CpG substitutions. The results of the comparison are shown in figure 7. The two sets of estimates are remarkably consistent overall, considering the difference in methods and our use of ARs from a single, local region of the genome instead of pseudogenes from various chromosomes. The largest disagreements correspond to CpG transitions, for which we estimate considerably higher rates (1.8–2.8 times higher than would be predicted by the regression line of figure 7). These disagreements may reflect real differences in the two data sets—e.g., due to differences in the methylation patterns in AR sites versus pseudogenes—but it is likely that they are at least partially a consequence of a non-negligible rate of multiple substitutions per site, or of ascertainment bias in Hess, Blake, and Blake's (1994) methods (their selection procedure, which was designed to eliminate cases of multiple substitutions per site, may have been biased against sites with CpG transitions).



View larger version (19K):
[in this window]
[in a new window]
 
FIG. 7. Substitution rates for the AR alignment as estimated under the U3S model versus relative rates in human pseudogenes reported by Hess, Blake, and Blake (1994). Points corresponding to CpG substitutions are circled, and points corresponding to TpA substitutions are enclosed in diamonds. Strand symmetry has been assumed for both sets of estimates, so only one point appears for each strand-symmetric pair of substitutions. The displayed curve is the linear regression line for non-CpG substitutions (note the log-log scale)

 
No simple explanations emerged for the variation in the estimates of non-CpG rates, which was considerable, both among transitions and among transversions (non-CpG transition rates for U3S had mean 0.653 and s.d. 0.203 and non-CpG transversion rates had mean 0.161 and s.d. 0.057). Indeed, several reported trends were not supported by our parameter estimates. For example, Hess, Blake, and Blake (1994) observed high rates of TA->CA transitions, but we did not find these rates to be unusual (fig. 7). Our estimates also did not show a significant dependency of the transition/transversion ratio on the A + T content of neighboring bases: the expected proportion of substitutions that are transversions, based on our estimated rate matrix, increased only slightly with the number of flanking A + T bases (under the U3 model, from 0.275 with 0 flanking A + T bases, to 0.290 with 1, and to 0.349 with 2), and this increase seemed to be largely a consequence of the CpG effect (when CpG substitutions were discarded the expected numbers were 0.334 for 0 A + T bases, 0.320 for 1, and 0.349 for 2). (We note that the proportions we estimated including CpG substitutions are comparable to those reported for primates by Zhao and Boerwinkle (2002) and Yang, Chen, and Li (2002), who appear not to have corrected for the CpG effect.) Our estimated rates also did not reflect the tendency observed in chloroplast genomes for a higher proportion of transversions when a 5' pyrimidine is present (Morton, Oberholzer and Clegg 1997). As has been noted for chloroplast genomes (Morton, Oberholzer, and Clegg 1997), the pattern of context-dependent substitution in the noncoding DNA of mammals appears to be complex, and is not easily explained in terms of a small number of well-defined phenomena (such as the CpG effect and the transition/transversion bias).

Results for Coding Data
A different set of models was fitted to the mRNA alignment, this time including the GYE and GYM versions of Goldman and Yang's (1994) codon model (Appendix B), as well as the HKY, REV, UNR, R3, and U3 models. Second-order models and strand-symmetric models were not considered. All models were fitted with and without rate variation. The EM algorithm was used, assuming independent column tuples (here corresponding to codons, for the third-order models), except for Goldman and Yang's models, which were fitted using the codeml program in the PAML package (Yang 1997). The same tree topology was assumed as in the previous section. Recall that the mRNA alignment consisted of 993 sites, L = 3, 081, with 3.4 species represented per site; sites with alignment gaps had been removed.

For the R3 and U3 models, the effect of assuming Markov dependence between columns was also examined. In this case (unlike in the previous section), three separate functional categories were considered, corresponding to the 1st, 2nd, and 3rd codon positions, and a separate third-order model was fitted to the sites of each category, using the EM algorithm. Thus, the third codon position model was treated like an ordinary codon model, but the first and second position models were trained on triples that straddled adjacent codons, so that context effects across codon boundaries were considered. All three models were incorporated into the likelihood calculation, as shown in equation 6.

Model Likelihoods
Figure 8 shows the log likelihoods and BIC scores of all models, with and without rate variation. The codon models are seen to fit the data overwhelmingly better than the single-nucleotide models, indicating pronounced context-dependence among sites of the same codon. The R3 and U3 models, however, performed substantially better than Goldman and Yang's model; apparently, the pattern of codon substitution is complex, and is characterized only approximately by these simply parameterized models. The use of a physicochemical distance matrix (GYM) produced a relatively small improvement over the naive assumption of equal distances between amino acids (GYE), supporting Yang, Nielsen, and Hasegawa's (1998) conclusion that physicochemical distances and substitution rates are not strongly correlated (see below). We expect that the R3 and U3 models would improve similarly on other existing codon models, which have been reported to perform roughly as well as Goldman and Yang's model (Schadt and Lange 2002).



View larger version (16K):
[in this window]
[in a new window]
 
FIG. 8. a. Log likelihoods of various models with respect to the mRNA alignment (coding data), with and without rate variation. N-tuples of sites were assumed independent in parameter estimation and likelihood evaluation. b. Similar plot showing the BIC in place of the log likelihood (see legend of figure 1). The number of sites here is L = 3,081,993

 
Allowing for rate variation made a large difference for all models, but a larger difference for single-nucleotide models and Goldman and Yang's codon model than for R3 and U3; the reason is probably that rate variation is being used to compensate for an oversimplified representation of substitution patterns (estimates of the shape parameter {alpha} were about 0.40 under the single-nucleotide models, 0.95 under GYE and GYM, and 1.4 under R3 and U3).

The U3 model outscored the R3 model by about 2,000 units of log likelihood (with and without rate variation), a sufficient improvement to reject the hypothesis of reversibility under the likelihood ratio test. The R3 and U3 models are almost equal according to the BIC, however, and indeed, R3 is slightly preferable in the case with rate variation. Despite some evidence against reversibility of codon substitution, it appears to be a reasonable modeling assumption, especially when its mathematical convenience is taken into consideration.

When Markov dependence between sites was introduced with the U3 and R3 models (fig. 9a), another large improvement in likelihoods occurred, indicating that context effects that cross codon boundaries are important (see also Pedersen, Wiuf, and Christiansen 1998). This improvement holds up well under cross-validation (fig. 9b). It remains to be determined to what extent the CpG effect is responsible for the observed improvements in likelihood (e.g., because of synonymous transitions at CpG), and in general, how similar the context effects in coding DNA (particularly at unconstrained sites) are to the ones observed in putatively neutral DNA.



View larger version (16K):
[in this window]
[in a new window]
 
FIG. 9. a. Log likelihoods of various models with respect to the mRNA alignment, with and without assuming Markov dependence of sites. All values are relative to the log likelihood of the HKY model without rate variation. The Markov model is not compatible with Goldman and Yang's model, which can only be applied to in-frame codons. For the Markov-dependent REV model, separate models were fitted to sites of each codon position, and their log likelihoods were combined. b. Similar plot showing cross-validation results, with models trained on the first half of the mRNA alignment, and likelihoods evaluated on the second half. (The training likelihood for the second half of the alignment is shown for the GYM model, because software was not available to train and test on different data sets; this value represents an upper bound on what would be obtained under cross-validation.) The relative scores of the models remain essentially unchanged

 
Parameter Estimates
The estimated branch lengths based on the mRNA alignment were similar for all models, but they were less consistent than the estimates for the noncoding data set. Results for selected models are shown in Table 2, with branch-length labels defined in figure 4b. (Throughout this section, estimates are presented for the versions of R3 and U3 that were trained like ordinary codon models.) The estimates for single-nucleotide models differed considerably at several branches from those for the codon models, and Goldman and Yang's model produced estimates at a few branches that differed significantly from those of R3 and U3. Most of the major differences, however, appeared in the subtree of the primates, and they should be interpreted with caution, because of sparse data for chimp and baboon, and because noise in the mRNA sequences will have a disproportionate effect on short branches. In general, these results suggest that simply parameterized codon models, and even independent-site models with rate variation, may be adequate for branch length estimation, despite the improved fit of richer models.


View this table:
[in this window]
[in a new window]
 
Table 2 Branch Lengths Estimated for the mRNA Alignment Under Selected Models (substitutions/site).