MBE Advance Access originally published online on May 30, 2003
Mol. Biol. Evol. 20(9):1377-1419. 2003
DOI: 10.1093/molbev/msg140
© 2003 by the Society for Molecular Biology and Evolution. ISSN: 0737-4038
Review Article
The Evolution of Transcriptional Regulation in Eukaryotes
Department of Biology, Duke University
E-mail: gwray{at}duke.edu.
| Abstract |
|---|
Gene expression is central to the genotype-phenotype relationship in all organisms, and it is an important component of the genetic basis for evolutionary change in diverse aspects of phenotype. However, the evolution of transcriptional regulation remains understudied and poorly understood. Here we review the evolutionary dynamics of promoter, or cis-regulatory, sequences and the evolutionary mechanisms that shape them. Existing evidence indicates that populations harbor extensive genetic variation in promoter sequences, that a substantial fraction of this variation has consequences for both biochemical and organismal phenotype, and that some of this functional variation is sorted by selection. As with protein-coding sequences, rates and patterns of promoter sequence evolution differ considerably among loci and among clades for reasons that are not well understood. Studying the evolution of transcriptional regulation poses empirical and conceptual challenges beyond those typically encountered in analyses of coding sequence evolution: promoter organization is much less regular than that of coding sequences, and sequences required for the transcription of each locus reside at multiple other loci in the genome. Because of the strong context-dependence of transcriptional regulation, sequence inspection alone provides limited information about promoter function. Understanding the functional consequences of sequence differences among promoters generally requires biochemical and in vivo functional assays. Despite these challenges, important insights have already been gained into the evolution of transcriptional regulation, and the pace of discovery is accelerating.
Key Words: binding site; enhancer; evolution of development; genotype-phenotype relationship; promoter; transcription factor
| 1 Introduction |
|---|
A gene embedded in random DNA is inert. In the absence of sequence motifs and proteins capable of directing transcription, the protein it encodes will remain invisible to selection. Every gene with a phenotypic impact is flanked by regulatory sequences that, in conjunction with the expression and activity of proteins encoded elsewhere, regulate when expression occurs, at what level, under what environmental conditions, and in which cells or tissues. Transcriptional regulatory sequences are as important for gene function as the coding sequences that determine the linear array of amino acids in a protein.
Transcriptional regulation is also a crucial contributor to evolutionary change in the genotype-phenotype relationship. Understanding the dynamic link between genotype and phenotype remains a central challenge in evolutionary biology (Wright 1982; Raff 1996; Wilkins 2002). Enormous advances have been made during the past few decades in understanding the dynamics of alleles within populations, the role of genes during development, and the evolution of phenotype. Although these studies have progressed along nearly independent paths (for historical perspectives, see Raff [1996] and Wilkins [2002]), they have recently begun to intersect in fruitful and exciting ways in studies of gene expression. This work is making substantial contributions to the understanding of how the genotype-phenotype relationship evolves.
The goal of this review is to bring transcriptional regulation into the mainstream of molecular evolution. We are concerned here with promoters (cis-regulatory sequences that influence transcription) and transcription factors (proteins that interact with these sequences). Throughout, we emphasize three general points. First, changes in transcriptional regulation comprise a quantitatively and qualitatively significant component of the genetic basis for evolutionary change. Second, understanding how transcriptional regulation evolves requires a clear grasp of how the relevant macromolecules interact and function in living cells. And third, studying the evolution of transcriptional regulation poses unique and significant challenges to both empirical and analytical approaches. These challenges are balanced, however, by extraordinary opportunities to extend and deepen our understanding of the genetic basis for phenotypic evolution.
| 2 Why Promoter Evolution Matters |
|---|
Several recent reviews have argued that changes in transcriptional regulation constitute a major component of the genetic basis for phenotypic evolution (Doebley and Lukens 1998; Carroll 2000; Stern 2000; Tautz 2000; Theissen et al. 2000; Purugganan 2000; Wray and Lowe 2000; Carroll, Grenier, and Weatherbee 2001; Davidson 2001; Wilkins 2002). Although the authors reached similar conclusions, they provided limited evidence to support the claim that mutations affecting transcriptional regulation have important evolutionary consequences. In this section we therefore review the theoretical arguments and empirical evidence that transcriptional regulation plays a pervasive and important role in evolution.
2.1 Theoretical Arguments: Why Promoters Ought to Contribute to Phenotypic Evolution
Before direct evidence was available, a few far-sighted biologists argued on the basis of first principles that changes in gene expression should constitute an important part of the genetic basis for phenotypic change (Jacob and Monod 1961; Wallace 1963; Zuckerkandl 1963; Britten and Davidson 1969, 1971; King and Wilson 1975; Wilson 1975; Jacob 1977; Raff and Kaufman 1983). Their arguments were based in part on the realization that the phenotypic impact of a gene is a function of two distinct components: the biochemical activity of the protein it encodes and the specific conditions under which that protein is expressed and is therefore able to exert its activity. During subsequent decades, the field of molecular evolution focused on the evolutionary implications of the first component of function, while developmental biologists were more concerned with the functional implications of the second. The revival of "evo-devo" has focused attention on a more integrative view that encompasses both protein function and gene expression (Raff 1996; Wilkins 2002).
Four additional considerations suggest that transcriptional regulation ought to be evolutionarily important. (1) Significant phenotypes. Many authors have commented on the direct relationship between when or where a gene is expressed and the functionally significant phenotypes that might result from changing these parameters (Raff and Kaufman 1983; Gerhart and Kirschner 1997; Carroll 2000; Davidson 2001; Wilkins 2002). For instance, earlier expression of a hormone might result in accelerated growth, whereas ectopic expression of a transcription factor might result in a duplicated structure. Importantly, these phenotypic transformations can be independent of changes in protein sequences. Changes in precisely how transcription is regulated can also have significant phenotypic consequences (Paigen 1989). For instance, synthesizing a digestive enzyme in response to feeding or resource availability might prove advantageous compared with continuous production (Jacob and Monod 1961). Such changes may form the basis of polyphenism and phenotypic plasticity (Schlichting and Pigliucci 1998; Gilbert 2001). (2) Coordinated pleiotropy. Because the proteins that regulate transcription interact with batteries of functionally related genes, a mutation affecting the function or expression of a transcription factor can potentially produce a coordinated phenotypic response (Raff and Kaufman 1983; Gerhart and Kirschner 1997; Carroll, Grenier, and Weatherbee 2001; Wilkins 2002). Mutations in the expression of transcriptional regulators are therefore not simply more pleiotropic; they are more likely to produce functionally integrated phenotypic consequences. (3) The "Hox paradox." The discovery that many developmental regulatory genes and their expression profiles are phylogenetically widespread within the plant and animal kingdoms (Gerhart and Kirschner 1997; Carroll et al. 2001) raises an obvious problem: How do orthologous regulatory proteins pattern anatomically disparate organisms? At least part of the answer seems to lie in evolutionary reorganization of gene networks, such that many interactions between these proteins and the collection of genes that they regulate have changed since flies and mice last shared a common ancestor (Wray and Lowe 2000; Davidson 2001; Wilkins 2002). (4) Evolvability. Promoters may be more "evolvable" than coding regions (Gerhart and Kirschner 1997; Stern 2000; Carroll, Grenier, and Weatherbee 2001; Wilkins 2002). Many promoters are organized into functional modules, each of which produces a discrete aspect of the overall expression profile (Arnone and Davidson 1997), confining pleiotropy and allowing selection to modify discrete aspects of the overall expression profile independently. In addition, many promoter alleles are likely to be codominant and thus immediately visible to selection, increasing the efficiency with which beneficial alleles are fixed and deleterious ones are eliminated.
2.2 Mutations in Transcriptional Regulation Influence Phenotype
Transcriptional regulation is an integral component of the way genotype is converted into phenotype. Many mutants that have emerged from genetic screens for developmentally important genes involve defects in transcriptional regulation (Wilkins 1993, 2002; Gilbert 2000). The four-winged fly that results from certain mutations in Ubx in Drosophila is perhaps the most famous: some mutations located in regulatory sequences affect the transcription profile, and others located in exons alter the function of the protein in regulating the transcription of other genes (Bender et al. 1983; Simon et al. 1990). The phenotypic consequences of some Ubx promoter mutations are so distinct that they were originally thought to represent separate genes (Lewis 1978).
Numerous studies have documented correlations between gene expression and anatomy. (1) Induced mutations. The phenotypes of some induced mutations mimic natural differences between species. Examples include homeotic mutations in Drosophila melanogaster that mimic segment and appendage number and identity characteristic of other insects (Raff and Kaufman 1983; Carroll 1995), mutations in Arabidopsis thaliana and Antirrhinum majus that mimic the floral anatomy of other angiosperms (Lawton-Rauh et al. 2000), and mutations in Caenorhabditis elegans that mimic the tail anatomy of other nematodes (Fitch 1997). Because most of these induced mutations generally do not replicate the genetic basis for natural phenotypic differences (Carroll 1995; Budd 1999), however, convincing evidence of the evolutionary significance of changes in transcriptional regulation must come from natural cases. (2) Comparisons of expression. In many cases, a gene required for the development of a trait in one species shows a difference in expression in other species that correlates with a difference in that trait (e.g., Burke et al. 1995; Brakefield et al. 1996; Dudareva et al. 1996; Sinha and Kellogg 1996; Averof and Patel 1997; Stockhaus et al. 1997; Abzhanov and Kaufman 2000; Kopp et al. 2000; Yamamoto and Jeffery 2000; Beldade, Brakefield, and Long 2002; Bharathan et al. 2002; Hariri et al. 2002). A causal relationship is plausible but not proven in these cases, because comparisons of gene expression cannot by themselves demonstrate that a change in transcriptional regulation is the genetic basis for a phenotypic difference. (3) Quantitative genetics. Anatomical changes that accompanied the domestication of maize from teosinte are due in part to changes within the inferred promoter region of a single gene encoding the transcription factor teosinte-branched (Wang et al. 1999). Although this is a case of artificial selection, it involved natural (rather than induced) genetic variation. 
Some differences in bristle patterns among Drosophila species are attributable to changes in promoter sequences (Stern 1998; Skaer and Simpson 2000; Sucena and Stern 2000). In other cases, genetic variation in gene expression levels shows strong associations with specific organismal phenotypes (Gerber, Fabre, and Planchon, 2000; Karp et al. 2000; Beldade, Brakefield, and Long 2002). Unfortunately, because of the confounding effects of linkage disequilibrium, quantitative genetics generally lacks the resolution to identify precise sequence differences that are responsible for particular phenotypes. When combined with experimental tests or case associations, however, specific sequence variants can be identified (Cooper 1999). Using this approach, more than 160 segregating promoter variants that influence transcription have been identified in humans (Cooper 1999; Rockman and Wray 2002), and several have been identified in Drosophila melanogaster (e.g., Robin et al. 2002).
2.3 Natural Populations Harbor Considerable Functional Variation in Gene Expression
Many examples of variation in gene expression are known from natural populations. (1) Spatial extent of expression. In rainbow trout, an allele of PGM1 conferring expression in the liver is associated with faster prehatching growth (Allendorf, Knudsen, and Phelps 1982; Allendorf, Knudsen, and Leary 1983). The spatial expression of amylase in the midgut varies within both Drosophila melanogaster and D. pseudoobscura; the genetic basis in both cases is trans and responds to artificial selection in D. pseudoobscura (Abraham and Doane 1978; Powell 1979; Powell and Lichtenfels 1979). The spatial extent of expression of the transcription factor Distal-less within the wing of the butterfly Bicyclus anynana varies in correlation with wing color pattern, and it also responds to artificial selection (Beldade, Brakefield, and Long 2002). (2) Level of expression. Intraspecific differences in expression have been noted for GPDH in both larvae and adults of D. melanogaster (Laurie-Ahlberg and Bewley 1983); β-glucuronidase in Mus domesticus (Pfister et al. 1982; Bush and Paigen 1992); Cyp6g1, a cytochrome P450 family gene, in D. melanogaster (Daborn et al. 2002); and prolactin in the teleost Oreochromis niloticus (Streelman and Kocher 2002). In all four cases, most or all of the polymorphisms described are in cis. Many additional examples are known from humans, where nearly two-thirds of the known functional polymorphisms in cis-regulatory sequences have a greater than twofold impact on transcription rates (Rockman and Wray 2002). (3) Inducibility of expression. Inducibility of amylase expression in response to a starch diet varies within D. melanogaster and responds to artificial selection (Matsuo and Yamazaki 1984; Klarenberg, Sikkema, and Scharloo 1987); expression of β-glucuronidase in response to androgen varies within Mus domesticus (Bush and Paigen 1992); and three different mobile element insertions into the promoter of hsp70 reduce transcription in response to thermal stress in D. melanogaster populations (Lerman et al. 2003). Several other examples of variation in inducibility are known from humans (Rockman and Wray 2002). In the human and hsp70 cases, the genetic basis is known to reside in cis.
Additional studies have estimated the extent of heritable genetic variation in gene expression within populations. (1) Protein-based surveys. Several studies have measured levels of variation in gene expression from 1- or 2-dimensional protein gels in a variety of organisms: Zea mays (Burstin et al. 1994; Damerval et al. 1994; de Vienne et al. 2001), Pinus pinaster (Costa and Plomion 1999), Glycine max (Gerber, Fabre, and Planchon 2000), Mus musculus (Klose et al. 2002), and Homo sapiens (Enard et al. 2002a). Studies with the first three organisms documented that protein abundance has a strong genetic component, and all of these studies found that populations contain considerable variation in expression level for most of the proteins surveyed. In D. melanogaster, chromosome substitution lines show substantial levels of variation in gene expression as measured by enzyme activities (Laurie-Ahlberg et al. 1980; Wilton et al. 1982; Clark 1990). Although protein abundance and enzyme activity are indirect indices of transcription, these results suggest considerable genetic variation for gene expression in general. (2) mRNA-based surveys. More direct estimates of variation in transcription come from microarray analyses that survey thousands of loci. Studies in mice (Karp et al. 2000; Schadt et al. 2003), humans (Schadt et al. 2003), the teleost Fundulus heteroclitus (Oleksiak, Churchill, and Crawford 2002), D. melanogaster (Jin et al. 2001; Rifkin, Kim, and White 2003), Zea mays (Schadt et al. 2003), and Saccharomyces cerevisiae (Cavalieri, Townsend, and Hartl 2000; Brem et al. 2002) all indicate that genetic variation in transcript abundance is pervasive within populations. Much of this variation may be heritable. Schadt et al. (2003) found that 33% of the 23,574 loci surveyed from a cross of two inbred strains of mice showed a genetic component for expression differences within the liver, 29% of the 2,726 loci surveyed from 56 humans belonging to four families showed a heritable difference in expression within lymphoblasts, and 18,805 genes consistently differed in transcription within ear leaf tissue among progeny from a cross of two maize strains. What proportion of the genetic basis for this variation resides in the promoters of the genes showing transcriptional variation (cis) or in the sequences or expression profiles of their upstream regulators (trans) has been examined in a few cases. Quantitative trait loci (QTL) underlying variation in expression of at least 32% of 570 variably expressed transcripts in yeast mapped in cis (Brem et al. 2002), whereas the comparable fraction of genes with cis-acting QTL in mouse liver is even higher (Schadt et al. 2003). Reverse transcriptase polymerase chain reaction (RT-PCR) offers more reliable quantitation than microarrays, and it also provides a means of directly comparing transcription rates among alleles. In a preliminary survey of 69 loci in four inbred lines of Mus musculus, Cowles et al. (2002) found quantitative and tissue-specific variation among alleles at 4 loci. Using a similar approach, Yan et al. (2002) found evidence of variation in gene expression at 6 of 13 loci examined in humans. Taken together, microarray and RT-PCR surveys of mRNA levels provide solid evidence of abundant genetic variation in transcriptional regulation in diverse species, and they suggest that much of this variation resides in cis-regulatory sequences. (3) Detailed analyses of promoter function. The most extensive direct evidence of functional variation in promoter sequences now available comes from humans, where many specific polymorphisms have been identified through direct functional studies (Cooper 1999).
Although the human genome is not particularly polymorphic, a typical individual is estimated to be heterozygous for a functional promoter polymorphism at 40% of all loci (Rockman and Wray 2002). Comparable data do not yet exist for other species, but RT-PCR surveys (Cowles et al. 2002; Yan et al. 2002) provide a rapid means of estimating heterozygosity that affects transcription at many loci.
2.4 Natural Selection Operates on Allelic Variation in Promoters
Evidence for natural selection on eukaryotic promoter alleles comes from a variety of sources (also see section 4.7). (1) Human populations. Promoter polymorphisms at numerous loci in humans have functional consequences that influence diverse aspects of physiology, behavior, anatomy, and life history (Cooper 1999; Rockman and Wray 2002). Some of these promoter alleles have likely fitness consequences (for examples, see next paragraph and section 4.7). (2) Wild populations. A latitudinal cline of LDH promoter allele frequencies in the teleost Fundulus heteroclitus is probably maintained by temperature differences (Crawford, Segal, and Barnett 1999; Segal, Barnett, and Crawford 1999). Two other cases, mentioned earlier, are known from D. melanogaster: promoter alleles segregating at both Cyp6G1 and hsp70 appear to be under selection in wild populations (Daborn et al. 2002; Lerman et al. 2003). (3) Artificial selection and experimental evolution. Domestication of maize involved selection on the inferred regulatory region of the tb locus (Wang et al. 1999). Studies with yeast point to regulation of transcription as a critical component of adaptive change. Adaptation of Saccharomyces cerevisiae to glucose limitation was accompanied by twofold or greater changes in the abundance of transcripts from nearly 10% of all genes, consistently across replicates (Ferea et al. 1999). The evolution of drug resistance in experimental populations of Candida albicans correlated with overexpression of the four known resistance genes (Cowen et al. 2000). (4) Sequence comparisons. More extensive, but less direct, evidence that natural selection acts on promoters comes from cases of apparent evolutionary conservation of cis-regulatory sequences among distantly related species (for examples, see section 4.1). 
Consistent underrepresentation of specific sequence motifs provides evidence for genome-wide selection to remove spurious transcription initiation sequences in a broad diversity of prokaryotes (Hahn, Stajich, and Wray 2003).
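The logic behind a motif-underrepresentation argument can be illustrated with a minimal sketch. It compares the observed number of occurrences of a motif with the number expected if bases were drawn independently at their observed frequencies. The function name `motif_obs_exp` and the independent-base null model are assumptions made for illustration; this is not the procedure used in the study cited above, which may rely on a different null model.

```python
from collections import Counter
from typing import Tuple


def motif_obs_exp(seq: str, motif: str) -> Tuple[int, float]:
    """Return (observed, expected) counts of `motif` in `seq`.

    Expected counts assume each base is drawn independently at its
    observed frequency in `seq` (a deliberately simple null model).
    """
    seq, motif = seq.upper(), motif.upper()
    n, k = len(seq), len(motif)
    if k == 0 or n < k:
        return 0, 0.0

    # Observed: count every start position, so overlapping hits are included.
    observed = sum(1 for i in range(n - k + 1) if seq[i:i + k] == motif)

    # Expected: product of base frequencies times the number of start positions.
    freqs = Counter(seq)
    p = 1.0
    for base in motif:
        p *= freqs[base] / n
    expected = p * (n - k + 1)
    return observed, expected


if __name__ == "__main__":
    obs, exp = motif_obs_exp("ACGT" * 25, "TATA")
    print(obs, exp)
```

A ratio of observed to expected well below 1 across many genomes would be consistent with selection against the motif, although base composition bias and higher-order sequence structure would need to be ruled out with a richer null model.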
Several examples of natural selection operating on transcriptional regulation involve pathogen-host interactions. For instance, some promoter alleles in Mycobacterium tuberculosis and hepatitis B alter transcription to the pathogen's benefit and may be under positive selection (Buckwold et al. 1997; Rinder et al. 1998; Lee et al. 2000; Kajiya et al. 2001). The origin and subsequent fixation of these mutations in separate host individuals demonstrate the ability of positive selection to operate in a predictable way on genetic variation within a promoter. Specific variants within the human immunodeficiency virus (HIV) promoter, including gains of binding sites for host nuclear factor kappa-B (NF-κB) and upstream stimulatory factor (USF), as well as functional modifications in the basal promoter, cause differences in the level of viral transcription (Montano et al. 1997; Jeeninga et al. 2000). The E subtype of HIV has significantly increased transcription rates and has gone to near fixation locally in southern Africa; it is associated with increased levels of secondary infections and may be under positive selection to the pathogen's advantage (Montano et al. 2000; Hunt, Johnson, and Tiemesse 2001). Conversely, human populations harbor promoter variants that influence susceptibility to pathogens or disease progression after infection. Because human generation times are much longer than those of pathogens, signatures of selection are more difficult to detect. Nonetheless, promoter alleles at TNF, IL-4, IL-10, FY, CCR5, and TGFβ influence mortality from a variety of viral, bacterial, and protoctistan pathogens and are likely to be under selection (Tournamille et al. 1995; Hamblin and Di Rienzo 2000; Shin et al. 2000; Thurz 2001; Bamshad et al. 2002; Meyer et al. 2002; Nakayama et al. 2002; Vidigal, Gemner, and Zein 2002). Some promoter alleles confer protection from one pathogen while increasing susceptibility to another (e.g., TNF -380A: Meyer et al. 2002), raising the possibility of balanced polymorphisms.
2.5 Divergence in Promoter Function May Contribute to Reproductive Isolation
Changes in transcriptional regulation may also be important in speciation. The Dobzhansky-Muller model of speciation requires interspecific differences at pairs of interacting loci (Dobzhansky 1936; Muller 1942). Because of the large number of highly specific interactions that occur between proteins and DNA within promoters, these regions represent likely sites for postzygotic isolation resulting from multilocus epistasis (Johnson and Porter 2000). Empirical support comes from genetic loci that are involved in reproductive isolation. Only four such loci have been identified definitively, and all have turned out to involve changes in transcriptional regulation: the coding sequence of the transcription factor Odysseus within the genus Drosophila (Ting et al. 1998); promoter sequences of Xmk2 and CKDN2X within the teleost genus Xiphophorus (reviewed in Orr and Presgraves 2000); and a promoter polymorphism in desaturase 2 of D. melanogaster that is correlated with intraspecific differences in mating behavior and may be involved in premating isolation (Fang, Takahashi, and Wu 2002).
| 3 Transcriptional Regulation in Eukaryotes |
|---|
The familiar regularities that characterize coding sequences, in particular the genetic code, are absent from promoters. Understanding the functional consequences of evolutionary differences in promoter sequences therefore requires a clear knowledge of the mechanisms of transcriptional regulation. In this section, we review the structure and function of eukaryotic promoters. The literature on this topic is vast, and the emphasis here is on features directly pertinent to promoter evolution. Our focus is on the transcription of protein-coding loci, which comprise the majority of genes in eukaryotic genomes and about which the most information is available. Transcriptional regulation in Eubacteria is distinct in many ways (Struhl 1999; Lewin 2000), whereas in Archaea it is not particularly well understood (although the latter shares many features with eukaryotic regulation: Bell and Jackson 1998; Weinzierl 1999). Neither prokaryotic group is covered in this review. For more detailed reviews of mechanisms of eukaryotic transcriptional regulation see Latchman (1998), Weinzierl (1999); Carey and Smale (2000), Lee and Young (2000), Lewin (2000), Davidson (2001), Locker (2001), and White (2001).
3.1 Promoters and Gene Expression
Only some of the genes in a eukaryotic cell are expressed at any given moment. The proportion and composition of transcribed genes changes considerably during the life cycle, among cell types, and in response to fluctuating physiological and environmental conditions (e.g., White et al. 1999; Iyer et al. 2001; Kayo et al. 2001; Mody et al. 2001; Arbeitman et al. 2002). Given that eukaryotic genomes contain on the order of 0.5 to 5 × 10⁴ genes, regulating this differential gene expression requires an exceptionally complex array of specific physical interactions among macromolecules.
3.1.1 Most Regulation of Gene Expression Occurs at the Level of Transcription
Eukaryotes employ diverse mechanisms to regulate gene expression, including chromatin condensation, DNA methylation, transcriptional initiation, alternative splicing of RNA, mRNA stability, translational controls, several forms of post-translational modification, intracellular trafficking, and protein degradation (Lewin 2000; Alberts et al. 2002). Of these broad categories, the most common point of control is the rate of transcriptional initiation (Latchman 1998; Carey and Smale 2000; Lemon and Tjian 2000; White 2001). For virtually every eukaryotic gene where relevant information exists, transcriptional initiation appears to be the primary determinant, or one of the most important determinants, of the overall gene expression profile.
3.1.2 Transcriptional Regulation Is Primarily Gene-Specific
To a first approximation, the transcription of each gene in a eukaryotic genome is controlled independently. Operons (multi-locus transcripts regulated by a single promoter) are unusual in eukaryotes, a contrast with most prokaryotes. (Eukaryotic exceptions include the protozoan Trypanosoma brucei and the nematode Caenorhabditis elegans, where a substantial fraction of genes are transcribed as polycistronic mRNAs: Blumenthal 1998). Even paralogs within gene families are typically regulated independently and often have quite different expression profiles (e.g., Ferris and Whitt 1979; Fang and Brandhorst 1996; Christophides et al. 2000; Gu et al. 2002). Although a regulatory region sometimes directly influences the transcription of two loci (for examples, see section 3.3.7 and fig. 2), such cases apparently are uncommon. Distributed transcriptional regulation allows selection to fine-tune the expression profile of each gene independently.
3.1.3 Gene Expression Profiles Are Complex
Most genes are differentially transcribed across the life cycle, according to environmental conditions, in different cell types and regions, and among sexes. Transcriptional regulation is a highly dynamic process: rates of RNA synthesis can fluctuate by orders of magnitude, change over time scales of minutes, and differ among adjacent cells. Most genes have spatially and temporally heterogeneous expression profiles. Genes encoding regulatory proteins possess some of the most complex expression profiles. In metazoans and metaphytes, most such genes are expressed in several distinct domains (Gerhart and Kirschner 1997; Davidson 2001). For instance, the transcription factor Pax-6 is expressed at different times and at different levels in the telencephalon, hindbrain, and spinal cord of the central nervous system; in the lens, cornea, neural and pigmented retina, lacrimal gland, and conjunctiva of the eye; and in the pancreas (Kammandel et al. 1999). Where data are available, they link distinct phases of these complex expression profiles to distinct regulatory functions (Wray and Lowe 2000; Davidson 2001; Wilkins 2002). Although the transcription profiles of "housekeeping" genes are generally much simpler, most are transcribed at different levels among cell types and are shut down in response to extreme environmental conditions such as heat shock.
3.1.4 Promoters Integrate Information and Alter Transcription Accordingly
At its most fundamental level, the function of a promoter is to integrate information about the status of the cell in which it resides, and to alter the rate of transcriptional initiation of a single gene accordingly. The inputs that a promoter integrates can take many forms. The promoters of genes expressed during early development integrate spatial and temporal inputs to produce highly dynamic patterns of transcription in specific regions of the embryo (Davidson 2001; Wilkins 2002). The promoters of genes encoding housekeeping proteins are constitutively active, but they can shut down in response to specific conditions, such as heat shock or starvation (Pirkkala, Nykanen, and Sistonen 2001). Other promoters are off by default, but they can be activated in response to specific hormonal, physiological, or environmental cues (Benecke, Gaudon, and Gronemeyer 2001; Shore and Sharrocks 2001). These diverse inputs eventually reach promoters in the form of transcription factors, proteins that bind in a sequence-specific manner to the DNA near a gene, altering rates of transcriptional initiation. The shifting array of active transcription factors within the nucleus determines whether a gene is transcribed or not and how much mRNA is produced from it.
3.2 Promoter Structure
The organization of promoters is much less regular than that of coding sequences and lacks an equivalent of the genetic code or other sequence features that provide a consistent relationship to function. This fact has far-reaching implications for studying the evolution of promoter structure and function (see section 5).
3.2.1 Promoters Lack Universal Structural Features
No consistent sequence motifs exist for promoters of protein-coding genes. Two functional features are always present (fig. 1A), although they cannot always be recognized from sequence information alone. One is a basal promoter (or core promoter), the site upon which the enzymatic machinery of transcription assembles. Although necessary for transcription, the basal promoter is apparently not a common point of regulation, and it cannot by itself generate functionally significant levels of mRNA (Kuras and Struhl 1999; Lee and Young 2000; Lemon and Tjian 2000). The other functional feature is a collection of diverse transcription factor binding sites that confer specificity of transcription. Proteins bound to these sites produce a scalar response, the frequency with which new transcripts are initiated (Latchman 1998; Davidson 2001; Locker 2001).
3.2.2 The Transcriptional Machinery Assembles on the Basal Promoter
Eukaryotic genes that encode proteins are transcribed by the RNA polymerase II holoenzyme complex, which is composed of 10 to 12 proteins (Orphanides, Lagrange, and Reinberg 1998; Lee and Young 2000). This transcriptional machinery assembles on the basal promoter, a ~100-bp region whose functions are to provide a docking site for the transcription complex and to position the start of transcription relative to coding sequences (Reinberg et al. 1998; Lee and Young 2000; Pugh 2001). Basal promoter sequences differ among genes. For many genes, the critical binding site is a TATA box, usually located about 25–30 bp 5' of the transcription start site. However, many genes lack a TATA box and instead contain an initiator element spanning the transcription start site. So-called null basal promoters exist that contain neither a TATA box nor an initiator element, and some basal promoters that contain one or the other also contain additional protein binding sites for general transcription factors (Carey and Smale 2000; Ohler and Niemann 2001). A gene may have more than one basal promoter, each of which initiates transcription at a distinct position (fig. 2J and K), and both TATA and TATA-less basal promoters can be associated with alternate start sites of the same gene (Goodyer et al. 2001). The functional consequences of differences in basal promoter structure are not well understood, although genes with TATA-less basal promoters may generally be transcribed constitutively at relatively low levels (Pugh 2001). A key early step in transcriptional initiation is attachment of TATA-binding protein (TBP) to DNA (Jackson-Fisher et al. 1999; Kuras and Struhl 1999). In promoters lacking TATA boxes, proteins that associate with other basal promoter motifs facilitate TBP association with DNA in a sequence-independent manner. Once TBP binds, several TBP-associated factors (TAFs) guide the RNA polymerase II holoenzyme complex onto the DNA (fig. 1B). This step, which can be positively or negatively modulated by transcription factors bound at other sites, is one of the most important points of transcriptional regulation (Latchman 1998; Lee and Young 2000; Lemon and Tjian 2000).
3.2.3 The Start Site of Transcription Varies in Both Sequence and Position
The start site of transcription, unlike the start site of translation, does not require a specific sequence motif and cannot be identified from sequence data. After the RNA polymerase II holoenzyme complex assembles onto DNA, a second contact is established ~30 bp downstream. This second contact point is the start site of transcription. It is thus the physical size of the transcriptional machinery, together with the particular composition of binding sites that facilitate its binding to the basal promoter, that determines where transcription begins (fig. 1B). Spacing between the start sites of transcription and translation differs considerably among genes, ranging from ~10¹ to 10⁴ bp; the 5' untranslated region (UTR) can also contain introns that alter its length post-transcriptionally. The functional consequences of differences in 5' UTR length are not well understood.
3.2.4 Basal Promoters Provide Limited Transcriptional Activity and Specificity
By itself, a basal promoter initiates transcription at a very low rate, even when the local chromatin is suitably decondensed (Jackson-Fisher et al. 1999; Kuras and Struhl 1999; Lemon and Tjian 2000). Furthermore, most of the proteins that bind to basal promoter motifs are ubiquitously expressed and therefore provide little regulatory specificity (Carey and Smale 2000; Lee and Young 2000; Lemon and Tjian 2000). These proteins are known as general transcription factors. A few tissue-specific isoforms of these proteins are known, however, and may exert some degree of transcriptional regulation (Holstege et al. 1998; Smale et al. 1998). Additional mechanisms of transcriptional regulation involving the basal promoter are discussed later (see section 3.3.6).
3.2.5 Specificity of Transcription Is Controlled by Proteins that Bind to Discrete, Idiosyncratic Sites
Producing functionally significant levels of mRNA requires the sequence-specific association of transcription factors with DNA sequences outside the basal promoter (Weinzierl 1999; Carey and Smale 2000; Lemon and Tjian 2000). The composition and organization of these transcription factor binding sites vary enormously among genes (fig. 2). The nucleotide sequences of these binding sites determine which transcription factors are capable of associating with the promoter of a given gene. Which transcription factors actually do so depends on which of them are present in the nucleus in an active form and, in many cases, on the presence of cofactors as well (Locker 2001). The complement of active transcription factors within the nucleus differs during the course of development, in response to environmental conditions, across regions of the organism, and among cell types (Latchman 1998; Davidson 2001). This changing array of transcription factors provides nearly all of the control over when, where, at what level, and under what circumstances a particular gene is transcribed. Thus, the genetic basis for the expression profile of each gene resides in part within its promoter and in part within the many other segments of the genome that encode specific transcription factors that bind to the promoter.
3.3 Transcription Factor Binding Sites
The composition and configuration of transcription factor binding sites near a gene are major determinants of its expression profile, and they therefore constitute an important class of sequences that are potential targets of natural selection on gene expression.
3.3.1 Promoters Contain Numerous Transcription Factor Binding Sites
Identifying genuine binding sites is not straightforward for a variety of reasons (see sections 3.3.3 and 5.2; Weinzierl 1999; Carey and Smale 2000). It is difficult to be certain that all functional binding sites within a promoter have been identified, and it is prudent to assume that some binding sites remain uncharacterized even within well-studied promoters. Because of this uncertainty, the range and average number of binding sites found in a typical promoter are not known, much less any correlations between these parameters and the nature of the gene product or mode of expression. Nonetheless, a perusal of well-characterized eukaryotic promoters suggests that numbers on the order of 10–50 binding sites for 5–15 different transcription factors are not unusual (for examples, see Arnone and Davidson [1997] and Wilkins [2002]).
3.3.2 Transcription Factor Binding Sites Are Distributed Sparsely and Unevenly
Binding sites typically comprise a minority of the nucleotides within a promoter region. This fraction ranges from 10% to 20% within relatively well-studied regulatory regions (table 1, fig. 3). These regions are often interspersed with regions that contain no binding sites (fig. 2). Disjunct regulatory regions often produce discrete portions of the total transcription profile (see section 3.5.4). Nucleotides that do not affect the specificity of transcription factor binding are generally assumed to be nonfunctional with respect to transcription. In some cases, however, these nucleotides may influence the local conformation of DNA, with direct consequences for protein binding (e.g., Naylor and Clark 1990; Hizver et al. 2001; Rothenburg et al. 2001). Spacing between binding sites varies enormously, from partial overlap to tens of kilobases (figs. 2 and 3). Functional constraints on binding site spacing are often related to protein interactions that take place during DNA binding (see section 3.5.2).
3.3.3 Transcription Factor Binding Sites Are Short and Imprecise
Because of the way transcription factors interact with DNA, several different criteria are used to define binding sites. (1) Physical contact versus binding specificity. The segment of DNA protected from nuclease digestion by a transcription factor (its "footprint") is typically wider than the nucleotides that confer binding specificity (its binding site). Most transcription factor binding sites span 5–8 bp (table 2), whereas footprints are typically 10–20 bp. (2) Single versus multiple sequences. Most binding sites can tolerate at least one, and often more, specific nucleotide substitution without completely losing functionality (Latchman 1998; Courey 2001). This is evident from comparing different binding sites known to bind the same transcription factor and from in vitro assays of protein-DNA binding (see section 3.4.4; for examples within a single promoter, see fig. 3). The full range of sequences (in practice, often poorly understood) that can bind a particular transcription factor with significantly higher specificity than random DNA under physiological conditions is often described by a position weight matrix, in which the probability that each position in the binding site will be represented by a particular nucleotide is tabulated. When binding site matrices are factored in, the number of nucleotides required for specific protein binding drops to about 4–6 bp for a typical binding site (table 2). Although binding site matrices are generally composed of related sequences, some transcription factors bind to rather different sequences in association with different binding partners (e.g., jun/jun, fos/jun, CRE-BP1/jun dimers: Latchman 1998; Fairall and Schwabe 2001). (3) Informatic versus functional consensus. The term consensus sequence refers to the single "best" variant of the binding site matrix or to a degenerate sequence that captures most of the binding site matrix (table 2). Two rather different criteria are used to define consensus sequences: sequence comparisons (most commonly, simply the average sequence of multiple instances of binding sites for the same protein) and biochemical assays (the single variant with the highest affinity for the protein in vitro).
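The idea of a position weight matrix and an informatic consensus can be made concrete with a short Python sketch. It is an illustration only: the aligned sites are invented for the example, the function names are hypothetical, and the pseudocount and uniform background frequency of 0.25 are simplifying assumptions rather than values taken from any study cited here.

```python
import math

BASES = "ACGT"

def build_pwm(sites, pseudocount=1.0):
    """Tabulate per-position nucleotide probabilities from aligned sites."""
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    pwm = []
    for i in range(length):
        # Pseudocounts (an assumption) avoid zero probabilities for unseen bases.
        counts = {b: pseudocount for b in BASES}
        for s in sites:
            counts[s[i].upper()] += 1
        total = sum(counts.values())
        pwm.append({b: counts[b] / total for b in BASES})
    return pwm

def consensus(pwm):
    """Informatic consensus: the most probable base at each position."""
    return "".join(max(BASES, key=lambda b: col[b]) for col in pwm)

def log_odds_score(pwm, seq, background=0.25):
    """Sum of log2(matrix probability / background) over all positions."""
    return sum(math.log2(col[base] / background)
               for col, base in zip(pwm, seq.upper()))

# Hypothetical aligned binding sites for a single factor (illustrative only).
sites = ["TAATCC", "TAATCT", "TAATTA", "GAATCC", "TAAGCC"]
pwm = build_pwm(sites)
best = consensus(pwm)
```

Scoring a candidate sequence with `log_odds_score` gives a positive value when it resembles the tabulated sites more than random DNA does. Such a score describes sequence similarity only and says nothing about whether the site binds protein in vivo.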
3.3.4 Many Potential Binding Sites Are Nonfunctional
Given that there are many different transcription factors with different binding matrices, and given that binding sites are short and imprecise, every kilobase of genomic DNA contains many dozens of potential transcription factor binding sites on the basis of random similarity (Carroll, Grenier, and Weatherbee 2001; Stone and Wray 2001). For a variety of reasons (fig. 4), many of these consensus matches do not bind protein in vivo and have no influence on transcription (Biggin and McGinnis 1997; Weinzierl 1999; Li and Johnston 2001). Identifying the potential binding sites that actually bind protein requires biochemical and experimental tests (see sections 5.2 and 5.3).
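A rough sense of how often chance matches arise can be obtained with a small calculation, sketched below in Python. It assumes equal frequencies of the four nucleotides and independence among positions, which real genomes violate. The function names and the example consensus strings are hypothetical. Degenerate positions are written with standard IUPAC ambiguity codes.

```python
# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def match_probability(consensus):
    """Probability that a random window matches a degenerate consensus,
    assuming equal base frequencies and independent positions."""
    p = 1.0
    for symbol in consensus.upper():
        p *= len(IUPAC[symbol]) / 4.0
    return p

def expected_chance_matches(consensus, region_length=1000, both_strands=True):
    """Expected number of chance matches in a region of the given length.
    Counting both strands doubles the number of windows examined; a
    palindromic motif is therefore counted twice."""
    windows = max(region_length - len(consensus) + 1, 0)
    strands = 2 if both_strands else 1
    return strands * windows * match_probability(consensus)

# A fully specified 6-bp site versus a 6-bp site with two degenerate positions.
strict = expected_chance_matches("TAATCC")
loose = expected_chance_matches("TAATYN")
```

Under these assumptions a single fully specified 6-bp motif is expected by chance about once every 2 kb when both strands are counted, and each degenerate position raises that frequency. Summing over many factors with short, imprecise matrices is what produces large numbers of potential sites per kilobase.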
3.3.5 Variants Within a Binding Site Matrix Can Differ Functionally
Although most transcription factors can bind to several distinct sequences, they may do so with different kinetics (Czerny, Schaffner, and Busslinger 1993; Carey and Smale 2000). Differences in binding affinities are particularly important when two binding sites overlap physically or are located very near each other, because only one binding site can be occupied by protein at a time (fig. 3A: Otx, Z, and CG binding sites). In such cases, differences in protein concentrations and binding kinetics will determine which binding site is occupied most of the time. Differences in kinetics can also be important for binding sites not near each other, because active promoters compete for a single pool of transcription factors within each nucleus and there are typically fewer transcription factors present than there are binding sites in a genome.
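The competition between two overlapping sites can be illustrated with a simple equilibrium model, sketched here in Python. This is a simplification introduced for illustration and is not drawn from the studies cited above: it assumes that the two sites are strictly mutually exclusive, that free protein concentrations are fixed, and that binding has reached equilibrium, so that each state is weighted by concentration divided by dissociation constant. The function name and the numerical values are hypothetical.

```python
def exclusive_occupancy(conc_a, kd_a, conc_b, kd_b):
    """Fraction of time a pair of overlapping sites is empty, occupied by
    factor A, or occupied by factor B, when only one can bind at a time.

    Concentrations and dissociation constants must share the same units.
    """
    weight_a = conc_a / kd_a          # statistical weight of the A-bound state
    weight_b = conc_b / kd_b          # statistical weight of the B-bound state
    partition = 1.0 + weight_a + weight_b  # the empty state has weight 1
    return {
        "empty": 1.0 / partition,
        "a": weight_a / partition,
        "b": weight_b / partition,
    }

# Hypothetical values: equal concentrations, but B binds tenfold more tightly.
occ = exclusive_occupancy(conc_a=1e-9, kd_a=1e-9, conc_b=1e-9, kd_b=1e-10)
```

In this example the tighter-binding factor holds the site most of the time. Raising the concentration of the weaker-binding factor tenfold would equalize the two occupancies, which shows how protein concentration and affinity jointly determine which site is occupied.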
3.3.6 Transcription Factor Binding Sites Occupy a Wide Range of Positions Relative to the Transcription Unit
Although transcription factor binding sites sometimes occupy a single, discrete region near the start site of transcription (fig. 2A–E), in many cases they are dispersed into several distinct clusters (fig. 2I, K–L, and P). The physical extent of cis-regulatory regions varies by nearly three orders of magnitude, from a few hundred base pairs to >100 kb (fig. 2). An extreme example of physical dispersion is a regulatory module of the Shh locus in humans and mice that lies ~800 kb distant from the start site of transcription (Lettice et al. 2002). The position of transcription factor binding sites relative to the transcription unit also differs enormously among genes. They often lie within a few kb 5' of the basal promoter (fig. 2A–G), but they can occupy a wide range of other positions: >30 kb 5' of the basal promoter (e.g., Ubx in D. melanogaster: Simon et al. 1990; Pax-6 in mouse: Kammandel et al. 1999; APOB in humans: Nielsen et al. 1998); within the 5' UTR (Scr in D. melanogaster: Calhoun, Stathopoulos, and Levine 2002); within introns (Otx in the sea urchin Strongylocentrotus purpuratus: Yuh et al. 2002; CCR5 in humans: Bamshad et al. 2002); >30 kb 3' of the transcription unit (BMP5 in mouse: DiLeone, Russell, and Kingsley 1998); and, in rare instances, even within a coding exon (keratin 18 in humans: Neznanov, Umezawa, and Oshima 1997; nonA in Drosophila: Sandrelli et al. 2001). This diversity of positions is possible because DNA looping allows interaction between proteins associated with DNA at distant binding sites (fig. 1B) (see section 3.4.5). Binding sites may even lie on the far side of an adjacent locus (fig. 2O). The position of binding sites for some transcription factors may be functionally constrained. For instance, CCAAT binding sites for the transcription factor CBP (CCAAT-binding protein) are generally located 50–100 bp 5' of the transcription start site, and those for Sp1 are often located near the basal promoter of many mammalian genes. For most transcription factors, however, binding sites lack any obvious spatial restriction relative to other features of the locus. In general, the functional consequences of binding site position are poorly understood.
3.3.7 Specific Sequences Limit the Regulatory Influence of Binding Sites
Because binding sites can interact with basal promoters that are tens or even hundreds of kilobases distant, they are potentially able to influence transcription at more than one locus. At least three mechanisms spatially restrict this influence. (1) Insulator sequences. Some, and perhaps many, promoters are bounded by insulator sequences (also known as boundary elements) (Wolffe 1994; Bell and Felsenfeld 1999; Dillon and Sabbattini 2000). Mechanisms of insulator function are not well understood but appear to involve chromatin modulation (Bell and Felsenfeld 1999). (2) Basal promoter selectivity. Some regulatory sequences interact preferentially with TATA or TATA-less basal promoters, even if a basal promoter of the other kind is closer to them (Ohtsuki, Levine, and Cai 1998). (3) Selective tethering. Sequences immediately 5' of a basal promoter may help selectively recruit transcription factor complexes bound at distant sites. For instance, an activator module (enhancer) located close to the ftz locus in Drosophila associates only with the more distant basal promoter of Scr (fig. 2O) (Calhoun, Stathopoulos, and Levine 2002).
3.3.8 Some Binding Sites Affect Transcription at More than One Locus
Although most binding sites directly influence the expression of just one gene, many exceptions are known. One manifestation is a "divergent promoter," where binding sites regulate transcription of paralogous loci that lie on opposite strands of DNA with their 5' ends centrally located (fig. 2M). Binding site "sharing," or cross-regulation, of adjacent loci also occurs in other contexts: paralogs that are transcribed convergently (fig. 2N) or in parallel (e.g., beta-globin: Grosveld et al. 1993; Hox complex: Ohtsuki, Levine, and Cai 1998; Kmita, Kondo, and Duboule 2000) and even among genealogically unrelated loci that lie near each other (fig. 2J and O). Single mutations in single binding sites may affect the transcription of more than one gene. In humans, for example, segregating variants are known that simultaneously influence transcription of the genes encoding beta-globin and gamma-globin (Metherall, Gillespie, and Forget 1988; Grosveld et al. 1993), the insulin and IGF2 genes (Paquette et al. 1998), and the APOA1 and APOCIII genes (Li et al. 1995; Naganawa et al. 1997). In the last case (fig. 2O), nucleotide variants have distinct effects on each locus: the rare haplotype downregulates APOA1 in the colon but upregulates APOCIII in the liver. Cross-regulation may be the reason for the long-term physical linkage of genes in the Hox complexes of animals (Lufkin 2001). The general prevalence of cross-regulation remains uncertain (Bonifer 2000). Even where cross-regulation is known to occur, however, the involved loci are sometimes each regulated by unique regulatory sequences as well as shared ones, providing some degree of differential regulation.
3.4 Transcription Factors
The transcription of every gene is regulated by transcription factors and cofactors that interact with its promoter. The distant and dispersed regions of the genome that encode these proteins constitute a second important class of sequences that are potentially the target of natural selection on the transcription profile of a particular gene.
3.4.1 Transcription Factors Belong to a Relatively Small Number of Gene Families
Most transcription factors belong to gene families (Latchman 1998; Locker 2001). The size of each transcription factor gene family differs considerably among genomes (table 3), but the reasons for and functional consequences of these differences are not understood. Existing paralogs are the result of duplications that occurred across a wide range of times, from before the divergence of eukaryotic kingdoms to much more recently (Duboule 1994; Bharathan et al. 1997; Dailey and Basilico 2001; Stauber, Prell, and Schmidt-Ott 2002). There are approximately 12 to 15 structurally distinct DNA-binding domains known from eukaryotic transcription factors (Harrison 1991; Fairall and Schwabe 2001). For intensively studied organisms, the known transcription factor families may constitute a nearly complete list. Far less is known about the diversity and evolutionary history of transcription cofactors, proteins that bind to transcription factors but not to DNA (fig. 1B; see the following section and section 3.4.5).
3.4.2 Transcription Factors Are Structurally and Functionally Modular Proteins
Most transcription factors contain several distinct functional domains. These may include almost any combination of the following. (1) DNA-binding domains. The amino acids that comprise the DNA binding region may be contiguous (e.g., homeodomain, MADS box) or dispersed within the primary sequence (e.g., Zn-fingers). Some transcription factors contain two distinct DNA binding regions (e.g., many Pax family members contain both a homeodomain and a paired-box domain). (2) Protein-protein interaction domains. Transcription factors engage in a variety of interactions with other proteins (see section 3.4.5). Most transcription factors contain from one to several such domains. Interaction domains, which generally are more difficult to recognize from sequence inspection than DNA binding domains, include leucine zippers and the pentapeptide motif of homeodomain proteins (Latchman 1998). (3) Domains that act as intracellular trafficking signals. Many transcription factors contain a nuclear localization signal. In some cases, the activity of a transcription factor may be modulated by controlling the ratio of cytoplasmic-to-nuclear localization (e.g., Exd: Abu-Shaar, Ryoo, and Mann 1999). (4) A ligand-binding domain. Some transcription factors can bind ligands, such as specific steroid hormones, that modulate their activity. Most known cases belong to the nuclear receptor family (Benecke, Gaudon, and Gronemeyer 2001), but an unrelated Ca2+-binding transcription factor has recently been discovered (Carrión et al. 1999).
Many protein-DNA binding domains predate the divergence of plants and animals (e.g., homeodomain: Bharathan et al. 1997), as do some protein-protein interaction domains (Bürglin 1997). The evolutionary history of transcription factor gene families includes many examples of "domain shuffling" and loss of specific domains. For instance, a paralog may retain a DNA-binding domain but lose a protein-protein interaction domain responsible for transcriptional activation; the resulting protein will function as a repressor if it competes for binding sites with a paralog that contains an activation domain (e.g., Sp family: Suske 1999). Transcription cofactors, by definition, lack a DNA-binding domain, but they typically contain domains that mediate a specific protein-protein association with a transcription factor and directly or indirectly interact with effector complexes (either the transcriptional machinery or chromatin remodeling complexes).
3.4.3 Transcription Factor Structure Determines DNA Binding Specificity
The DNA binding domain of most transcription factors is a short motif, most commonly an alpha helix but sometimes a beta-strand or a less organized loop, that inserts into the major groove of double-stranded DNA (Choo and Klug 1997; Jones et al. 1999; Fairall and Schwabe 2001). A single amino acid substitution within the binding domain can alter binding specificity (Treisman et al. 1989; Mathias et al. 2001). DNA binding domains are often highly conserved evolutionarily (Duboule 1994; Dailey and Basilico 2001), although functional polymorphisms that lead to differences in binding kinetics are known (e.g., Brickman et al. 2001). Sequence-specific protein-DNA contacts rarely extend across more than 5 bp, and for some motifs, such as Zn-fingers, they extend only 3 bp. The extent of this physical interaction is not sufficient to provide much sequence specificity, as a given 5-bp sequence occurs on average every 1,024 bp. Three structural features can increase DNA binding specificity (Latchman 1998; Fairall and Schwabe 2001): (1) multiple DNA binding domains can exist within a single transcription factor (e.g., most Pax family members contain both paired-box and homeodomain DNA binding domains, whereas all Zn-finger transcription factors contain multiple Zn-fingers); (2) additional structural features can bind nearby nucleotides through minor groove contacts (e.g., many homeodomain and GATA factors); and (3) binding to DNA may require homodimerization or heterodimerization (e.g., myc/mad/max, fos/jun, and most nuclear receptor family members). All three structural features effectively increase the number of specific nucleotides required for efficient binding and typically involve noncontiguous nucleotides within promoters (table 2).
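The arithmetic behind the specificity estimate above, and the gain from recognizing additional nucleotides, can be checked with a few lines of Python. The sketch assumes equal frequencies of the four nucleotides, independent positions, and a single strand. The function name is hypothetical, and the 10-bp case simply stands in for a dimer that recognizes two 5-bp half-sites.

```python
def expected_spacing(specified_bp):
    """Average distance in bp between chance occurrences of one fully
    specified sequence, assuming equal base frequencies on a single strand."""
    return 4 ** specified_bp

# One 5-bp contact region, as discussed in the text.
single_domain = expected_spacing(5)

# A 3-bp contact, as for an individual Zn-finger.
zn_finger = expected_spacing(3)

# Two 5-bp half-sites recognized together (e.g., by a dimer): 10 specified bp.
dimer = expected_spacing(10)

# Fold increase in specificity gained by requiring the second half-site.
fold_gain = dimer // single_domain
```

Under these assumptions, doubling the number of specified nucleotides from 5 to 10 reduces the frequency of chance matches roughly a thousandfold, which is why multiple binding domains and dimerization contribute so strongly to specificity.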
3.4.4 Transcription Factors Bind to More than One Sequence, Although They Do So with Different Affinities
Transcription factors bind relatively tightly to double-stranded DNA (Kd is typically in the range of 10⁻⁹ to 10⁻¹⁰ M), with a high degree of sequence specificity (Biggin and McGinnis 1997; Carey and Smale 2000). Because of their sequence specificity and binding kinetics, and because many potential target sites are present in a genome, eukaryotic transcription factors need to be present in copy numbers of ~5–20 × 10³ per nucleus in order to bind efficiently (Dröge and Müller-Hill 2001). Although they associate in a sequence-specific manner, most transcription factors bind a range of motifs rather than a single one (see section 3.3.3). The extent of this binding site matrix differs considera



