In this chapter, we discuss general principles of gene structure and expression as well as mechanisms underlying the regulation of tissue-specific and inducible gene expression. We will see that proteins (transcription factors) control gene transcription by interacting with regulatory elements in DNA (e.g., promoters and enhancers). Because many transcription factors are effector molecules in signal-transduction pathways, these transcription factors can coordinately regulate gene expression in response to physiological stimuli. Finally, we describe the important roles of epigenetic and post-transcriptional regulation of gene expression. Because many of the proteins and DNA sequences are known by abbreviations, the Glossary at the end of the chapter identifies these entities.
The haploid human genome contains 20,000 to 30,000 distinct genes, but only about one third of these genes are actively translated into proteins in any individual cell. Cells from different tissues have distinct morphological appearances and functions and respond differently to external stimuli, even though their DNA content is identical. For example, although all cells of the body contain an albumin gene, only liver cells (hepatocytes) can synthesize and secrete albumin into the bloodstream. Conversely, hepatocytes cannot synthesize insulin, which pancreatic β cells produce. The explanation for these observations is that expression of genes is regulated so that some genes are active in hepatocytes and others are silent. In pancreatic β cells, a different set of genes is active; others, such as those expressed only in the liver, are silent. How does the organism program one cell type to express liver-specific genes, and another to express a set of genes appropriate for the pancreas? This phenomenon is called tissue-specific gene expression.
A second issue is that genes in individual cells are generally not expressed at constant, unchanging levels (constitutive expression). Rather, their expression levels often vary widely in response to environmental stimuli. For example, when blood glucose levels decrease, α cells in the pancreas secrete the hormone glucagon (see pp. 1050–1053). Glucagon circulates in the blood until it reaches the liver, where it causes a 15-fold increase in expression of the gene that encodes phosphoenolpyruvate carboxykinase (PEPCK), an enzyme that catalyzes the rate-limiting step in gluconeogenesis (see p. 1051). Increased gluconeogenesis then contributes to restoration of blood glucose levels toward normal. This simple regulatory loop, which necessitates that the liver cells perceive the presence of glucagon and stimulate PEPCK gene expression, illustrates the phenomenon of inducible gene expression.
The “central dogma of molecular biology” states that genetic information flows unidirectionally from DNA to proteins. DNA is a polymer of nucleotides, each containing a nitrogenous base (adenine, A; guanine, G; cytosine, C; or thymine, T) attached to deoxyribose 5′-phosphate. The polymerized nucleotides form a polynucleotide strand in which the sequence of the nitrogenous bases constitutes the genetic information. With few exceptions, all cells in the body share the same genetic information. Hydrogen-bond formation between bases (A and T, or G and C) on the two complementary strands of DNA produces a double-helical structure.
DNA has two functions. The first is to serve as a self-renewing data repository that maintains a constant source of genetic information for the cell. This role is achieved by DNA replication, which ensures that when cells divide, the progeny cells receive exact copies of the DNA. The second purpose of DNA is to serve as a template for the translation of genetic information into proteins, which are the functional units of the cell. This second purpose is broadly defined as gene expression.
Gene expression involves two major processes (Fig. 4-1). The first process, transcription, is the synthesis of RNA from a DNA template, mediated by an enzyme called RNA polymerase II. The resultant RNA molecule is identical in sequence to one of the strands of the DNA template except that the base uracil (U) replaces thymine (T). The second process, translation, is the synthesis of protein from RNA. During translation, the genetic code in the sequence of RNA is “read” by transfer RNA (tRNA), and then amino acids carried by the tRNA are covalently linked together to form a polypeptide chain. In eukaryotic cells, transcription occurs in the nucleus, whereas translation occurs on ribosomes located in the cytoplasm. Therefore, an intermediary RNA, called messenger RNA (mRNA), is required to transport the genetic information from the nucleus to the cytoplasm. The complete process, proceeding from DNA in the nucleus to protein in the cytoplasm, constitutes gene expression.
Although the central dogma of molecular biology applies to most protein-coding genes, exceptions exist. For example, RNA viruses (such as the human immunodeficiency virus [HIV] that causes acquired immunodeficiency syndrome) contain their genetic information in the sequence of an RNA genome. Upon infection with HIV, the RNA genome is “reverse transcribed” into double-stranded DNA that then integrates into the host DNA genome. Transcription of the virally encoded DNA by the host transcriptional machinery produces RNA molecules that become part of new HIV particles. In addition, cells transcribe some genes into RNAs that do not encode proteins. These so-called noncoding RNAs include ribosomal RNAs (rRNAs) and transfer RNAs (tRNAs) that participate in protein translation, small nuclear RNAs (snRNAs) that are involved in RNA splicing, and microRNAs (miRNAs) that regulate mRNA abundance and translation (see pp. 99–100).
Figure 4-2 depicts the structure of a typical eukaryotic protein-coding gene. The gene consists of a segment of DNA that is transcribed into RNA. It extends from the site of transcription initiation to the site of transcription termination. The region of DNA that is immediately adjacent to and upstream (i.e., in the 5′ direction) from the transcription initiation site is called the 5′ flanking region. The corresponding domain that is downstream (3′) to the transcription termination site is called the 3′ flanking region. (Recall that DNA strands have directionality because of the 5′ to 3′ orientation of the phosphodiester bonds in the sugar-phosphate backbone of DNA. By convention, the DNA strand that has the same sequence as the RNA is called the coding strand, and the complementary strand is called the noncoding strand. The 5′ to 3′ orientation refers to the coding strand.) Although the 5′ and 3′ flanking regions are not transcribed into RNA, they frequently contain DNA sequences, called regulatory elements, that control gene transcription. The site where transcription of the gene begins, sometimes called the cap site, may have a variant of the nucleotide sequence 5′-ACTT(T/C)TG-3′ (called the cap sequence), where T/C means T or C. The A is the transcription initiation site. Transcription proceeds to the transcription termination site, which has a less defined sequence and location in eukaryotic genes. Slightly upstream from the termination site is another sequence called the polyadenylation signal, which often has the sequence 5′-AATAAA-3′.
The RNA that is initially transcribed from a gene is called the primary transcript (see Fig. 4-2) or precursor mRNA (pre-mRNA). Before it can be translated into protein, the primary transcript must be processed into a mature mRNA in the nucleus. Most eukaryotic genes contain exons, DNA sequences that are present in the mature mRNA, alternating with introns, which are not present in the mRNA. The primary transcript is colinear with the coding strand of the gene and contains the sequences of both the exons and the introns. To produce a mature mRNA that can be translated into protein, the cell must process the primary transcript in four steps.
First, the cell adds an unusual guanosine base, which is methylated at the 7 position, via a 5′-5′ phosphodiester bond to the 5′ end of the transcript. The result is a 5′ methyl cap. The presence of the 5′ methyl cap is required for export of the mRNA from the nucleus to the cytoplasm as well as for translation of the mRNA.
Second, the cell removes the sequences of the introns from the primary transcript via a process called pre-mRNA splicing. Splicing involves the joining of the sequences of the exons in the RNA transcript and the removal of the intervening introns. As a result, mature mRNA (see Fig. 4-2) is shorter and not colinear with the coding strand of the DNA template.
The third processing step is cleavage of the RNA transcript about 20 nucleotides downstream from the polyadenylation signal, near the 3′ end of the transcript.
The fourth step is the addition of a string of 100 to 200 adenine bases at the site of the cleavage to form a poly(A) tail. This tail contributes to mRNA stability.
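The four processing steps can be illustrated with a small Python sketch. It is a toy model under stated assumptions, not a description of the cellular machinery: the function name `process_pre_mrna`, the exon coordinates, the `m7G-` text prefix standing in for the 5′ methyl cap, and the choice to measure the ~20-nucleotide cleavage distance from the end of the polyadenylation signal are all hypothetical. The default tail length of 150 is simply a value within the 100 to 200 range given above.

```python
def process_pre_mrna(pre_mrna, exons, cleavage_offset=20, tail_length=150):
    """Toy model of pre-mRNA processing (all parameters are illustrative).

    pre_mrna: primary transcript as a string of A, C, G, U
    exons:    list of (start, end) index pairs into pre_mrna (end exclusive)
    """
    # Step 3: cleave about `cleavage_offset` nt downstream of the AAUAAA signal
    signal = pre_mrna.rfind("AAUAAA")
    if signal == -1:
        raise ValueError("no polyadenylation signal found")
    cleavage_site = min(signal + len("AAUAAA") + cleavage_offset, len(pre_mrna))
    cleaved = pre_mrna[:cleavage_site]

    # Step 2: splicing joins the exons and discards the introns
    spliced = "".join(cleaved[start:end] for start, end in exons)

    # Step 1 (cap, shown as a text label) and step 4 (poly(A) tail)
    return "m7G-" + spliced + "A" * tail_length


# Hypothetical two-exon transcript: exon 1, one intron, exon 2
exon1 = "GGCAUGGCU"
intron = "GUAAGUCCCAG"
exon2 = "UGGUAAGC" + "AAUAAA" + "C" * 30
pre_mrna = exon1 + intron + exon2
exons = [(0, len(exon1)), (len(exon1) + len(intron), len(pre_mrna))]

mature = process_pre_mrna(pre_mrna, exons)
print(len(pre_mrna), len(mature))
```

In the sketch the steps are applied in a convenient computational order rather than the order in which they occur in the nucleus; the output is the same because the exon coordinates refer to the primary transcript.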
The mature mRNA produced by RNA processing not only contains a coding region—the open-reading frame—that encodes protein but also sequences at the 5′ and 3′ ends that are not translated into protein—the 5′ and 3′ untranslated regions (UTRs), respectively. Translation of the mRNA on ribosomes always begins at the codon AUG, which encodes methionine, and proceeds until the ribosome encounters one of the three stop codons (UAG, UAA, or UGA). Thus, the 5′ end of the mRNA is the first to be translated and provides the N terminus of the protein; the 3′ end is the last to be translated and contributes the C terminus.
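As a minimal sketch of these rules, the following Python code transcribes a coding strand into mRNA (T replaced by U) and then translates the open reading frame from the first AUG to the first in-frame stop codon. The function names and the example sequence are hypothetical; the codon table is the standard genetic code, written compactly as a 64-letter string.

```python
from itertools import product

# Standard genetic code, codons ordered with bases U, C, A, G; '*' marks a stop codon
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_strand):
    """mRNA has the same sequence as the coding strand, with U in place of T."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna):
    """Translate from the first AUG until a stop codon (UAA, UAG, or UGA)."""
    start = mrna.find("AUG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Hypothetical sequence: 5' UTR "GGC", ORF "ATG GCT TGG TAA", 3' UTR "GC"
mrna = transcribe("GGCATGGCTTGGTAAGC")
print(mrna)             # GGCAUGGCUUGGUAAGC
print(translate(mrna))  # MAW
```

The first residue of the output (M, methionine) corresponds to the N terminus and the last to the C terminus; the bases before AUG and after the stop codon play the role of the 5′ and 3′ UTRs.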
Although DNA is commonly depicted as linear, chromosomal DNA in the nucleus is actually organized into a higher-order structure called chromatin. This packaging is required to fit DNA with a total length of ~1 m into a nucleus with a diameter of 10⁻⁵ m. Chromatin consists of DNA associated with histones and other nuclear proteins. The basic building block of chromatin is the nucleosome (Fig. 4-3), each of which consists of a protein core and 147 base pairs (bp) of associated DNA. The protein core is an octamer of the histones H2A, H2B, H3, and H4. DNA wraps twice around the core histones to form a solenoid-like structure. A linker histone, H1, associates with segments of DNA between nucleosomes. Regular arrays of nucleosomes have a beads-on-a-string appearance and constitute the so-called 11-nm fiber of chromatin, which can condense to form the 30-nm fiber.
Chromatin exists in two general forms that can be distinguished cytologically by their different degrees of condensation. Heterochromatin is a highly condensed form of chromatin that is transcriptionally inactive. In general, highly organized chromatin structure is associated with repression of gene transcription. Heterochromatin contains mostly repetitive DNA sequences and relatively few genes. Euchromatin has a more open structure and contains genes that are actively transcribed. Even in the transcriptionally active “open” euchromatin, local chromatin structure may influence the activity of individual genes.
Gene expression involves eight steps (Fig. 4-4):
Step 1: Chromatin remodeling. Before a gene can be transcribed, some local alteration in chromatin structure must occur so that the enzymes that mediate transcription can gain access to the genomic DNA. The alteration in chromatin structure is called chromatin remodeling, which may involve loosening of the interaction between histones and DNA, repositioning of nucleosomes, or local depletion of histones.
Step 2: Initiation of transcription. In this step, RNA polymerase is recruited to the gene promoter and begins to synthesize RNA that is complementary in sequence to one of the strands of the template DNA. For most eukaryotic genes, initiation of transcription is the critical, rate-limiting step in gene expression.
Step 3: Transcript elongation. During transcript elongation, RNA polymerase proceeds down the DNA strand and sequentially adds ribonucleotides to the elongating strand of RNA.
Regulation of elongation appears to be critical for the expression of certain genes, such as some genes of HIV-1, the causative agent of acquired immunodeficiency syndrome (AIDS). HIV-1 is a retrovirus (RNA virus) that preferentially infects cells of the immune system. After infection, the RNA viral genome is “reverse” transcribed into double-stranded DNA, which integrates into the host genome. A viral promoter that is located in the long terminal repeat of the viral genome then drives expression of the viral genes. Immediately downstream from the promoter, and within the 5′ untranslated region, is a regulatory element known as the trans-activation response element (TAR). Unlike DNA regulatory elements such as promoters and enhancers, this element is active in transcribed RNA. The sequence of TAR contains an inverted repeat, and a stretch of nucleotides on one part of the TAR pairs with nucleotides on the other part to create a hairpin structure in this viral transcript (eFig. 4-1). Because the inverted repeat is imperfect, the hairpin contains a “bulge.” Elongation of transcription cannot occur unless a virally encoded protein called Tat binds to this bulge in the TAR portion of the RNA transcript. In the absence of Tat, transcription initiates but elongation does not proceed past the TAR; the resulting truncated transcripts do not encode proteins. In the presence of Tat, Pol II can read through the TAR and elongation proceeds normally, producing full-length RNA. It appears that the function of TAR is to recruit Tat to the promoter. Tat, in turn, associates with P-TEFb, a kinase that phosphorylates the C-terminal domain (CTD) of Pol II (see p. 85 and Fig. 4-10) and stimulates transcription elongation.
Step 4: Termination of transcription. After producing a full-length RNA, the enzyme halts elongation.
Step 5: RNA processing. As noted before, RNA processing involves (a) addition of a 5′ methylguanosine cap, (b) pre-mRNA splicing, (c) cleavage of the RNA strand, and (d) polyadenylation.
Step 6: Nucleocytoplasmic transport. The next step in gene expression is the export of the mature mRNA through pores in the nuclear envelope (see p. 21) into the cytoplasm. Nucleocytoplasmic transport is a regulated process that is important for mRNA quality control.
Step 7: Translation. The mRNA is translated into proteins on ribosomes. During translation, the genetic code on the mRNA is read by tRNA, and then amino acids carried by the tRNA are added to the nascent polypeptide chain.
Step 8: mRNA degradation. Finally, the mRNA is degraded in the cytoplasm by a combination of endonucleases and exonucleases.
Each of these steps is potentially a target for regulation (see Fig. 4-4, right panel):
Gene expression may be regulated by global as well as by local alterations in chromatin structure.
An important related alteration in chromatin structure is the state of methylation of the DNA.
Initiation of transcription can be regulated by transcriptional activators and transcriptional repressors.
Transcript elongation may be regulated by premature termination in which the polymerase falls off (or is displaced from) the template DNA strand; such termination results in the synthesis of truncated transcripts.
Pre-mRNA splicing may be regulated by alternative splicing, which generates different mRNA species from the same primary transcript.
At the step of nucleocytoplasmic transport, the cell prevents expression of aberrant transcripts, such as those with defects in mRNA processing. In addition, mutant transcripts containing premature stop codons may be degraded in the nucleus through a process called nonsense-mediated decay.
Control of translation of mRNA is a regulated step in the expression of certain genes, such as the transferrin receptor gene.
Control of mRNA stability contributes to steady-state levels of mRNA in the cytoplasm and is important for the overall expression of many genes.
Although any of these steps may be critical for regulating a particular gene, transcription initiation (step 2) is the most frequently regulated step and is the focus of this chapter. At the end of the chapter, we describe examples of epigenetic regulation of gene expression and of regulation at steps that are subsequent to the initiation of transcription (post-transcriptional regulation).
A general principle is that gene transcription is regulated by interactions of specific proteins with specific DNA sequences. The proteins that regulate gene transcription are called transcription factors. Many transcription factors recognize and bind to specific sequences in DNA. The binding sites for these transcription factors are called regulatory elements. Because they are located on the same piece of DNA as the genes that they regulate, these regulatory elements are sometimes referred to as cis-acting factors.
Figure 4-5 illustrates the overall scheme for the regulation of gene expression. Transcription requires proteins (transcription factors) that bind to specific DNA sequences (regulatory elements) located near the genes they regulate (target genes). Once the proteins bind to DNA, they stimulate (or inhibit) transcription of the target gene. A particular transcription factor can regulate the transcription of multiple target genes. In general, regulation of gene expression can occur at the level of either transcription factors or regulatory elements. Examples of regulation at the level of transcription factors include variations in the abundance of transcription factors, their DNA-binding activities, and their ability to stimulate (or to inhibit) transcription. Examples of regulation at the level of regulatory elements include alterations in chromatin structure (which influences accessibility to transcription factors) and covalent modifications of DNA, especially methylation.
Protein-coding genes are transcribed by an enzyme called RNA polymerase II (Pol II), which catalyzes the synthesis of RNA that is complementary in sequence to a DNA template. Pol II is a large protein (molecular mass of 600 kDa) comprising 10 to 12 subunits. Although Pol II catalyzes mRNA synthesis, by itself it is incapable of binding to DNA and initiating transcription at specific sites. The recruitment of Pol II and initiation of transcription require an assembly of proteins called general transcription factors. Six general transcription factors are known (TFIIA, TFIIB, TFIID, TFIIE, TFIIF, and TFIIH), each of which contains multiple subunits. These general transcription factors are essential for the transcription of all protein-coding genes, which distinguishes them from the transcription factors discussed below that are involved in the transcription of specific genes. Together with Pol II, the general transcription factors constitute the basal transcriptional machinery, which is also known as the RNA polymerase holoenzyme or preinitiation complex because its assembly is required before transcription can begin. The basal transcriptional machinery assembles at a region of DNA that is immediately upstream from the gene and includes the transcription initiation site. This region is called the gene promoter (Fig. 4-6).
In vitro, the general transcription factors and Pol II assemble in a stepwise, ordered fashion on DNA. The first protein that binds to DNA is TFIID, which induces a bend in the DNA and forms a platform for the assembly of the remaining factors. Once TFIID binds to DNA, the other components of the basal transcriptional machinery assemble spontaneously by protein-protein interactions. The next general transcription factor that binds is TFIIA, which stabilizes the interaction of TFIID with DNA. Assembly of TFIIA is followed by assembly of TFIIB, which interacts with TFIID and also binds DNA. TFIIB then recruits a preassembled complex of Pol II and TFIIF. Entry of the Pol II–TFIIF complex into the basal transcriptional machinery is followed by binding of TFIIE and TFIIH. TFIIF and TFIIH may assist in the transition from basal transcriptional machinery to an elongation complex, which may involve unwinding of the DNA that is mediated by the helicase activity of TFIIH. Although this stepwise assembly of Pol II and general transcription factors occurs in vitro, the situation in vivo may be different. In vivo, Pol II has been observed in a multiprotein complex containing general transcription factors and other proteins. This preformed complex may be recruited to DNA to initiate transcription.
The promoter is a cis-acting regulatory element that is required for expression of the gene. In addition to locating the site for initiation of transcription, the promoter also determines the direction of transcription. Perhaps somewhat surprisingly, no unique sequence defines the gene promoter. Instead, the promoter consists of modules of simple sequences (DNA elements). A common DNA element in many promoters is the Goldberg-Hogness TATA box. The TATA box has the consensus sequence 5′-GNGTATA(A/T)A(A/T)-3′, where N is any nucleotide. The TATA box is usually located ~30 bp upstream (5′) from the site of transcription initiation. The general transcription factor TFIID, a component of the basal transcriptional machinery, recognizes the TATA box, which is thus believed to determine the site of transcription initiation. TFIID itself is composed of TATA-binding protein (TBP) and at least 10 TBP-associated factors (TAFs). The TBP subunit is a sequence-specific DNA-binding protein that binds to the TATA box. TAFs are involved in the activation of gene transcription (more on this below).
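The TATA box consensus can be written as a regular expression and used to scan the coding strand for candidate boxes upstream of a known initiation site. The Python sketch below is only an illustration: the function name `find_tata_boxes`, the search window of 40 to 20 bp upstream, and the example sequence are assumptions, and a sequence match by itself does not show that a site is functional.

```python
import re

# Consensus 5'-GNGTATA(A/T)A(A/T)-3' expressed as a regular expression
TATA_CONSENSUS = re.compile(r"G[ACGT]GTATA[AT]A[AT]")


def find_tata_boxes(coding_strand, tss_index, window=(-40, -20)):
    """Return (offset, sequence) for consensus matches upstream of the TSS.

    tss_index: position of the transcription initiation site in the string
    offset:    position of the first base of the match relative to the TSS
    window:    hypothetical search range around the typical ~30 bp distance
    """
    hits = []
    for match in TATA_CONSENSUS.finditer(coding_strand.upper()):
        offset = match.start() - tss_index
        if window[0] <= offset <= window[1]:
            hits.append((offset, match.group()))
    return hits


# Hypothetical promoter: a TATA box placed 30 bp upstream of a cap sequence
promoter = "C" * 10 + "GCGTATAAAT" + "C" * 20 + "ACTTTTG"
tss = promoter.index("ACTTTTG")  # the A of the cap sequence
print(find_tata_boxes(promoter, tss))  # [(-30, 'GCGTATAAAT')]
```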
Some promoters do not contain a TATA box. Instead, these promoters contain other elements, for example, the initiator (Inr), that bind general transcription factors. In addition to the TATA box and Inr, gene promoters contain other DNA elements that are necessary for initiating transcription. These elements consist of short DNA sequences and are sometimes called promoter-proximal sequences because they are located within ~100 bp upstream from the transcription initiation site. Promoter-proximal sequences are a type of regulatory element that is required for the transcription of specific genes. Well-characterized examples include the GC box (5′-GGGCGG-3′) and the CCAAT box (5′-CCAAT-3′), as well as the CACCC box and octamer motif (5′-ATGCAAAT-3′). These DNA elements function as binding sites for additional proteins (transcription factors) that are necessary for initiating transcription of particular genes. The proteins that bind to these sites help recruit the basal transcriptional machinery to the promoter. Examples include the transcription factor NF-Y, which recognizes the CCAAT box, and Sp1 (stimulating protein 1), which recognizes the GC box. The CCAAT box is often located ~50 bp upstream from the TATA box, whereas multiple GC boxes are frequently found in TATA-less gene promoters. Some promoter-proximal sequences are present in genes that are active only in certain cell types. For example, the CACCC box found in gene promoters of β-globin (see pp. 80–81) is recognized by the erythroid-specific transcription factor EKLF (erythroid Kruppel-like factor).
Although the promoter is the site where the basal transcriptional machinery binds and initiates transcription, the promoter alone is not generally sufficient to initiate transcription at a physiologically significant rate. High-level gene expression generally requires activation of the basal transcriptional machinery by specific transcription factors, which bind to additional regulatory elements located near the target gene. Two general types of regulatory elements are recognized. First, positive regulatory elements or enhancers represent DNA-binding sites for proteins that activate transcription; the proteins that bind to these DNA elements are called activators. Second, negative regulatory elements (NREs) or silencers are DNA-binding sites for proteins that inhibit transcription; the proteins that bind to these DNA elements are called repressors (see Fig. 4-6).
A general property of enhancers and silencers is that they consist of modules of relatively short sequences of DNA, generally 6 to 12 bp. Regulatory elements are generally located in the vicinity of the genes that they regulate. Typically, regulatory elements reside in the 5′ flanking region that is upstream from the promoter. However, enhancers and silencers may be located downstream from the transcription initiation site or a considerable distance from the gene promoter, many hundreds or thousands of base pairs away. Moreover, the distance between the enhancer or silencer and the promoter can often be varied experimentally without substantially affecting transcriptional activity. In addition, many regulatory elements work equally well if their orientation is inverted. Thus, in contrast to the gene promoter, enhancers and silencers exhibit position independence and orientation independence. Another property of regulatory elements is that they are active on heterologous promoters; that is, if enhancers and silencers from one gene are placed near a promoter for a different gene, they can stimulate or inhibit transcription of the second gene.
After transcription factors (activators or repressors) bind to regulatory elements (enhancers or silencers), they may interact with the basal transcriptional machinery to alter gene transcription. How do transcription factors that bind to regulatory elements physically distant from the promoter interact with components of the basal transcriptional machinery? Regulatory elements may be located hundreds of base pairs from the promoter. This distance is much too great to permit proteins that are bound at the regulatory element and promoter to come into contact along a two-dimensional linear strand of DNA. Rather, DNA looping explains these long-range effects, whereby the transcription factor binds to the regulatory element, and the basal transcriptional machinery assembles on the gene promoter. Looping out of the intervening DNA permits physical interaction between the transcription factor and the basal transcriptional machinery, which subsequently leads to alterations in gene transcription.
In addition to enhancers and silencers, which regulate the expression of individual genes, some cis-acting regulatory elements are involved in the regulation of chromosomal domains containing multiple genes.
The first of this type of element to be discovered was the locus control region (LCR), also called the locus-activating region or dominant control region. The LCR is a dominant, positive-acting cis element that regulates the expression of several genes within a chromosomal domain. LCRs were first identified at the β-globin gene locus, which encodes the β-type subunits of hemoglobin. Together with α-type subunits, these β-globin–like subunits form embryonic, fetal, and adult hemoglobin (see Box 29-1). The β-globin gene locus consists of a cluster of five genes (ε, γᴳ, γᴬ, δ, β) that are distributed over 90 kilobases (kb) on chromosome 11. During ontogeny, the genes exhibit highly regulated patterns of expression in which they are transcribed only in certain tissues and only at precise developmental stages. Thus, embryonic globin (ε) is expressed in the yolk sac, fetal globins (γᴳ, γᴬ) are expressed in fetal liver, and adult globins (δ, β) are expressed in adult bone marrow. This tightly regulated expression pattern requires a regulatory region that is located far from the structural genes. This region, designated the LCR, extends from 6 to 18 kb upstream from the ε-globin gene. The LCR is essential for high-level expression of the β-globin–like genes within red blood cell precursors because the promoters and enhancers near the individual genes permit only low-level expression.
The β-globin LCR contains five sites, each with an enhancer-like structure that consists of modules of simple sequence elements that are binding sites for the erythrocyte-specific transcription factors GATA-1 and NF-E2. It is believed that the LCRs perform two functions: one is to alter the chromatin structure of the β-globin gene locus so that it is more accessible to transcription factors, and the second is to serve as a powerful enhancer of transcription of the individual genes. In one model, temporally dependent expression of β-type globin genes is achieved by sequential interactions involving activator proteins that bind to the LCR and promoters of individual genes (Fig. 4-7).
A potential problem associated with the existence of LCRs that can exert transcriptional effects over long distances is that the LCRs may interfere with the expression of nearby genes. One solution to this problem is provided by insulator elements, which function to isolate genes from neighboring regulatory elements. Insulator elements may represent sites of attachment of DNA to the chromosome scaffold, generating loops of physically separated DNA that may correspond to discrete functional domains. A transcription factor called CTCF (CCCTC-binding factor) binds to insulator elements and prevents interactions between regulatory elements and genes located on different sides of the insulator.
Figure 4-7 summarizes our understanding of the arrangement of cis-acting regulatory elements and their functions. Each gene has its own promoter where transcription is initiated. Enhancers are positively acting regulatory elements that may be located either near or distant from the transcription initiation site; silencers are regulatory elements that inhibit gene expression. A cluster of genes within a chromosomal domain may be under the control of an LCR. Finally, insulator elements functionally separate one chromosomal domain from another.
The best-characterized mutations affecting DNA regulatory elements occur at the gene cluster encoding the β-globin–like chains of hemoglobin. Some of these mutations result in thalassemia, whereas others cause hereditary persistence of fetal hemoglobin. The β-thalassemias are a heterogeneous group of disorders characterized by anemia caused by a deficiency in production of the β chain of hemoglobin. The anemia can be mild and inconsequential or severe and life-threatening. The thalassemias were among the first diseases to be characterized at the molecular level. As noted on page 80, the β-globin gene locus consists of five β-globin–like genes that are exclusively expressed in hematopoietic cells and exhibit temporal colinearity. As expected, many patients with β-thalassemia have mutations or deletions that affect the coding region of the β-globin gene. These patients presumably have thalassemia because the β-globin gene product is functionally abnormal or absent. In addition, some patients have a deficiency in β-globin as a result of inadequate levels of expression of the gene. Of particular interest are patients with the Hispanic and Dutch forms of β-thalassemia. These patients have deletions of portions of chromosome 11. However, the deletions do not extend to include the β-globin gene itself. Why, then, do these patients have β-globin deficiency? It turns out that the deletions involve the region 50 to 65 kb upstream from the β-globin gene, which contains the LCR. In these cases, deletion of the LCR results in failure of expression of the β-globin gene, even though the structural gene and its promoter are completely normal. These results underscore the essential role that the LCR plays in β-globin gene expression.
The preceding discussion has emphasized the structure of the gene and the cis-acting elements that regulate gene expression. We now turn to the proteins that interact with these DNA elements and thus regulate gene transcription. Because the basal transcriptional machinery (Pol II and the general transcription factors) is incapable of efficient gene transcription alone, additional proteins are required to stimulate the activity of the enzyme complex. The additional proteins include transcription factors that recognize and bind to specific DNA sequences (enhancers) located near their target genes, as well as others (see pp. 83–84) that do not bind to DNA.
Examples of DNA-binding transcription factors are shown in Table 4-1 . The general mechanism of action of a specific transcription factor is depicted in Figure 4-6 . After the basal transcriptional machinery assembles on the gene promoter, it can interact with a transcription factor that binds to a specific DNA element, the enhancer (or silencer). Looping out of the intervening DNA permits physical interaction between the activator (or repressor) and the basal transcriptional machinery, which subsequently leads to stimulation (or inhibition) of gene transcription. The specificity with which transcription factors bind to DNA depends on the interactions between the amino-acid side chains of the transcription factor and the purine and pyrimidine bases in DNA. Most of these interactions consist of noncovalent hydrogen bonds between amino acids and DNA bases. A peptide capable of a specific pattern of hydrogen bonding can recognize and bind to the reciprocal pattern in the major (and to a lesser extent the minor) groove of DNA. Interaction with the DNA backbone may also occur and involves electrostatic interactions (salt bridge formation) with anionic phosphate groups. The site that a transcription factor recognizes (see Table 4-1 ) is generally short, usually less than a dozen or so base pairs.
NAME | TYPE | RECOGNITION SITE * | BINDS AS |
---|---|---|---|
Sp1 | Zinc finger | 5′-GGGCGG-3′ | Monomer |
AP-1 | bZIP | 5′-TGASTCA-3′ | Dimer |
C/EBP | bZIP | 5′-ATTGCGCAAT-3′ | Dimer |
Heat shock factor | bZIP | 5′-NGAAN-3′ | Trimer |
ATF/CREB | bZIP | 5′-TGACGTCA-3′ | Dimer |
c-Myc | bHLH | 5′-CACGTG-3′ | Dimer |
Oct-1 | HTH | 5′-ATGCAAAT-3′ | Monomer |
NF-1 | Novel | 5′-TTGGCN₅GCCAA-3′ | Dimer |
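The recognition sites in Table 4-1 contain degenerate positions (S stands for C or G, and N for any base, following the standard IUPAC nucleotide codes). The following is a minimal sketch of how such a site could be searched for on both strands of a DNA sequence. The function names and the test sequences are hypothetical and serve only as illustration; they are not part of the chapter.

```python
import re

# IUPAC nucleotide codes that appear in Table 4-1 (S = C or G, N = any base).
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "S": "[CG]", "N": "[ACGT]"}


def site_to_regex(site):
    """Translate a degenerate recognition site into a compiled regular expression."""
    return re.compile("".join(IUPAC[base] for base in site))


def reverse_complement(seq):
    """Return the opposite strand of seq, read 5' to 3'."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_sites(site, dna):
    """Return sorted (position, strand) pairs for non-overlapping matches on either strand.

    Positions are 0-based and always refer to the forward strand.
    """
    pattern = site_to_regex(site)
    hits = [(m.start(), "+") for m in pattern.finditer(dna)]
    rc = reverse_complement(dna)
    # Convert reverse-strand coordinates back to forward-strand coordinates.
    hits += [(len(dna) - m.end(), "-") for m in pattern.finditer(rc)]
    return sorted(hits)


# Hypothetical test sequences, not taken from any real gene.
print(find_sites("GGGCGG", "TTGGGCGGAA"))    # Sp1 site on the forward strand
print(find_sites("TGASTCA", "AATGACTCATT"))  # AP-1 site, found on both strands
```

Because the reverse complement of 5′-TGACTCA-3′ is 5′-TGAGTCA-3′, which also fits the TGASTCA consensus, an AP-1 site is reported on both strands at the same position.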
DNA-binding transcription factors do not recognize single, unique DNA sequences; rather, they recognize a family of closely related sequences. For example, the transcription factor AP-1 (activator protein 1) recognizes the sequences
5′-TGACTCA-3′
5′-TGAGTCA-3′
5′-TGAGTCT-3′
and so on, as well as each of the complementary sequences. That is, some redundancy is usually built into the recognition sequence for a transcription factor. An important consequence of these properties is that the recognition site for a transcription factor may occur many times in the genome. For example, if a transcription factor recognizes a 6-bp sequence, the sequence would be expected to occur by chance once every 4⁶ (or 4096) base pairs, that is, about 7 × 10⁵ times in the human genome. If redundancy is permitted, recognition sites will occur even more frequently. Of course, most of these sites will not be relevant to gene regulation but will instead have occurred simply by chance. This high frequency of recognition sites leads to an important concept: transcription factors act in combination. Thus, high-level expression of a gene requires that a combination of multiple transcription factors bind to multiple regulatory elements. Although it is complicated, this system ensures that transcription activation occurs only at appropriate locations. Moreover, it permits finer tuning of gene expression, inasmuch as the activity of individual transcription factors can be altered to modulate the overall level of transcription of a gene.
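The arithmetic behind this estimate can be sketched as follows. The sketch assumes a haploid genome of about 3 × 10⁹ bp (a value implied by, but not stated in, the text), four equally likely and independent bases at every position, and a count on one strand only. The function names are illustrative.

```python
from math import prod

GENOME_SIZE_BP = 3.0e9  # assumption: approximate haploid human genome size


def match_probability(allowed_bases_per_position):
    """Probability that a random stretch of DNA matches a recognition site.

    Each entry is the number of bases (1 to 4) accepted at that position:
    1 for an exact base, 2 for a two-fold degenerate position, and so on.
    """
    return prod(k / 4 for k in allowed_bases_per_position)


def expected_occurrences(allowed_bases_per_position, genome_size=GENOME_SIZE_BP):
    """Expected number of chance occurrences of the site in the genome."""
    return genome_size * match_probability(allowed_bases_per_position)


exact_site = [1, 1, 1, 1, 1, 1]       # a fully specified 6-bp site
redundant_site = [1, 1, 1, 2, 1, 1]   # the same site with one two-fold degenerate position

print(1 / match_probability(exact_site))     # 4096.0 bp between chance occurrences
print(expected_occurrences(exact_site))      # 732421.875, i.e., about 7 x 10^5
print(expected_occurrences(redundant_site))  # twice as many once redundancy is allowed
```

A single two-fold degenerate position doubles the expected number of chance sites, which illustrates why one factor binding one site cannot by itself specify which genes are activated.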
An important general feature of DNA-binding transcription factors is their modular construction (Fig. 4-8 A). Transcription factors may be divided into discrete domains that bind DNA (DNA-binding domains) and domains that activate transcription (transactivation domains). N4-6 This property was first directly demonstrated for a yeast transcription factor known as GAL4, which activates certain genes when yeast grows in galactose-containing media. GAL4 has two domains. One is a so-called zinc finger (see p. 82) that mediates sequence-specific binding to DNA. The other domain is enriched in acidic amino acids (i.e., glutamate and aspartate) and is necessary for transcriptional activation. This “acidic blob” domain of GAL4 can be removed and replaced with the transactivation domain from a different transcription factor, VP16 (see Fig. 4-8 B). The resulting GAL4-VP16 chimera binds to the same DNA sequence as normal GAL4 but mediates transcriptional activation via the VP16 transactivation domain. This type of “domain-swapping” experiment indicates that transcription factors have a modular construction in which physically distinct domains mediate binding to DNA and transcriptional activation (or repression).
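The logic of the domain-swapping experiment can be expressed as a small toy model. This is only an illustrative sketch: the class and function names are hypothetical, and the model captures nothing beyond the idea that DNA-binding specificity and transactivation reside in separable modules.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Domain:
    source: str  # protein from which the domain was taken
    role: str    # "DNA-binding" or "transactivation"


@dataclass(frozen=True)
class TranscriptionFactor:
    name: str
    dna_binding: Domain
    transactivation: Domain

    def binds_same_site_as(self, other):
        """In this toy model, site specificity depends only on the DNA-binding domain."""
        return self.dna_binding == other.dna_binding


def swap_transactivation(factor, donor_domain, name):
    """Return a chimera that keeps the DNA-binding domain but carries a donor transactivation domain."""
    return replace(factor, name=name, transactivation=donor_domain)


gal4 = TranscriptionFactor(
    name="GAL4",
    dna_binding=Domain("GAL4", "DNA-binding"),
    transactivation=Domain("GAL4", "transactivation"),
)
vp16_activation = Domain("VP16", "transactivation")
chimera = swap_transactivation(gal4, vp16_activation, "GAL4-VP16")

print(chimera.binds_same_site_as(gal4))  # True: same DNA-binding domain
print(chimera.transactivation.source)    # VP16
```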
The following table groups some of the transcription factors (described in the text) on the basis of the type of transactivation domain (i.e., the domain that activates transcription).
Type of Transactivation Domain | Transcription Factors with This Domain |
---|---|
Acidic blob (rich in negatively charged amino acids—aspartate and glutamate) | GAL4 (a yeast transcription factor) VP16 (a herpesvirus transcription factor) |
Proline rich | CTF (a family of CCAAT-box–binding transcription factors; also known as NF-1) NF-1 (nuclear factor 1) |
Glutamine rich | Sp1 (stimulating protein 1) |
Serine/threonine rich | GHF-1/Pit-1 (growth hormone factor 1, which is the same as pituitary-specific transcription factor 1; an HTH-type transcription factor) |
On the basis of sequence conservation as well as structural determinations from x-ray crystallography and nuclear magnetic resonance spectroscopy, DNA-binding transcription factors have been grouped into families. Members of the same family use common structural motifs for binding DNA (see Table 4-1 ). These structures include the zinc finger, basic zipper, basic helix-loop-helix, helix-turn-helix, and β sheet. Each of these motifs consists of a particular tertiary protein structure in which a component, usually an α helix, interacts with DNA, especially the major groove of the DNA.
The term zinc finger describes a loop of protein held together at its base by a zinc ion that tetrahedrally coordinates to either two histidine residues and two cysteine residues or four cysteine residues. Sometimes two zinc ions coordinate to six cysteine groups. Figure 4-9 A shows a zinc finger in which Zn²⁺ coordinates to two residues on an α helix and two residues on a β sheet of the protein. The loop (or finger) of protein can protrude into the major groove of DNA, where amino-acid side chains can interact with the base pairs and thereby confer the capacity for sequence-specific DNA binding. Zinc fingers consist of 30 amino acids with the consensus sequence Cys-X₂₋₄-Cys-X₁₂-His-X₃₋₅-His, where X can be any amino acid. Transcription factors of this family contain at least two zinc fingers and may contain dozens. Three amino-acid residues at the tips of each zinc finger contact a DNA subsite that consists of three bases in the major groove of DNA; these residues are responsible for site recognition and binding (see Table 4-1). Zinc fingers are found in many mammalian transcription factors, including several that we discuss in this chapter (Egr-1, Wilms tumor protein [WT-1], and Sp1; see Table 4-1) as well as the steroid-hormone receptors (see p. 71).
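The consensus Cys-X₂₋₄-Cys-X₁₂-His-X₃₋₅-His translates directly into a pattern search over a protein sequence written in one-letter amino-acid code (C for cysteine, H for histidine). The sketch below is a minimal illustration; the function name and the test sequence are hypothetical, and a match indicates only that the spacing fits the consensus, not that the segment actually folds into a zinc finger.

```python
import re

# Cys-X(2-4)-Cys-X(12)-His-X(3-5)-His, where X (".") is any amino acid.
C2H2_PATTERN = re.compile(r"C.{2,4}C.{12}H.{3,5}H")


def find_zinc_finger_candidates(protein):
    """Return (start, matched_segment) for each non-overlapping consensus match."""
    return [(m.start(), m.group()) for m in C2H2_PATTERN.finditer(protein)]


# Synthetic sequence built to fit the consensus; alanine (A) fills every X position.
synthetic = "MK" + "C" + "AA" + "C" + "A" * 12 + "H" + "AAA" + "H" + "GG"
for start, segment in find_zinc_finger_candidates(synthetic):
    print(start, segment)
```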
Also known as the leucine zipper family, the basic zipper (bZIP) family consists of transcription factors that bind to DNA as dimers (see Fig. 4-9 B). Members include C/EBPβ (CCAAT/enhancer-binding protein-β), c-Fos, c-Jun, and CREB. Each monomer consists of two domains, a basic region that contacts DNA and a leucine zipper region that mediates dimerization. The basic region contains about 30 amino acids and is enriched in arginine and lysine residues. This region is responsible for sequence-specific binding to DNA via an α helix that inserts into the major groove of DNA. The leucine zipper consists of a region of about 30 amino acids in which every seventh residue is a leucine. Because of this spacing, the leucine residues align on a common face every second turn of an α helix. Two protein subunits that both contain leucine zippers can associate because of hydrophobic interactions between the leucine side chains; they form a tertiary structure called a coiled coil. Proteins of this family interact with DNA as homodimers or as structurally related heterodimers. Dimerization is essential for transcriptional activity because mutations of the leucine residues abolish both dimer formation and the ability to bind DNA and activate transcription. The crystal structure reveals that these transcription factors resemble scissors in which the blades represent the leucine zipper domains and the handles represent the DNA-binding domains (see Fig. 4-9 B).
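The spacing rule can be checked with a short calculation and a pattern search. The sketch below assumes the standard α-helical geometry of 3.6 residues per turn (a value not given in the text) and, as an arbitrary illustrative threshold, requires at least four leucines in register. The function names and test sequence are hypothetical.

```python
import re

HEPTAD = 7               # every seventh residue is a leucine
RESIDUES_PER_TURN = 3.6  # assumption: standard α-helix geometry


def turns_between_leucines():
    """Number of helical turns separating successive leucines of the zipper."""
    return HEPTAD / RESIDUES_PER_TURN


def find_leucine_zipper_candidates(protein, min_leucines=4):
    """Return (start, end) spans in which a leucine (L) recurs at every seventh position.

    min_leucines is an illustrative threshold, not a value from the chapter.
    """
    pattern = re.compile(r"L(?:.{6}L){%d,}" % (min_leucines - 1))
    return [(m.start(), m.end()) for m in pattern.finditer(protein)]


print(round(turns_between_leucines(), 2))  # 1.94, i.e., roughly every second turn

# Synthetic sequence: five leucines spaced seven residues apart, alanine elsewhere.
synthetic = "MK" + "LAAAAAA" * 4 + "L" + "GG"
print(find_leucine_zipper_candidates(synthetic))
```

Because seven residues correspond to slightly less than two full turns, successive leucines land on nearly the same face of the helix, which is what allows two such helices to pack against each other through their leucine side chains.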
Similar to the bZIP family, members of the basic helix-loop-helix (bHLH) family of transcription factors also bind to DNA as dimers. Each monomer has an extended α-helical segment containing the basic region that contacts DNA, linked by a loop to a second α helix that mediates dimer formation (see Fig. 4-9 C ). Thus, the bHLH transcription factor forms by association of four amphipathic α helices (two from each monomer) into a bundle. The basic domains of each monomer protrude into the major grooves on opposite sides of the DNA. bHLH proteins include the MyoD family, which is involved in muscle differentiation, and E proteins (E12 and E47). MyoD and an E protein generally bind to DNA as heterodimers. N4-7
The MyoD family of transcription factors includes MyoD itself as well as myogenin, myf5, and MRF4. All are involved in controlling the differentiation of muscle.
Some bHLH transcription factors contain additional domains—located immediately adjacent to the HLH domain—that mediate protein dimerization. A leucine zipper motif is contained in bHLH-Zip proteins such as c-Myc and SREBP, and a PAS domain is contained in bHLH-PAS proteins such as HIF-1α.
Homeodomain proteins that regulate embryonic development are members of the helix-turn-helix (HTH) family (see Fig. 4-9 D ). The homeodomain consists of a 60–amino-acid sequence that forms three α helices. Helices 1 and 2 lie adjacent to one another, and helix 3 is perpendicular and forms the DNA-recognition helix. Particular amino acids protrude from the recognition helix and contact bases in the major groove of the DNA. Examples of homeodomain proteins include the Hox proteins, which are involved in mammalian pattern formation; engrailed homologs, which are important in nervous system development; and the POU family members Pit-1, Oct-1, and unc-86. N4-8
On pages 82–83 we describe four families of transcription factors: zinc finger, bZIP, bHLH, and HTH. In each case, an α helix in the transcription factor binds in the major groove of the DNA. Some transcription factors use an antiparallel β-pleated sheet for DNA binding. The β sheet fills the major groove of DNA, and amino-acid side chains that are exposed on the face of the β sheet contact the DNA bases.
In addition to transcription factors that bind to DNA via β-pleated sheets, there are several other transcription factors that do not appear to fall into one of the four structural families listed on pages 82–83 . Thus, it seems likely that other structural motifs can also mediate DNA binding. One example is the forkhead domain. It is also important to note that some transcription factors bind to DNA through more than one domain. Examples include the POU family, in which the POU-specific domain is required in addition to the POU homeodomain for DNA binding.
Some transcription factors that are required for the activation of gene transcription do not directly bind to DNA. These proteins are called coactivators. Coactivators work in concert with DNA-binding transcriptional activators to stimulate gene transcription. They function as adapters or protein intermediaries that form protein-protein interactions between activators bound to enhancers and the basal transcriptional machinery assembled on the gene promoter (see Fig. 4-6 ). Coactivators often contain distinct domains, one that interacts with the transactivation domain of an activator and a second that interacts with components of the basal transcriptional machinery. Transcription factors that interact with repressors and play an analogous role in transcriptional repression are called corepressors.
One of the first coactivators found in eukaryotes was the VP16 herpesvirus protein discussed above (see Fig. 4-8 B ). VP16 has two domains. The first is a transactivation domain that contains a region of acidic amino acids that in turn interacts with two components of the basal transcriptional machinery, general transcription factors TFIIB and TFIID. The other domain of VP16 interacts with the ubiquitous activator Oct-1, which recognizes a DNA sequence called the octamer motif (see Table 4-1 ). Thus, VP16 activates transcription by bridging an activator and the basal transcriptional machinery.
Some coactivators play a general role in the activation of transcription. One example is Mediator, a multiprotein complex consisting of 28 to 30 subunits, which is not required for basal transcription but is essential for transcriptional activation by most activator proteins. Consistent with its essential role, Mediator is present in the basal transcriptional machinery or preinitiation complex.
Another type of coactivator is involved in transcriptional activation by specific transcription factors. This type of coactivator is not a component of the basal transcriptional machinery. Rather, these coactivators are recruited by a DNA-binding transcriptional activator through protein-protein interactions. An example is the coactivator CBP (CREB-binding protein), which interacts with a DNA-binding transcription factor called CREB (see Table 4-1 ).