Transcription and Epigenetic Regulation


Acknowledgment

The work was supported by Public Health Service NIH Grant R01-DK55732 and R37-DK45729 to JLM.

With the human genome sequencing project completed in 2001, perhaps the most important piece of information that we have learned is that the clues to our genetic destiny are contained in more than just the primary sequence of DNA encoding 20,500 proteins. Apparently, what distinguishes man from other life forms and most interestingly other mammals lies in the complex modifications, organization, and function of the 3.3 billion nucleotides (nt). Not only are these ~ 20,500 genes alternatively spliced, but their DNA, RNA, and protein products are chemically modified so as to change gene function. Therefore, as opposed to our genetic template being composed of a mere 20,500 genetic units, we are actually controlled by 20,500 to the nth power. The exponent has yet to be determined, but likely results in an enormous, perhaps infinite, combination of genetic events. This chapter will briefly summarize our basic understanding of gene expression, but will focus primarily on the new concepts and technologies of gene regulation in the postgenomic era. Arguably, the major advances since the 5th edition of this textbook continue to be the explosion in our understanding of noncoding RNAs, the impact of epigenetics, chromatin topology, and the refinement of high-throughput techniques.

Overview of Gene Organization

Nucleic Acids

The molecular definition of a eukaryotic gene is complex, but in the simplest terms, it is a nucleic acid sequence that encodes one polypeptide and one messenger ribonucleic acid (mRNA) molecule. Genes are composed of “two intertwining polymers” of deoxyribonucleic acid (DNA) that are noncovalently attached to a variety of proteins, including histones and specialized proteins (e.g., polymerases and various accessory proteins). The association of DNA, histones, and specialized nuclear proteins is collectively called chromatin. Chromosomes consist of continuous strands of chromatin that have been compacted by supercoiling and looping so as to fit into the nucleus ( Fig. 1.1 ). The steps governing the compaction and location of chromatin are now an area of intense investigation and will be discussed in Section 1.2 . Chromosomes are the basic heritable unit in the mammalian cell. In humans, there are 46 individual chromosomes or 23 chromosome pairs. The smallest unit of the DNA polymer is a nucleotide: a base attached to the first carbon of a five-carbon sugar that is phosphorylated at its fifth carbon ( Fig. 1.2 ). Nucleosides do not contain phosphates linked to the pentose sugar, thus differing from nucleotides, which contain one, two, or three phosphate groups. The type of base distinguishes the 4 nt found in DNA: adenine (A), thymine (T), cytosine (C), or guanine (G). They are bases because of the nitrogen groups contained within their single-ring (thymine, cytosine, or uracil) or double-ring structures (adenine or guanine). DNA contains the sugar deoxyribose, whereas RNA contains the sugar ribose and the base uracil (U) instead of thymine. CpG sites are dinucleotides consisting of a deoxycytidine in the 5′ position adjacent to a deoxyguanosine; the “p” indicates that one phosphate group separates these two nucleosides, and clusters of these dinucleotides are called CpG islands. CpG dinucleotides are “hot spots” for enzymes (e.g., DNMTs = DNA methyltransferases), which add a methyl group to the fifth carbon of the cytosine ring. This epigenetic mark blocks the expression of DNA and is a mechanism used frequently by gastrointestinal (GI) cancers to silence genes that block their ability to proliferate.

Fig. 1.1, Chromatin structure and organization. Each chromosome exists in a haploid (germ cells) or diploid/tetraploid state depending on their stage in the cell cycle. The short arm of the chromosome relative to the centromere is the

Fig. 1.2, Nucleic acid structure. A nucleoside consists of a purine or pyrimidine base covalently linked to the first carbon of the pentose ring. The addition of one, two, or three phosphate groups yields a nucleotide mono-, di-, or triphosphate. The type of sugar determines the type of nucleic acid: ribose in ribonucleic acids (RNAs) and deoxyribose in deoxyribonucleic acids (DNAs).

Nucleic Acid Polymers: DNA, RNA

Polymers of nucleotides (also called nucleoside mono-, di-, or triphosphates), known as nucleic acids, are formed when the phosphate group attached to the fifth carbon of one nucleotide’s pentose sugar condenses with the hydroxyl group on the third pentose carbon of the adjacent nucleotide, producing a phosphodiester bond (two ester bonds) and water. Accordingly, the proximal end of each DNA strand (5′ end) contains a phosphate group at the fifth carbon of the deoxyribose sugar residue. The terminal nucleotide at the 3′ end of each DNA strand contains a free hydroxyl group at the third carbon of the deoxyribose ring. By convention, nucleotide sequences are written from 5′ to 3′, reading from left to right, with the sense strand presented as the upper strand. The antisense strand, written on the bottom, is antiparallel and complementary to the sense strand so that its 5′ to 3′ direction proceeds from right to left. Each nucleotide within the polymer is base paired with a particular nucleotide on the opposing strand by hydrogen bonds: adenine with thymine and guanine with cytosine. The DNA strand containing the same sequence as the mRNA is designated the sense strand, and the strand that it pairs with is designated the antisense strand. The antisense strand is the template sequence that will be transcribed by RNA polymerase II (Pol II) into mRNA, which is subsequently translated into amino acids.
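The strand conventions described above can be illustrated with a minimal Python sketch. The function names and the 12-nt example sequence are hypothetical and chosen only for illustration; the pairing rules (A with T, G with C, and U replacing T in RNA) are taken from the text.

```python
# Watson-Crick pairing rules for DNA: A pairs with T, G pairs with C.
DNA_COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}
# Pairing of a DNA template base with the RNA base incorporated opposite it.
DNA_TO_RNA_PAIR = {"A": "U", "T": "A", "G": "C", "C": "G"}


def antisense_strand(sense: str) -> str:
    """Return the antisense (template) strand, written 5' to 3'."""
    return "".join(DNA_COMPLEMENT[base] for base in reversed(sense.upper()))


def transcribe(template: str) -> str:
    """Return the mRNA (5' to 3') made from a template strand given 5' to 3'."""
    return "".join(DNA_TO_RNA_PAIR[base] for base in reversed(template.upper()))


sense = "ATGGCCTTAGGC"  # hypothetical sense strand, written 5' to 3'
template = antisense_strand(sense)
mrna = transcribe(template)

print("sense    5'-" + sense + "-3'")
print("template 5'-" + template + "-3'")
print("mRNA     5'-" + mrna + "-3'")
```

Because the antisense strand is the template, the resulting mRNA carries the same sequence as the sense strand, with uracil in place of thymine.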

Most studies of transcriptional control focus on genes transcribed by the 12-subunit enzyme Pol II, which are designated class II genes. Pol II is responsible for transcribing gene sequences into protein-encoding mRNA. Less than 2% of total RNA in the cell is mRNA. Many of the initial primary transcripts (hnRNA, for heterogeneous nuclear RNA) are further processed as discussed below. Moreover, 98% of the nucleotides in the human genome do not reside in exons (sequences that encode proteins). Nevertheless, at least 50% of this noncoding sequence is transcribed and serves a function. Nine percent of cellular RNA is hnRNA, the bulk of which is small nuclear RNA (snRNA, e.g., U2, which is involved in RNA splicing; 4%) and small nucleolar RNA (snoRNA, e.g., U22; 1%). The other 4% of hnRNA is mRNA. An additional 1% of total cell RNA is microRNA (miRNA), previously called guide RNA (gRNA), which edits mature mRNA transcripts. RNA polymerase I (Pol I) transcribes all of the ribosomal genes except for the 5S gene. Ribosomal RNA represents about 75% of the RNA in the cell and is essential for translation. RNA polymerase III (Pol III) transcribes the 5S ribosomal gene and the genes encoding transfer RNA (tRNA). Transfer RNA represents about 15% of the total RNA in the cell. Pol I and Pol III transcribe genes whose noncoding RNA transcripts will not be translated into peptides, although their primary transcripts are also processed before reaching the cytoplasm. Since Pol II transcribes genes encoding proteins, peptides, long noncoding RNAs (lncRNAs), and miRNAs, Pol II-regulated genes will be the primary focus of this chapter.

Gene Composition

A gene is analogous to a long sentence read from left to right and comprised of letters organized into words separated by spaces and punctuations. Specific DNA sequences “punctuate” the gene with important start and stop signals for transcription and translation. Several hundred to several thousand DNA base pairs (bp) may comprise one gene. These bp (the alphabet) are organized into functional groups (phrases) on the basis of whether a particular sequence is untranscribed, only transcribed (RNA), or both transcribed and translated (RNA and protein) ( Fig. 1.3 ). Exons are DNA sequences that are transcribed into mRNA by Pol II and exit the nucleus. Within the cytoplasm, exons may or may not be translated into peptides. Those exons that are transcribed and translated form the coding sequences (coding exon). In general, the term intron is used to describe the intervening DNA sequence that is transcribed but is subsequently removed from the primary transcript by RNA splicing (RNA processing) before exiting the nucleus as a mature transcript. However, it is now clear that many transcribed DNA sequences generate small noncoding RNA transcripts such as miRNAs or lncRNAs that can inhibit or modulate protein-coding genes in “cis or trans.” LncRNAs are commonly defined as transcripts that are > 200 nt that do not encode a protein compared to the significantly shorter miRNAs.

Fig. 1.3, Gene structure, transcription, and posttranscriptional processing. A gene is comprised of several hundred to several thousand bp, subdivided into functional elements. The locations of 5′ and 3′ untranslated sequences, exons, and introns are shown. The 5′ flanking sequences contain specific DNA elements (e.g., TATA box). RNA polymerase II transcribes DNA into heterogeneous nuclear RNA (hnRNA) during transcription. Twenty bp after the sequence AATAAA is transcribed to AAUAAA, mRNA is cleaved and the polyadenylate (poly(A)) tail is added to the 3′ end. A methylated guanylate residue is added to the 5′ end of the mRNA through a triphosphate linkage. Prior to exiting the nucleus, intron segments are removed by splicing factors during posttranscriptional processing.

DNA sequences or elements that regulate transcription and are not transcribed into mRNA usually reside in the 5′ portion of a gene upstream (to the left) of the promoter. The promoter is a cluster of DNA sequences that binds Pol II in concert with accessory proteins to initiate the synthesis of mRNA. Accessory proteins control the accuracy and rate of polymerase binding. The first nucleotide transcribed into mRNA is assigned the number 1, with subsequent nucleotides (downstream or to the right of the promoter) assigned positive numbers as transcription proceeds toward the 3′ end. Nucleotides upstream (5′) of the first transcribed nucleotide, including the promoter, are assigned negative numbers. DNA sequences that encode a polypeptide (open reading frame) begin with the translational start codon ATG (encoding methionine) and end with one of the three stop codons: TAA, TAG, or TGA. Thus, the translational start and three stop codons, respectively, are transcribed into mRNA as AUG, UAA, UAG, and UGA. Since there are four different DNA bases and it takes three bases (a triplet) to encode an amino acid, there are 4³ = 64 possible codons for 20 amino acids. In this way, the nucleotide code for proteins is considered “degenerate.” The redundant genetic code protects against the deleterious effects of mutations, as detailed in the next paragraph. In addition, two or three peptides can be encoded by overlapping codons simply by shifting the reading frame by 1 or 2 nt. Regulatory sequences that are transcribed but not translated reside at both the 5′ and 3′ ends of the mature RNA transcript. Both 5′ and 3′ untranslated regulatory sequences, which range from 10 to several thousand nucleotides, participate in the fidelity of translation and in mRNA stabilization or destabilization.
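The codon arithmetic above can be checked with a short Python sketch. It builds the standard genetic code from a compact string, counts the codons, and translates a small open reading frame. The example sequence and the helper name `translate` are hypothetical and serve only as an illustration.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code, listed in TCAG order for the first, second, and
# third base of each codon; "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(orf: str) -> str:
    """Translate a DNA open reading frame, stopping at the first stop codon."""
    peptide = []
    for i in range(0, len(orf) - 2, 3):
        aa = CODON_TABLE[orf[i:i + 3].upper()]
        if aa == "*":
            break
        peptide.append(aa)
    return "".join(peptide)


# Degeneracy: how many codons encode each amino acid (stops excluded).
codons_per_amino_acid = Counter(aa for aa in CODON_TABLE.values() if aa != "*")
stop_codons = sorted(c for c, aa in CODON_TABLE.items() if aa == "*")

print(len(CODON_TABLE), "codons for", len(codons_per_amino_acid), "amino acids")
print("stop codons:", stop_codons)
print(translate("ATGGCCTTAGGCTAA"))  # hypothetical ORF: ATG ... TAA
```

With 61 sense codons shared among 20 amino acids, most amino acids are specified by more than one triplet, which is the degeneracy referred to in the text.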

The degeneracy of the genetic code (several codon triplets encoding one amino acid) is what allows some bp changes (mutations) within an exon to exhibit no deleterious phenotype. The bp change is designated synonymous if the same amino acid is encoded (also known as a silent mutation) or nonsynonymous if a different amino acid is substituted. Strictly speaking, a mutation means that there has been a bp change, whether or not the change affects the type of amino acid inserted into a peptide. Despite a nonsynonymous mutation in the coding sequence, the amino acid substitution might not change the physical characteristics (phenotype) of the organism or confer phenotypic advantages or disadvantages. Changes in the genetic code that put an organism at a disadvantage and contribute to disease are what we commonly call “mutations.” Base pair changes in DNA that are neutral or that impart an advantage or disadvantage to the organism are also known as single nucleotide polymorphisms (SNPs). These SNPs can render subtle differences in the way an organism responds to its environment or other genetic influences ( Fig. 1.4 ). SNPs are a focus of intense investigation due to their use in genome-wide scans to identify genes contributing to common multigene disorders, for example, diabetes and hypertension.
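The distinction between synonymous and nonsynonymous changes can be sketched in Python as follows. The function name and the example codons are hypothetical, and the sketch follows the two categories used in the text, so a change that creates a stop codon is simply counted as nonsynonymous here.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order; "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_substitution(codon: str, position: int, new_base: str) -> str:
    """Classify a single-base change within a codon (position is 0, 1, or 2)."""
    codon = codon.upper()
    mutant = codon[:position] + new_base.upper() + codon[position + 1:]
    if CODON_TABLE[codon] == CODON_TABLE[mutant]:
        return "synonymous"
    # Assumption for this sketch: any change in the encoded residue,
    # including a new stop codon, is reported as nonsynonymous.
    return "nonsynonymous"


# GCC and GCT both encode alanine, so a third-position change is silent.
print(classify_substitution("GCC", 2, "T"))
# GCC (alanine) to ACC (threonine) substitutes a different amino acid.
print(classify_substitution("GCC", 0, "A"))
```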

Fig. 1.4, Single nucleotide polymorphism (SNP). Schematic diagram of a SNP in which a protein encoding gene sequence differs between two individuals by one nucleotide.

RNA Species

RNA molecules that encode proteins (except most histone proteins) are distinguished from ribosomal and transfer RNA by the series of adenosines added to the 3′ end of the molecule, commonly referred to as the poly(A) tail ( Fig. 1.3 ). This feature is a useful means to isolate mRNA from more abundant RNA species (transfer and ribosomal RNA) and also designates the functional termination of the protein-encoding portion of the gene. During transcription, the primary RNA transcript is cleaved 20 bp downstream of the AAUAAA site at the 3′ end, and ~150–200 adenine nucleotides are added to form the poly(A) tail. The 5′ end of the mRNA transcript receives a protective “cap” after synthesis of the first 30 nt; the cap consists of a guanylate residue methylated at the seventh position and linked to the first nucleotide of the RNA by three phosphates. The RNA cap is a high-affinity binding site for ribosomes. It should be noted that the element AATAAA indicates the site of the poly(A) tail, but is not necessarily the functional end of the gene. Rather, the 3′ untranslated region (3′UTR) and 3′ untranscribed regions may still contain regulatory elements that modulate gene expression. In fact, most miRNAs bind sequences in the 3′UTR. Therefore, like the 5′ end of a gene, the 3′ end of the gene must be determined empirically.

Two classes of noncoding RNAs transcribed by Pol II have motivated the current expanded interest in RNA biology: microRNAs (miRNAs) and long noncoding RNAs (lncRNAs). miRNAs are a class of noncoding RNAs generated primarily from DNA sequences between genes (intergenic), within introns, or at the 3′ end of a gene. They were originally identified in plants and worms as posttranscriptional regulators of gene silencing. Pol II and sometimes Pol III transcribe DNA to produce primary miRNA transcripts. In addition, transcription factors modulate the expression of miRNAs as they do for protein-encoding genes. For instance, extracellular signaling via typical signal transduction pathways and epigenetic mechanisms regulate the expression of miRNAs. The gene product is RNA rather than protein and, because of its small size and less stringent binding requirements, exerts its effect on its own locus as well as on multiple other loci. In this way, miRNAs are thought to regulate at least one-third of all human genes.

miRNAs are synthesized in the nucleus as a primary transcript (pri-miRNA) capable of forming several hairpin structures through internal complementarity ( Fig. 1.5 ). The microprocessing complex, which contains a nuclear RNase III endonuclease called Drosha and the DiGeorge syndrome critical region 8 protein (DGCR8), cleaves the pri-miRNA transcript. The Drosha protein complex removes flanking segments and an ~11 bp stem region. This step converts the pri-miRNA to precursor miRNAs (pre-miRNAs). Pre-miRNAs are typically 60–70-nt long hairpin RNAs with 2-nt overhangs at the 3′ end. The nuclear export receptor exportin-5 and RanGTP transport the pre-miRNA into the cytoplasm, where it is further processed by a complex containing another RNase III endonuclease called Dicer. Dicer partners with RNA-binding proteins to cleave the pre-miRNA into 21–25 nt duplexes. The miRNA/miRNA* duplex consists of a guide RNA strand and a passenger strand, indicated by an asterisk (miRNA*), that is discarded upon assembly of the RNA-induced silencing complex (RISC). Loading the miRNA/miRNA* duplex into RISC is a four-step process requiring ATP hydrolysis and the major RISC protein component, Argonaute (Ago proteins). Upon unwinding of the duplex, the miRNA* strand is discarded, leaving a single-stranded 21–25 nt RNA molecule available for silencing specific clusters of genes by hybridizing to their 3′UTRs. Ago proteins coat miRNAs and, along with exosomes, protect miRNAs from degradation in biofluids such as blood and urine, rendering them potential biomarkers.
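How a guide strand might locate sites in a 3′UTR by hybridization can be sketched in Python. The text does not specify which part of the miRNA must pair, so this sketch assumes the commonly used “seed” convention (nucleotides 2–8 of the guide strand) and requires perfect Watson-Crick pairing, ignoring G:U wobble. The function name, the guide sequence, and the 3′UTR are illustrative only.

```python
RNA_COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def seed_match_sites(mirna: str, utr: str, seed_start: int = 1, seed_end: int = 8) -> list:
    """Return 0-based 3'UTR positions that pair perfectly with the miRNA seed.

    Assumption: the seed is nucleotides 2-8 of the guide strand, i.e. the
    Python slice [1:8]. Both sequences are written 5' to 3'.
    """
    seed = mirna.upper()[seed_start:seed_end]
    # The pairing site on the mRNA is the reverse complement of the seed.
    site = "".join(RNA_COMPLEMENT[base] for base in reversed(seed))
    positions = []
    start = utr.upper().find(site)
    while start != -1:
        positions.append(start)
        start = utr.upper().find(site, start + 1)
    return positions


guide = "UGAGGUAGUAGGUUGUAUAGUU"   # illustrative 22-nt guide strand
utr = "AAACUACCUCAAAGGGCUACCUCUU"   # hypothetical 3'UTR fragment
print(seed_match_sites(guide, utr))
```

Because only seven nucleotides need to pair in this simplified model, a single miRNA can match sites in many different transcripts, consistent with the less stringent binding requirements noted in the text.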

Fig. 1.5, Synthesis of microRNAs (miRNA). miRNAs are synthesized from the primary miRNA (pri-miRNA), which is then processed to the pre-miRNA. The RanGTP/exportin-5 complex transports the pre-miRNA to the cytoplasm, where the pre-miRNA is further processed to the miRNA/miRNA* duplex. miRNA* indicates the passenger strand that is discarded upon assembly of the RNA-induced silencing complex (RISC). The Argonaute (Ago) proteins are the major protein component of the RISC. TRBP = TAR RNA-binding protein (aka PACT).

Long noncoding RNAs are nucleic acids that do not encode a protein and are at least 200 nt long. They are distinguished from miRNAs by their size (lncRNAs > 200 nt versus miRNAs ~22 nt) and by their ability to exhibit more diverse functions. miRNAs typically suppress multiple gene targets, whereas lncRNAs typically regulate the gene from which they are transcribed, albeit by multiple mechanisms. Whole genome sequencing has identified more noncoding transcripts than coding transcripts, complicating our ability to define their function. lncRNAs can function in “cis” or “trans” and can circularize or remain linear. Moreover, lncRNAs can function as protein scaffolds by recruiting regulatory complexes to genes, or behave as decoys, signaling molecules, or antisense interference transcripts. Through these diverse behaviors, lncRNAs exhibit pleomorphic functions such as genomic imprinting, chromosome shaping, and allosteric enzyme regulation. The function of most lncRNAs is unknown, and thus the transcripts have simply been named numerically. Those lncRNAs that have been assigned a function include XIST (X chromosome inactivation), HOTAIR (Hox transcript antisense RNA), and TERC (telomerase elongation).

Linking Gene Structure to Function

Previously, the 5′ border of a gene was identified by the promoter region (functionally determined) and by the first nucleotide transcribed into mRNA (cap site), determined empirically by various reverse transcriptase methods, for example, primer extension analysis or anchored polymerase chain reaction (PCR), and by DNase I hypersensitivity sites. The reverse transcriptase techniques synthesized complementary or copy DNA (cDNA) ( Fig. 1.6 ). Radiolabeled primers complementary to the 5′ end of the sequence to be copied were allowed to anneal to mRNA. Reverse transcriptase then added deoxynucleotides to the 3′ end of the primer, copying the mRNA template in the 3′ to 5′ direction. Synthesis of the cDNA terminated when the 5′ end of the mRNA was reached. Template mRNA molecules were removed by ribonucleases (RNases), and the synthesis of a double-stranded cDNA was completed through the action of DNA polymerase. Because the newly synthesized cDNA was radiolabeled at the 5′ end, the length of the cDNA (and hence the transcriptional start site) was determined by resolving the fragments on a denaturing polyacrylamide gel and comparing the observed length in bp to the known cDNA sequence.

Fig. 1.6, Complementary DNA (cDNA). Primers complementary to a portion of the mRNA are allowed to anneal. For unknown sequences, as in the synthesis of cDNA libraries, a primer complementary to the poly (A) tail is used, i.e., poly (dT). Reverse transcriptase added along with all four deoxynucleotides (dNTPs) will transcribe mRNA in the 3′ to 5′ direction to make copy DNA. The mRNA template is removed by RNases, and double-stranded cDNA is made using DNA polymerase. In primer extension analysis, the 5′ end of mRNA (the cap site) is identified by annealing primers of a known sequence near the 5′ end of mRNA.

In the age of whole genome analysis, the characterization of gene function has lagged behind the generation of transcript maps. In other words, biochemical, genome-based assays such as DNase-seq, ATAC-seq (assay for transposase-accessible chromatin), ChIP-seq, and 3C (chromatin conformation capture) do not provide an assessment of function. This has led to the development of high-throughput methods to identify changes in gene transcription levels (both coding and noncoding). These include RNA-seq and STARR-seq (self-transcribing active regulatory region sequencing). In addition, CRISPR/Cas9 methods of activating or silencing genes in situ have permitted the development of functional readouts for enhancer modification within the endogenous environment.

We now know that these additional DNA sequences might encode noncoding RNA that regulates gene expression, in addition to the well-described enhancer sequences. Specific DNA elements called insulator elements mark the boundary of genes. These elements, originally identified on the globin gene, bind an 11-zinc finger transcription factor called CTCF, which is capable of blocking the spread of histone acetylation between adjacent genes. More recently, it has become clear that gene expression occurs in insulated neighborhoods generated by chromosomal loops formed at binding sites for CTCF and the cohesin complex. Thus, enhancer or repressor sequences that are kilobases away from the transcriptional start site (TSS) can be brought closer to the genes that they regulate by forming gene-enhancer/repressor “neighborhoods” called topologically associated domains (TADs). It has recently been shown that CTCF-binding site mutations that prevent the formation of TADs can cause disease.

Given the requirement for larger and larger pieces of DNA to recapitulate native expression in transgenic mouse models, techniques have been developed to clone and manipulate large pieces of DNA (over 50 kilobases), for example, yeast artificial chromosomes (YACs) and bacterial artificial chromosomes (BACs). Recombineering is a powerful technique performed in bacteria that permits the introduction of foreign DNA or point mutations into these large plasmids, which are eventually introduced into transgenic mice, but it has been superseded by a powerful new technology called CRISPR-Cas9.

CRISPR/Cas represents the latest and, to date, the most powerful breakthrough in our ability to modify or manipulate the genome with precision. The term CRISPR stands for clustered regularly interspaced short palindromic repeats, and Cas is the abbreviation for CRISPR-associated protein. Cas9 is a nuclease that uses a guide RNA to direct the enzyme, through Watson-Crick base pairing, to the specific DNA sequence to be modified. The technique is derived from a simple, RNA-guided mechanism by which bacteria and Archaea defend themselves from the DNA of invading bacteriophages (an adaptive immune mechanism). In short, the technology originates from studying the bacterial immune system and consists of two parts: a DNA-binding domain that recognizes the sequence to be modified and an effector domain that mediates double-strand DNA breakage. These two steps activate the host cell’s DNA repair machinery to repair the break by nonhomologous end joining, resulting in modification of the targeted sequence. The specificity of the technology lies in the ability to program the guide RNA. Prior to CRISPR/Cas, zinc-finger nucleases (ZFNs) and transcription activator-like effector nucleases (TALENs) were the primary methods used to execute programmable genome editing.
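The programmability of the guide RNA can be illustrated with a minimal Python sketch that scans a DNA sequence for candidate Cas9 target sites. Two details are assumptions not stated in the text: a 20-nt target (protospacer) and the requirement for an adjacent “NGG” motif, which is the convention for the widely used Streptococcus pyogenes Cas9. Only the strand supplied is searched, and the function names and example sequence are hypothetical.

```python
import re


def find_cas9_targets(dna: str, guide_length: int = 20) -> list:
    """Return (position, protospacer, PAM) for each site followed by NGG.

    Assumptions: 20-nt protospacer and an NGG motif immediately 3' of it;
    only the strand passed in is scanned.
    """
    dna = dna.upper()
    # A lookahead is used so that overlapping candidate sites are all reported.
    pattern = re.compile(rf"(?=([ACGT]{{{guide_length}}})([ACGT]GG))")
    return [(m.start(), m.group(1), m.group(2)) for m in pattern.finditer(dna)]


def guide_rna_spacer(protospacer: str) -> str:
    """Spacer of the guide RNA: the protospacer sequence with U in place of T.

    The spacer base pairs with the opposite (complementary) DNA strand.
    """
    return protospacer.upper().replace("T", "U")


dna = "TTGACCTGAAGCTTCGATCCGATGGCAT"  # hypothetical target region, 5' to 3'
for position, protospacer, pam in find_cas9_targets(dna):
    print(position, protospacer, pam, guide_rna_spacer(protospacer))
```

Changing the spacer sequence redirects the same nuclease to a different site, which is the sense in which the specificity lies in the guide RNA.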

Epigenetic Influences

Epigenetics, which literally means “outside of or beyond genetics,” refers to the “study of genetic modifications that are mitotically and/or meiotically heritable yet do not change the DNA sequence”. Mutations or deletions can alter the sequence or length of a gene, which in turn alters the primary sequence of the protein. By contrast, epigenetic influences chemically modify the nucleotide or amino acid structure, which in turn changes how that particular DNA or (histone) protein is recognized by nuclear proteins without changing the sequence itself. Although it is now clear from the completed sequence of the human genome that there are only about 20,500 gene loci, the complexity of the genetic information encoded in human chromosomes must enlist other features of chromatin. The epigenetic influences on chromatin appear to be one of the critical features that enhance genomic complexity. A major target of epigenetic changes is histones, basic proteins that coat the naked DNA double helix. The N-terminal tails of histones (H1, H2A, H2B, H3, H4) are positively charged due to the basic amino acid lysine. The positively charged histones attach to DNA because of the negatively charged phosphate groups comprising the DNA backbone. The ionic interaction is reduced if the positive charge on the lysines is removed. Specific enzymes called histone acetyltransferases (HATs) acetylate the lysine side group, effectively eliminating the positive charge ( Fig. 1.7 ). The loss of the ionic interaction between the histones and phosphate groups on DNA permits greater access to the DNA helix by accessory proteins such as polymerases, transcription factors, and coactivators or repressors. Chromatin becomes “open,” accessible, and readily transcribed. By contrast, enzymes called histone deacetylases (HDACs) “close” chromatin by removing the acetyl groups from the lysines at the N-terminal tails of histone proteins. Removal of the acetyl group restores the positive charge to histones, allowing the ionic interaction between histones and DNA to be restored. Consequently, nonhistone proteins such as polymerases and transcription factors become excluded from DNA, transcription is silenced, and chromatin becomes inactive.

Fig. 1.7, Nucleosome structure and histone modifications on histone tails. (A) The double-strand DNA helix winds twice around a complex of the four core histones assembled as dimers. Unacetylated histones are positively charged and adhere tightly to the negatively charged DNA, preventing access by transcription regulatory proteins. Histones that are acetylated are less positively charged and do not adhere as tightly to DNA, allowing access of regulatory proteins to the DNA. The addition or removal of acetyl groups at the ends of histones is regulated by acetyltransferase (HATs) and deacetylase enzyme complexes (HDACs). The short chain fatty acid butyrate inhibits the activity of HDACs. (B) Shown are the amino-terminal histone residues modified by acetylation, methylation, and phosphorylation.

Collectively, histones and accessory proteins associated noncovalently with DNA form chromatin. Chromatin exists in two forms: euchromatin and heterochromatin. Euchromatin contains actively transcribed genes that decondense during DNA replication and is centrally located in the nucleus. By contrast, heterochromatin contains transcriptionally silent genes that remain condensed at the periphery of the nucleus. The DNA sequences within heterochromatin are repetitive, and only 15% of nuclear chromatin is heterochromatin. The major epigenetic modifications in mammalian cells occur on DNA and histones and include covalent modifications such as methylation and acetylation, as well as the addition of other organic residues. The most common epigenetic change is DNA methylation, which is currently the only epigenetic change known to occur on DNA. By contrast, histone proteins undergo over 100 types of epigenetic modifications, of which the most common are acetylation, methylation, and phosphorylation. Histones are frequently the target of changes, but nuclear regulatory proteins, for example, transcription factors, can also be covalently modified, most commonly by phosphorylation. Epigenetic changes affect such events as chromatin folding, gene expression, X-chromosome inactivation, and genomic imprinting. They are essential for development and differentiation, in which clusters of genes must be activated or silenced at precisely timed intervals during an organism’s growth and maturation. In addition, epigenetic changes provide mechanisms by which the environment affects the genome, for example, in the context of the microbiota, immune disorders, and cancer.

DNA Methylation

DNA methylation is a postsynthesis modification that normal DNA undergoes after each replication. This modification is catalyzed by DNA methyltransferases (DNMTs) and occurs on the C-5 position of cytosine residues within CpG dinucleotides located primarily in the promoter of a gene. There are three major DNMTs (DNMT1, DNMT3A, and DNMT3B), each of which plays a distinct and critical role in cells. Murine knockouts of DNMT1 and DNMT3B exhibit embryonic lethality. The homozygous DNMT3A knockout mouse appeared normal at birth but died by 4 weeks of age. In humans, mutations of DNMT3B are linked to ICF syndrome (immunodeficiency, centromere instability, facial anomalies). DNMT1 functions as the “maintenance” methyltransferase since it acts during cell division to methylate the newly synthesized DNA strand as dictated by the hemimethylated complementary strand. DNMT3A plays a central role in the methylation of neural-specific genes.

Sixty percent of human genes contain a CpG island. While methylation can also occur in other parts of the gene, CpG dinucleotides tend to be underrepresented in the genome and, when they are found, appear in clusters ranging from 0.5 to several kilobases with a GC content greater than 55%. About 15% of CpG dinucleotides cluster in these short DNA segments, known as CpG islands. The remaining 85% of CpG dinucleotides are spread throughout the genome in repetitive hypermethylated segments that are transcriptionally silent. Methylation of “CpG islands” is a late evolutionary development and functions to maintain genome stability by repressing transposons and repetitive DNA elements.
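The sequence criteria for a CpG island can be expressed as a short Python sketch. The length (at least 0.5 kilobases) and GC content (greater than 55%) thresholds come from the text. The observed-to-expected CpG ratio and its cutoff of 0.6 are an added assumption, borrowed from commonly used island definitions, to capture the idea that CpG dinucleotides are otherwise underrepresented. Function names and test sequences are hypothetical.

```python
def gc_content(seq: str) -> float:
    """Fraction of bases that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def cpg_observed_expected(seq: str) -> float:
    """Observed CpG count divided by the count expected from C and G content."""
    seq = seq.upper()
    c_count, g_count = seq.count("C"), seq.count("G")
    if c_count == 0 or g_count == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c_count * g_count)


def looks_like_cpg_island(seq: str, min_length: int = 500,
                          min_gc: float = 0.55, min_obs_exp: float = 0.6) -> bool:
    """Apply the length and GC thresholds from the text plus an assumed
    observed/expected CpG cutoff (0.6 is an assumption of this sketch)."""
    return (len(seq) >= min_length
            and gc_content(seq) > min_gc
            and cpg_observed_expected(seq) > min_obs_exp)


cpg_rich = "CGCGGCGCTACG" * 50        # hypothetical 600-nt CpG-rich segment
at_rich = "ATTACAGTTTAGCATAAT" * 40   # hypothetical 720-nt AT-rich segment
print(looks_like_cpg_island(cpg_rich), looks_like_cpg_island(at_rich))
```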

DNA methylation is an important event in many processes, including transcriptional repression, X chromosome inactivation, and genomic imprinting. CpG islands are located in the promoter region of genes about 60% of the time and are normally hypomethylated, particularly in germ cells. Collectively, these CpG clusters or islands cover only about 0.7% of the entire genome, which is still equivalent to several million nucleotides. Hypermethylation at CpG islands induces transcriptional silencing that in turn is stably inherited. Thus, as cells differentiate, a significant percentage of these CpG islands become methylated in a tissue-specific manner. Typically, these would be genes involved in cell renewal. As observed with HDACs and deacetylation, the methylation status of cancers might seem contradictory. Yet, aberrant de novo hypermethylation of CpG islands is a hallmark of some human cancers and occurs early during carcinogenesis. Tumor suppressor genes are locally hypermethylated by some cancers to silence their expression, whereas oncogenes might be hypomethylated.

The DNA of tumor cells is globally hypomethylated, a process that is linked to nutritional status, for example, vitamin B12 (cobalamin) absorption. Cobalamin is required for the synthesis of S-adenosylmethionine, the primary methyl donor in the cell. In this way, reduced cobalamin absorption, as sometimes observed in Crohn’s disease or pernicious anemia, would provide an environment favorable to cancer. Niacin, which is required to form the NAD needed for ADP-ribosylation of histones, also affects chromatin structure.

The most precise approach to assessing DNA methylcytosines is bisulfite sequencing. Treating DNA with sodium bisulfite converts unmethylated cytosines to uracils, which are read as thymines when the DNA is subjected to conventional sequencing. Methylated cytosines are still read as cytosines. Although bisulfite sequencing is not as easy to scale up to a genome-wide analysis as methylation-sensitive restriction enzyme (MSRE) analysis, sequencing is the most accurate way to determine the methylated sites in DNA, or the methylome.
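
The logic of bisulfite sequencing can be illustrated with a short simulation. The following is a minimal sketch, not a real analysis pipeline. It assumes a single DNA strand, complete conversion, error-free sequencing, and 0-based positions, and the function names are hypothetical.

```python
def bisulfite_convert(seq, methylated_positions):
    """Simulate the read obtained after bisulfite treatment and sequencing.

    Unmethylated C is converted to U and read as T; methylated C stays C.
    methylated_positions holds 0-based indices of methylated cytosines.
    """
    methylated = set(methylated_positions)
    read = []
    for i, base in enumerate(seq.upper()):
        if base == "C" and i not in methylated:
            read.append("T")
        else:
            read.append(base)
    return "".join(read)


def call_methylation(reference, read):
    """Compare a bisulfite read with the untreated reference sequence.

    Returns {position: True/False} for every reference cytosine:
    True if it is still read as C (methylated), False if read as T.
    """
    calls = {}
    for i, (ref, obs) in enumerate(zip(reference.upper(), read.upper())):
        if ref == "C":
            if obs == "C":
                calls[i] = True
            elif obs == "T":
                calls[i] = False
    return calls


if __name__ == "__main__":
    reference = "ACGTCCGA"  # toy sequence, for illustration only
    read = bisulfite_convert(reference, methylated_positions={1})
    print(read)
    print(call_methylation(reference, read))
```

In the toy example only the cytosine at index 1 is methylated, so it alone survives as a C in the read, while the cytosines at indices 4 and 5 are read as thymines.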

Genomic imprinting occurs in gametogenesis and is necessary for development. One of the X chromosomes in females is not expressed due to the heavy methylation of the inactive X chromosome. The epigenetic phenomenon whereby expression of a gene depends on whether it is inherited from the mother or the father is called imprinting and is due to differential methylation of specific cytosine bases on the maternal versus the paternal genes. Recent genome-wide analysis of genomic imprinting in the mouse identified 1300 loci that exhibit parental bias in the expression of specific mRNA transcripts. The gene loci identified control neural systems associated with feeding and behavior. In addition, the authors in a separate article showed preferential selection of the X chromosome inherited from the mother as opposed to the one from the father in glutamatergic neurons of the female cortex. The interleukin-18 gene was identified as an important locus controlling sex-specific preferences.

Histone Modifications

The basic repeating unit of chromatin is the nucleosome. Each nucleosome is composed of 147 bp of DNA wrapped nearly twice around a histone protein octamer consisting of two molecules of each of the four core histones (H2A, H2B, H3, and H4). The linker histone H1 sits alone between core nucleosomes, facilitating further compaction. Each histone contains a structured globular domain with a histone-fold motif important for nucleosome assembly, and a highly charged, unstructured amino-terminal tail of 25–40 residues, which protrudes from the body of the nucleosome to latch onto the phosphate backbone. The amino-termini are the major sites for histone modifications. Histones can be modified by acetylation, methylation, phosphorylation, ADP-ribosylation, ubiquitination, and sumoylation (Table 1.1). The mixture of these covalent modifications creates a “code” on the surface of the histone molecule that is subsequently recognized by a class of chromatin-binding proteins, for example, bromo- and chromodomain-containing proteins that mediate chromatin compaction, transcription, and DNA repair. Acetylation, methylation, ubiquitination, and sumoylation occur on lysine residues, while methylation also occurs on arginine residues. Phosphorylation occurs on serines and threonines, and ADP-ribosylation on glutamic acids. Most of these modifications, particularly acetylation, alter the charge distribution on the amino-terminus and thereby alter nucleosome structure, which can in turn regulate chromatin structure. Some covalent modifications act as molecular switches, enabling or disabling subsequent covalent modifications, which explains the functional complexity of epigenetic modifications. Each modification correlates with a specific physical status of chromatin. The next several sections will highlight the most common histone modifications.

Table 1.1
Enzymes, Targets, and Effect of Epigenetic Modifications
| Target | Covalently Modified Group | Adds | Removes | Effect on Gene Expression (a) | Enzyme Inhibitors |
| DNA | Methyl | DNMT | Gadd45 | ↑ increases or ↓ decreases | Azacytidine; RG-108 |
| Histone | Acetyl | (KAT)/HAT | HDACs | ↑ increases or ↓ decreases | Butyrate, SAHA, trichostatin A, valproic acid |
| Add to lysines (K) | Methyl | KMT (SETs, PCG1, 2, TrG) | KDM Jumonji (JmjC, Jarid) | ↑ if H3K4me3; H3K36me3; H3K79me3. ↓ if H3K9me2,3; H3K27me3 | BIX-01294 |
| Add to arginines (R) | Methyl | PRMTs (CARM1, PRMT1) | PADI4 | ? | |
| Add to S10H3 | Phosphate | AurB | PP1 | | |
| Add to lysines (K) | Ubiquitin, 76 aa peptide | Ub ligases (Ring 2) | Ub protease (USP) | | |
| Add to lysines (K) | Sumo = small ubiquitin-like modifiers, ~76 aa | Ubc9 | Ub protease (SUSP) | | |

(a) “?” means unknown.
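
The lysine methylation row of Table 1.1 amounts to a small lookup from histone mark to expected transcriptional effect. The sketch below encodes only the marks listed in that row. The names are hypothetical, and any mark outside the table is reported as unknown rather than guessed.

```python
# Marks taken from the lysine-methylation row of Table 1.1.
ACTIVATING_MARKS = {"H3K4me3", "H3K36me3", "H3K79me3"}
REPRESSIVE_MARKS = {"H3K9me2", "H3K9me3", "H3K27me3"}


def expected_effect(mark):
    """Return the effect on gene expression that Table 1.1 associates with a mark."""
    if mark in ACTIVATING_MARKS:
        return "increases"
    if mark in REPRESSIVE_MARKS:
        return "decreases"
    return "unknown"  # not covered by the table


if __name__ == "__main__":
    for mark in ("H3K4me3", "H3K27me3", "H3K14ac"):
        print(mark, expected_effect(mark))
```

A flat lookup like this ignores the combinatorial nature of the histone code described above, in which one modification can enable or disable another.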

Histone Acetylation

Acetylation of histones occurs at the ε-amino side group of specific lysines within the N-termini of histones. HATs transfer an acetyl group from the donor acetyl-CoA to the histone terminal lysines. In hypoacetylated chromatin, the positive charges on unacetylated lysines are attracted to the negatively charged DNA, producing compact, closed chromatin, which represses transcription. By contrast, acetylation of the lysines removes their positive charges, resulting in a less compact, open chromatin structure, which facilitates gene transcription. Therefore, HAT activity and subsequently histone acetylation are linked mainly to transcriptional activation (Fig. 1.7). Removal of the acetyl group (deacetylation) by HDACs restores the positive charge on lysines, so that chromatin becomes compacted and less accessible to the regulatory proteins required for transcription. Thus, HDACs and deacetylation are primarily associated with transcriptional repression (Fig. 1.7).
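
The charge argument can be made concrete with a rough calculation. The sketch below counts +1 for each unacetylated lysine and each arginine and -1 for each aspartate or glutamate, and treats an acetylated lysine as neutral. This is a deliberately crude model that ignores histidine, the chain termini, and pH. The 40-residue sequence is an assumption, entered as the commonly reported N-terminal tail of human histone H3, and should be checked against a sequence database before reuse. The function name is hypothetical.

```python
# Assumed sequence: commonly reported first 40 residues of human histone H3
# (initiator methionine removed); verify against a database before reuse.
H3_TAIL = "ARTKQTARKSTGGKAPRKQLATKAARKSAPATGGVKKPHR"


def tail_charge(tail, acetylated=()):
    """Approximate net charge of a histone tail.

    acetylated holds 1-based positions of acetylated lysines, which are
    counted as neutral. Raises ValueError if a position is not a lysine.
    """
    acetylated = set(acetylated)
    for pos in acetylated:
        if not 1 <= pos <= len(tail) or tail[pos - 1] != "K":
            raise ValueError(f"position {pos} is not a lysine")
    charge = 0
    for pos, residue in enumerate(tail, start=1):
        if residue == "K" and pos not in acetylated:
            charge += 1
        elif residue == "R":
            charge += 1
        elif residue in ("D", "E"):
            charge -= 1
    return charge


if __name__ == "__main__":
    print(tail_charge(H3_TAIL))                             # unacetylated tail
    print(tail_charge(H3_TAIL, acetylated=[9, 14, 18, 27]))  # four lysines acetylated
```

Under these assumptions each acetylated lysine lowers the net positive charge of the tail by one unit, which is the basis for the weaker attraction to the negatively charged DNA backbone.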

The HATs are divided into five families. These include the p300/CBP HATs (p300 and CBP), the Gcn5-related acetyltransferases (GNATs, including Gcn5, PCAF, etc.), the MYST (MOZ, Ybf2/Sas3, Sas2, and Tip60)-related HATs, the general transcription factor HATs (TFIID subunit TAF250 and TFIIIC), and the nuclear hormone-related HATs (SRC1 and ACTR). The most consistent functional characteristic of HATs is that they are transcriptional coactivators. These proteins are components of large multisubunit complexes that do not bind DNA directly, but instead form protein-protein interactions with DNA-binding transcription factors. The MYST proteins are the largest family of acetyltransferases. More recently, the Gcn5-related acetyltransferases have come to be considered part of a complex called SAGA, for Spt-Ada-Gcn5-acetyltransferase. SAGA preferentially acetylates several N-terminal lysines within H3 and H2B in response to cellular stress, for example, low glucose, hypoxia, and UV damage. Moreover, in addition to its HAT activity, SAGA also has deubiquitinase activity. In summary, the themes that are consistently emerging are, first, that these histone-modifying enzymes are components of large complexes and, second, that for every enzymatic complex that adds an organic residue to histones, there is a complementary enzymatic complex that can remove it (Table 1.1).

The more numerous mammalian HDACs have been grouped into four protein classes. Class I includes HDACs 1, 2, 3, and 8; class II is subdivided into class IIA (HDACs 4, 5, 7, and 9) and class IIB (HDACs 6 and 10); and class IV comprises HDAC 11. HDACs 1–11 are zinc dependent. The class III HDAC family consists of the conserved nicotinamide adenine dinucleotide (NAD)-dependent Sir2 family of deacetylases, or sirtuins, of which there are seven. The sirtuins are not zinc dependent. Like HATs, HDACs do not bind directly to DNA, but are recruited to genes by large multisubunit complexes to function primarily as corepressors of transcription.

The function of HATs and HDACs is of particular relevance in the GI tract due to the effect of butyrate, a by-product of colonic bacterial fermentation, on histone acetylation ( Fig. 1.7 ). Epidemiologic studies uniformly concur that a diet high in fiber is protective against colon cancer. The short-chain fatty acid butyrate is one of several fiber-derived fermentation products capable of maintaining epithelial cell differentiation. The differentiation effects were initially revealed after treatment of erythroleukemic cells with butyrate. Subsequently, it was discovered that the induction of differentiation by butyrate correlated with histone hyperacetylation due to suppression of HDACs. Thus, the HDAC inhibitory effects of butyrate and resulting histone hyperacetylation might, in fact, be one mechanism by which dietary fiber exerts its anticancer effects. While butyrate is normally used by colonocytes as a carbon source under low glucose conditions, colon cancers use the Warburg effect when glucose is in abundance to generate ATP via glycolysis. The butyrate that is not converted by fatty acid oxidation in the mitochondria to produce ATP is taken up by the nucleus where it suppresses HDACs. Thus, the HDAC inhibitory effect of butyrate depends on the metabolic state of the cell.

Most reviews support the viewpoint that butyrate and HDAC inhibitors are potent anticancer agents. Collectively, early studies emphasized the global effects of butyrate on chromatin remodeling, but the molecular basis for the gene-specific effects of butyrate remains poorly defined. HDAC inhibitors regulate less than 10% of actively transcribed genes. Most of those are upregulated through GC-rich sites. In addition to histone acetylation, it is now known that DNA-binding proteins can become acetylated. Thus, a possible mechanism by which hyperacetylation induced by butyrate might target specific genes is through acetylation of specific transcription factors. The proposed function of acetylated transcription factors varies and includes increased or decreased DNA binding as well as altered protein stability. In many instances, the genetic targets of butyrate are GC-rich sequences that bind Sp1 and Sp3. Gamma glutamyl transferase, IGF-binding protein 3, G alpha (i2), galectin, Cox 1, and intestinal alkaline phosphatase are all upregulated by butyrate through Sp1 sites. Sp1-binding sites are also implicated in the butyrate induction of p21WAF1 gene expression. HAT p300 recruited to the p21WAF1 promoter cooperates with Sp1 and Sp3 to mediate the effects of butyrate. However, Sp1 does not cooperate directly with p300, but instead binds the histone deacetylase HDAC1. The Sp1-HDAC1 complex in turn forms complexes with other corepressors such as Sin3A. Thus, Sp1 appears to be the factor that confers p21WAF1 promoter repression by recruiting HDAC1 and corepressor complexes.

HDACs have opposing functions especially in cancer. On the one hand, HDACs can prevent the activation of tumor suppressor genes and block the ability of a cancer cell to undergo apoptosis. However, HDAC2 silencing triggers apoptosis. Another important feature of HDACs is their interaction with DNA methylation. HDACs cooperate with DNMTs by removing the acetyl groups blocking methylation targets on histones or DNA.

Histone Methylation

There are two types of histone methylation, targeting either lysine or arginine residues. Histone methyltransferases (HMTs) perform these modifications utilizing S-adenosylmethionine as the methyl group donor. Lysine methylation is implicated in changes in chromatin structure and gene regulation, whereas arginine methylation, like acetylation, correlates with the active state of transcription.
