This chapter will provide a basic introduction to the human genome and some of the tools used to analyse it. Genomics and molecular biology have developed rapidly during the last few decades, and this chapter will highlight some of these advances, in particular with respect to the impact on our knowledge of the structure and function of the genome. The basic science described in this chapter is fundamental to the understanding of the field of clinical genetics, which is described in the following chapter.
Inheritance is determined by genes, carried on chromosomes in the nuclei of all nucleated cells. Each somatic cell contains 46 chromosomes, which exist as 23 pairs, one member of each pair having been inherited from each parent. Twenty-two pairs are homologous and are called autosomes. The 23rd pair consists of the sex chromosomes, X and Y in the male, X and X in the female.
Each somatic cell in the body contains 22 pairs of autosomes plus the two sex chromosomes, for a total of 46, known as the diploid number (symbol 2N). Chromosomes are numbered sequentially with the largest first, with the X being almost as large as chromosome 1 and the Y chromosome being the smallest. This means that each cell (except gametes) has two copies of each piece of genetic information. In females, where there are two X chromosomes, one copy is silent (inactive), i.e. genes on that chromosome are not being transcribed (see later).
Each individual inherits one chromosome of each pair from the mother and one from the father following fertilisation of the haploid egg (containing one of each autosome and one X chromosome) by the haploid sperm (containing one of each autosome and either an X or a Y chromosome). The sex of the individual is therefore dependent on the sex chromosome in the sperm: an X will lead to a female (with the X chromosome from the egg) and a Y chromosome will lead to a male (with an X from the egg).
Chromosomes are classified by their shape. During metaphase in cell division, chromosomes are condensed and have a distinct recognisable ‘H’ shape, with two chromatids joined by an area of constriction called the centromere. For ‘metacentric’ chromosomes the centromere is close to the middle of the chromosome, and for ‘acrocentric’ chromosomes it is near the end of the chromosome. The area or ‘arm’ of the chromosome above the centromere is known as the ‘p arm’, and the area below is the ‘q arm’. For acrocentric chromosomes, the p arm is very small, consisting of tiny structures called ‘satellites’. Within the two arms, regions are numbered from the centromere outwards to give a specific ‘address’ for each chromosome region (Fig. 1.1). The ends of the chromosomes are called telomeres. Chromosomes take on the characteristic ‘H’ shape only during metaphase, when the DNA has been replicated ahead of division (hence the two chromatids).
Chromosomes are recognised by their banding patterns following staining with various compounds in the cytogenetic laboratory. The most commonly used stain is the Giemsa stain (G-banding), which gives a characteristic pattern of dark and light bands for each chromosome.
In the cell, the DNA of each chromosome is folded many hundreds of times around histone proteins, and chromosomes are usually only visible under a microscope during mitosis and meiosis. DNA is composed of a deoxyribose-phosphate backbone, the 3-position (3′) of each deoxyribose being linked to the 5-position (5′) of the next by a phosphodiester bond. At the 1-position (1′) each deoxyribose is linked to one of four bases, the purines (adenine or guanine) or the pyrimidines (thymine or cytosine). Each DNA molecule is made up of two such strands in a double helix with the bases on the inside. This is the famous double helix structure that was first proposed by James Watson and Francis Crick in 1953, based upon the x-ray diffraction work of Rosalind Franklin and colleagues. The bases pair by hydrogen bonding, adenine (A) with thymine (T), and cytosine (C) with guanine (G). DNA is replicated by separation of the two strands and synthesis by DNA polymerases of new complementary strands. DNA polymerases always add new nucleotides at the 3′ end of the growing strand; the reverse transcriptase produced by retroviruses is unusual in that it synthesises DNA from an RNA template. RNA has a structure similar to that of DNA but is single stranded. The backbone contains ribose rather than deoxyribose, and uracil (U) is used in place of thymine (Fig. 1.2).
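The pairing rules above (A with T, C with G) are enough to derive one strand of the double helix from the other. The following is a minimal sketch in Python; the function names and the example sequence are hypothetical and chosen only for illustration.

```python
# Watson-Crick pairing rules: A pairs with T, C pairs with G.
PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand: str) -> str:
    """Return the base-by-base partner of `strand`, in the same left-to-right order.

    If `strand` is written 5'->3', the result is the partner strand read 3'->5'.
    """
    return "".join(PAIRS[base] for base in strand.upper())


def reverse_complement(strand: str) -> str:
    """Return the partner strand written in the conventional 5'->3' direction."""
    return complement(strand)[::-1]


if __name__ == "__main__":
    top = "ATGGCC"  # hypothetical sequence, written 5'->3'
    print(complement(top))          # partner strand, read 3'->5'
    print(reverse_complement(top))  # partner strand, read 5'->3'
```

Because the two strands run in opposite directions, the partner strand is usually reported as the reverse complement, so that both sequences are written 5′ to 3′.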
DNA is organised into discrete functional units known as genes. Genes contain the information for the assembly of every protein in an organism via the translation of the DNA code into a chain of amino acids to form proteins. Each amino acid is encoded by a group of three bases, or letters, known as a codon. With four letters and three positions in each ‘word’, there are 64 possible codons, but in fact only 20 amino acids are coded for (Table 1.1). Most amino acids are therefore specified by more than one codon, and the third base of a codon is often not crucial to determining the amino acid, a phenomenon known as wobble.
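1st Position | 2nd Position: T | 2nd Position: C | 2nd Position: A | 2nd Position: G | 3rd Position
---|---|---|---|---|---
T | Phe | Ser | Tyr | Cys | T
T | Phe | Ser | Tyr | Cys | C
T | Leu | Ser | Stop | Stop | A
T | Leu | Ser | Stop | Trp | G
C | Leu | Pro | His | Arg | T
C | Leu | Pro | His | Arg | C
C | Leu | Pro | Gln | Arg | A
C | Leu | Pro | Gln | Arg | G
A | Ile | Thr | Asn | Ser | T
A | Ile | Thr | Asn | Ser | C
A | Ile | Thr | Lys | Arg | A
A | Met | Thr | Lys | Arg | G
G | Val | Ala | Asp | Gly | T
G | Val | Ala | Asp | Gly | C
G | Val | Ala | Glu | Gly | A
G | Val | Ala | Glu | Gly | G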
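The arithmetic behind the genetic code (4 × 4 × 4 = 64 codons for 20 amino acids plus stop signals) can be checked with a short sketch. The Python below builds the standard genetic code from a compact 64-letter string of one-letter amino acid codes (`*` marks a stop codon) and counts how many codons specify each amino acid. The names `CODON_TABLE`, `translate` and `DEGENERACY` are hypothetical helpers for this illustration.

```python
from collections import Counter
from itertools import product

# Bases in the row/column order T, C, A, G.
BASES = "TCAG"
# Standard genetic code, one letter per codon, ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map each of the 64 DNA codons (sense strand) to its amino acid.
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Number of codons that specify each amino acid (or stop).
DEGENERACY = Counter(CODON_TABLE.values())


def translate(dna: str) -> str:
    """Translate a sense-strand DNA sequence, codon by codon, until a stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3].upper()]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    print(len(CODON_TABLE))        # total number of codons
    print(DEGENERACY)              # codons per amino acid
    print(translate("ATGGCCTAA"))  # hypothetical three-codon sequence
```

Wobble shows up directly in the table: all four codons beginning with GG, for example, specify glycine, so the third base makes no difference for that amino acid.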
A diagram of a typical gene structure is shown in Fig. 1.3. Each gene gives rise to a messenger RNA (mRNA), which can be interpreted by the cellular machinery to make the protein that the gene encodes.
Genes are split into exons, which contain the coding information, and introns, which are between the coding regions and may contain regulatory sequences that control when and where a gene is expressed. Promoters (which control basal and inducible activity) are usually upstream of the gene, whereas enhancers (which usually regulate inducible activity only) can be found throughout the genomic sequence of a gene. The two base pair sequences at the boundary of introns and exons (the splice acceptor and donor sites), identical in more than 99% of genes, are known as the splice junction (see Fig. 1.3); they signal cellular splicing machinery to cut and paste exonic sequences together at this point. The first residue of each protein is almost always methionine, encoded by the start codon ATG.
Recent estimates based on the genome sequence put the number of genes at fewer than 23,000, a considerable reduction from earlier estimates. This means that the vast majority of human DNA does not contain a coding sequence (i.e. exons) but consists instead of intronic sequence, structural motifs and regulatory regions such as promoters and enhancers. This is distinct from lower organisms (e.g. bacteria), where more than 95% of the DNA is a coding sequence. Exactly why so much noncoding DNA is present remains somewhat enigmatic but is believed to be linked to the complex layers of gene regulation through interacting regulatory regions. The other key implication of this finding is that the huge complexity of humans compared with other organisms with similar numbers of genes must arise from more subtle regulation of gene expression, rather than greater numbers of different genes.
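The exon and intron layout described above can be sketched in a few lines of Python. The example assumes the canonical splice junction, in which an intron begins with GT (donor site) and ends with AG (acceptor site) on the sense strand; the toy gene, its exon coordinates and the function names are all hypothetical.

```python
def introns_between(gene: str, exons: list[tuple[int, int]]) -> list[str]:
    """Return the intron sequences lying between consecutive exons.

    `exons` holds 0-based, half-open (start, end) coordinates in gene order.
    """
    return [gene[exons[i][1]:exons[i + 1][0]] for i in range(len(exons) - 1)]


def is_canonical_intron(intron: str) -> bool:
    """Check for the canonical GT...AG dinucleotides at the intron boundaries."""
    intron = intron.upper()
    return len(intron) >= 4 and intron.startswith("GT") and intron.endswith("AG")


def splice(gene: str, exons: list[tuple[int, int]]) -> str:
    """Join the exons in order, discarding everything between them."""
    return "".join(gene[start:end] for start, end in exons)


if __name__ == "__main__":
    # Hypothetical toy gene: exon 1 (ATG), one intron, exon 2 (GCCTAA).
    gene = "ATG" + "GTAAGTCAG" + "GCCTAA"
    exons = [(0, 3), (12, 18)]
    print(introns_between(gene, exons))
    print(splice(gene, exons))
```

Real introns are typically hundreds to thousands of bases long, and splice site recognition depends on more sequence context than the two boundary dinucleotides checked here.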
The central dogma of molecular biology concerns the information flow pathway in cells and can be simply summarised as: ‘DNA makes RNA makes protein, which in turn can facilitate the two prior steps’. These steps are now explained in more detail.
‘Transcription’ is the process by which the information encoded in DNA is transferred into a strand of mRNA. During transcription the RNA polymerase, which constructs the mRNA, reads from the DNA strand complementary to the RNA molecule. This is known as the antisense strand, while the opposite strand, which has the same base sequence as the RNA molecule (with thymine (T) in place of uracil (U), as mentioned previously), is the sense strand. Gene sequences are conventionally written as the sequence of the sense strand of DNA, although it is in fact the antisense strand which is read (Fig. 1.4). The vast majority of genes begin with a 5′ untranslated region (UTR) containing response elements to which proteins may bind that influence transcription. The 5′ regions of genes are frequently characterised by elements such as the TATA and CAAT boxes (see Fig. 1.3) and are often richer in GC pairs than elsewhere in the genome. This is frequently the case around the 5′ ends of ‘housekeeping’ genes that are constitutively expressed in the majority of tissues. There then follows the transcribed sequence. The expressed coding parts of the gene are known as the exons, while the noncoding intervening sequences that often interrupt them are known as introns, although numerous examples of single exon genes exist. Initially, the RNA transcript contains both introns and exons and is known as heterogeneous nuclear RNA (hnRNA). The introns are then spliced out precisely (as marked by the splice boundary sequences) and a protective cap is added to the 5′ end before the now mature mRNA exits the nucleus. Hence cytoplasmic mRNA consists only of coding regions flanked by UTRs at the two ends. A polyadenylate (poly-A) tail is added to most mRNA molecules at their 3′ end, directed by the polyadenylation signal found downstream of the stop codon. This tail serves to protect the RNA from degradation prior to translation by the ribosome (see later).
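The relationship between the sense strand, the antisense (template) strand and the mRNA can be made concrete with a short sketch. In the Python below, the template is supplied as read by the polymerase (3′ to 5′), and the function names and example sequence are hypothetical.

```python
# RNA base added opposite each DNA template base (A in the template pairs with U).
TEMPLATE_TO_RNA = {"A": "U", "T": "A", "C": "G", "G": "C"}


def transcribe_from_template(antisense_3_to_5: str) -> str:
    """Build the mRNA (5'->3') by pairing against the template read 3'->5'."""
    return "".join(TEMPLATE_TO_RNA[base] for base in antisense_3_to_5.upper())


def sense_to_mrna(sense_5_to_3: str) -> str:
    """Shortcut: the mRNA matches the sense strand with U in place of T."""
    return sense_5_to_3.upper().replace("T", "U")


if __name__ == "__main__":
    sense = "ATGGCC"      # hypothetical sense strand, 5'->3'
    antisense = "TACCGG"  # its partner strand, written 3'->5'
    print(transcribe_from_template(antisense))
    print(sense_to_mrna(sense))
```

Both routes give the same mRNA, which is why gene sequences can be quoted as the sense strand even though the polymerase reads the antisense strand.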
The term ‘translation’ describes the process whereby the cellular machinery reads the mRNA code and creates a chain of amino acids (i.e. a polypeptide, or protein). Once in the cytoplasm, the mRNA message is translated into protein by a ribosome. Ribosomes, consisting of a complex bundle of proteins and ribosomal RNA, attach to mRNA at the 5′ end. Protein synthesis begins at the amino terminal, and amino acids are sequentially added at the freshly made carboxyl end. Amino acids are brought into the reaction by specific transfer RNA (tRNA) molecules. Each tRNA is a single-stranded molecule which folds in a way that allows complementary base pairing between parts of the same strand. The specific configuration allows the tRNA molecule to bind to its specific amino acid. Three bases in one loop of the molecule remain unpaired and are complementary to the codon for that amino acid. This anticodon binds to the codon of the mRNA and places the amino acid in the correct sequence of the protein (see Fig. 1.4). Usually, several ribosomes translate a single mRNA molecule at any one time.
‘Replication’ is the process whereby DNA is copied to permit transmission of genetic information to daughter cells and, ultimately, to offspring. DNA replication is performed prior to cell division, when an identical copy must be made for each daughter cell resulting from division. Replication occurs before mitosis, the normal form of cellular division where resulting cells have identical DNA to the original. Meiosis, the second form of cellular division, occurs during gametogenesis and results in haploid cells (gametes), which carry only a single copy of the genomic sequence (i.e. half the usual complement of DNA).
It is important to note that since this dogma was first established in 1958 by Crick, a number of exceptions have been identified. For example, retroviruses (e.g. human immunodeficiency virus (HIV)-1) cause information to flow from RNA to DNA: their genome, carried as RNA, is copied into DNA by reverse transcriptase and then integrated into that of the host. A second example is ribozymes, which are functional enzymes composed solely of RNA and hence have no need to be translated into protein.
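The codon and anticodon relationship described above can be illustrated with a minimal Python sketch. It assumes strict Watson-Crick pairing at all three positions, which ignores the wobble pairing that real tRNAs allow at the third codon position; the function names and the example mRNA are hypothetical.

```python
# Base pairing between two RNA strands.
RNA_PAIRS = {"A": "U", "U": "A", "C": "G", "G": "C"}


def split_codons(mrna: str) -> list[str]:
    """Split an mRNA (5'->3') into consecutive complete codons from the first base."""
    mrna = mrna.upper()
    return [mrna[i:i + 3] for i in range(0, len(mrna) - 2, 3)]


def anticodon(codon: str) -> str:
    """Return the strictly complementary anticodon, written 5'->3'.

    The anticodon pairs antiparallel to the codon, so the codon is reversed
    before each base is complemented.
    """
    return "".join(RNA_PAIRS[base] for base in reversed(codon.upper()))


if __name__ == "__main__":
    mrna = "AUGGCCUAA"  # hypothetical message: start codon, one codon, stop codon
    for codon in split_codons(mrna):
        print(codon, anticodon(codon))
```

The stop codon is included in the loop only for illustration; in the cell, stop codons are recognised by protein release factors rather than by a tRNA.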
When a gene is actively being transcribed into mRNA and then translated into a protein, it is said to be ‘expressed’. Gene expression can be controlled at several levels. Transcription of DNA into mRNA is generally regulated by the binding of specific proteins, known as transcription factors, to the region of DNA just upstream, or 5′, of the coding sequence itself. Other proteins can bind enhancer sequences that may be within the gene or a long way upstream or downstream.
The promoter contains specific DNA sequence motifs which bind transcription factors. In general, transcription factors become active when the cells receive some form of signal and then translocate to the nucleus, where they bind to specific sequences in the promoters of specific genes and activate transcription. Other genes, often known as housekeeping genes, have a constant level of expression and are not induced in this way.
Many different types of transcription factor exist with different modes of action. Typical examples of two types will be considered here, namely intracellular nuclear hormone receptors (which are transcription factors) and cell surface receptors, which are capable of activating transcription factors.
Members of the nuclear hormone receptor superfamily, such as the progesterone receptor and the thyroid hormone receptor, are present mainly in the cytoplasm of the cell. When a steroid hormone crosses the lipid bilayer of the cell membrane, it binds to the receptor which is usually dimerised to form pairs of receptor molecules. The receptor/hormone dimer complex then translocates to the nucleus and binds to response elements in the promoters of target genes, where it activates (or indeed represses) transcription. This process also involves the recruitment of many other cofactors to the dimer complex which are also involved in regulation of the expression of the target gene.
Cell surface receptors, subsequent to binding of ligands, can activate pathways leading to the formation of active transcription factors. For example, activation of tyrosine kinase–linked receptors on the cell surface may lead to a series of phosphorylation events within the cell, culminating in the phosphorylation of the protein Jun. Jun will then combine with the protein Fos to form a dimer transcription factor called AP-1, which can bind to specific AP-1 binding sites in the promoters of responsive genes.
In another example of cell surface receptor action, the ‘inflammatory’ transcription factor NF-κB exists in the cytoplasm of cells as dimers bound to an inhibitory protein IκB. Mediators of inflammation, such as the inflammatory cytokine interleukin-1β, bind to cell surface receptors and activate a chain of biochemical events that result in the phosphorylation and subsequent breakdown of IκB. Uninhibited NF-κB dimers then translocate to the nucleus to activate genes whose promoters contain NF-κB DNA-binding motifs.
Gene expression can also be controlled by regulation of the stability of the transcript. Most mRNA molecules are protected from degradation by the presence of their poly-A tail. Degradation of mRNA is controlled by specific destabilising elements within the sequence of the molecule. One type of destabilising element has been well characterised: the Shaw–Kamen or AU-rich element (ARE), a region of RNA, usually within the 3′ UTR, in which the motif AUUUA is repeated several times. Rapid response genes, whose expression is rapidly switched on and then off again in response to some signal, often contain an ARE within their 3′ UTR. Binding of specific proteins to the ARE leads to removal of the mRNA’s poly-A tail and then to degradation of the molecule.
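Scanning a 3′ UTR for repeats of the AUUUA motif is a simple string search. The Python sketch below reports the position of every occurrence, including overlapping ones (adjacent repeats such as AUUUAUUUA share an A); the function name and the example sequences are hypothetical, and the presence of the motif alone does not establish that a sequence is a functional ARE.

```python
import re

# Zero-width lookahead so that overlapping copies of the motif are all found.
ARE_MOTIF = re.compile(r"(?=AUUUA)")


def are_motif_positions(utr: str) -> list[int]:
    """Return the 0-based start position of every AUUUA in an RNA sequence."""
    return [match.start() for match in ARE_MOTIF.finditer(utr.upper())]


if __name__ == "__main__":
    utr = "GGCAUUUAUUUAUUUACC"  # hypothetical 3' UTR fragment
    positions = are_motif_positions(utr)
    print(positions)
    print(len(positions))       # number of AUUUA repeats found
```

A plain `str.count("AUUUA")` would undercount tandem repeats because it skips overlapping matches, which is why the lookahead pattern is used here.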
The field of epigenetics is concerned with modifications of DNA and chromatin that do not affect the underlying DNA sequence. In recent years, the importance of these modifications has come to light, and this is now a very active area of research.