Genetics and Pediatric Rheumatic Diseases


Introduction

Genetics is the study of the relationships between genetic variation (genotypes) and heritable traits (phenotypes). When these investigations are expanded to explore genotype–phenotype relationships across the entire genome, they are called genomics. Together, genetics and genomics investigations are an important starting point for the study of human disease, seeking to discover genetic variants that are relevant to disease pathophysiology, prognosis, and response to treatment. Genetics and genomics are directly applicable to pediatric rheumatic disease research, where the pathophysiologic mechanisms of virtually every disease and condition are either incompletely understood or completely unknown.

Viewed through the lens of genetics, human diseases can be divided into two categories: diseases for which genetic mutations are causative (monogenic diseases) and diseases for which genetic variants influence susceptibility (genetically complex diseases). Monogenic diseases are caused by high-impact genetic variants or mutation(s) of a single gene, and they are usually transmitted in families in a mendelian fashion, meaning autosomal dominant, autosomal recessive, or sex-linked inheritance. Examples of monogenic diseases in the pediatric rheumatology clinic include the rapidly growing family of primary autoinflammatory syndromes, as well as a variety of metabolic and collagen-related disorders that also manifest as musculoskeletal abnormalities. In many cases, known monogenic diseases can be molecularly diagnosed with clinical sequencing. Moreover, recognition by astute clinicians of patients and families likely to have mutations is essential to advancing the process of genomic investigation and discovery.

In contrast to monogenic diseases, polygenic or genetically complex diseases are those that are influenced by the interaction of multiple genetic and environmental factors. Most pediatric rheumatic diseases are genetically complex, including childhood systemic lupus erythematosus, juvenile dermatomyositis, and each of the seven subtypes of juvenile idiopathic arthritis. Unlike studies of monogenic diseases, which are typically undertaken in families, studies exploring genetic factors that influence genetically complex diseases are usually performed in populations, where both the size and ancestral composition of the populations are critically important.

The purpose of this chapter is to act as a primer to genetic investigations through the lens of pediatric rheumatology, as opposed to a content-based review of specific diseases, which can be found in the disease-specific chapters. Here, we discuss how genetics and genomics approaches can be applied to pediatric rheumatic diseases while providing examples of how technological advances are already facilitating the investigation of pediatric rheumatic diseases.

Genes, Variation, and the Human Genome Sequence in Biomedical Research

Gene Structure, Function, and the Central Dogma of Molecular Biology

The human genome is the essential blueprint for the assembly of cells, tissues, and organs into a human being. The initial drafts of the human genome sequence were reported in 2001, and the first human reference genome assembly was published in 2004. The human reference genome assembly has undergone numerous rounds of revision and improvement, and today it is in its 38th version (Genome Reference Consortium human build 38, or GRCh38).

The human genome is composed of 3.2 billion deoxyribonucleic acid (DNA) base pairs that are organized into 23 pairs of chromosomes. Interspersed across these chromosomes are over 44,000 genes, the functional units of heredity whose specific sequences encode the cellular machinery of life. Genes contain a combination of exons, which are gene segments that encode protein sequences, and introns, which do not encode protein sequences. In most cases, genes are composed of a series of alternating exons and introns that begin and end with an exon, and their boundaries are defined by the first nucleotide of exon 1 and the final nucleotide of the last exon. Despite this apparently stereotypical structure of genes, the number, size, and distribution of exons and introns within genes, as well as the overall size of individual genes, are all highly variable.

According to the central dogma of molecular biology, the primary function of genes is to generate proteins (Fig. 2.1). A gene’s DNA sequence acts as the template for transcription, which produces a ribonucleic acid (RNA) copy of the entire gene, known as the primary transcript. The primary RNA transcript is subsequently processed and spliced to remove the introns, leaving only the exons in the mature RNA transcript or messenger RNA (mRNA) of the gene. Adding an additional layer of complexity to the system, alternative splicing of primary RNA transcripts can generate more than one distinct mRNA transcript from a given gene, and alternative transcripts have been identified for 95% of genes. As a result of alternative splicing, the human genome can produce a diverse mRNA repertoire of over 323,000 mRNA transcripts, which represents nearly eight times more mature transcripts than unique genes. These mature transcripts act as templates for translation, the process through which the nucleic acid sequence is translated into the amino acid code. Each amino acid is encoded as a trinucleotide sequence or codon that is uniquely recognized by a transfer RNA (tRNA) molecule. In addition to their specificity for a unique codon sequence, each tRNA molecule also binds to a unique amino acid. The tRNAs employ these dual binding specificities to facilitate the sequence-specific assembly of amino acids into proteins. Translation is initiated and terminated at specific codons. Beginning with the initiation sequence (start codon), amino acids are sequentially added to the chain according to the codon sequence of the mature transcript. When the translational machinery encounters a termination sequence (stop codon), synthesis of the protein is complete and it is released.

Fig. 2.1. Central dogma of molecular biology. The central dogma of molecular biology holds that the human genome is composed of deoxyribonucleic acid (DNA) that encodes genes. The DNA is copied into ribonucleic acid (RNA) transcripts through a process called transcription, and the RNA transcripts are then processed and spliced. Finally, mature or spliced transcripts act as a template for protein synthesis or translation.

Within mature transcripts, there are segments of exons that precede the initiation codon or follow the termination codon. These segments of exons are not translated into protein and are referred to as the untranslated regions (UTRs). UTRs that precede the start codon are called the 5′ UTR or leader sequence, while those that follow the stop codon are called the 3′ UTR or trailer sequence. Collectively, the UTRs are essential regulatory regions that govern the initiation and termination of translation in a sequence-specific manner.
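The flow from gene to protein described above can be illustrated with a short sketch. The gene sequence, exon coordinates, and function names below are invented for illustration, and the codon table is deliberately partial, so this is a toy model under those assumptions rather than a general-purpose implementation.

```python
# Toy model of transcription, splicing, and translation.
# The sequences and exon coordinates are hypothetical.

# Partial codon table covering only the codons used in this example.
CODON_TABLE = {
    "AUG": "M",  # start codon, also encodes methionine
    "GCU": "A",
    "AAA": "K",
    "UGG": "W",
    "UAA": "*",  # stop codon
}


def transcribe(dna_coding_strand):
    """Return the primary transcript: the coding strand with T replaced by U."""
    return dna_coding_strand.replace("T", "U")


def splice(primary_transcript, exons):
    """Join exon segments (0-based, end-exclusive) and discard the introns."""
    return "".join(primary_transcript[start:end] for start, end in exons)


def translate(mrna):
    """Translate from the first AUG to the first in-frame stop codon.

    Returns (protein, five_prime_utr, three_prime_utr).
    """
    start = mrna.find("AUG")
    if start == -1:
        raise ValueError("no start codon found")
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            return "".join(protein), mrna[:start], mrna[i + 3:]
        protein.append(amino_acid)
    raise ValueError("no in-frame stop codon found")


# Hypothetical gene: exon 1, one intron, exon 2.
exon1 = "GGATGGCT"      # 5' UTR (GG) followed by ATG GCT
intron = "GTAAGTTTCAG"  # removed during splicing
exon2 = "AAATGGTAACC"   # AAA TGG TAA followed by 3' UTR (CC)
gene = exon1 + intron + exon2
exons = [(0, len(exon1)), (len(exon1) + len(intron), len(gene))]

mrna = splice(transcribe(gene), exons)
protein, utr5, utr3 = translate(mrna)
```

In this toy example, the portions of the mature transcript before the start codon and after the stop codon are returned separately, corresponding to the 5′ and 3′ UTRs.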

Noncoding RNA

Despite the assumptions of the central dogma, more than half of known genes, accounting for almost one-fifth of the genome, do not encode proteins. Rather, these genes encode a variety of noncoding RNA (ncRNA) molecules that participate in diverse gene regulatory functions, often in a cell-type specific manner. ncRNAs can be subdivided into groups based on their structure and function, with major categories including long ncRNA (lncRNA), microRNA (miRNA), PIWI-binding RNA (piRNA), small nuclear RNA (snRNA), small nucleolar RNA (snoRNA), self-splicing RNA, and telomerase RNA. lncRNAs are pleiotropic regulators of gene structure and function that interact with specific DNA sequences, RNA molecules, and proteins. The genes that encode lncRNAs may be located within introns or in the intergenic regions that separate genes, and they produce RNA transcripts that are generally larger than 200 bases in length. lncRNAs can influence changes in chromatin structure, acting as scaffolds for the recruitment of components of the chromatin-remodeling complex. They can act as transcriptional activators by recruiting transcription factors (TFs) to gene regulatory sites, but in other contexts they can sequester TFs, excluding them from regulatory sites and leading to transcriptional repression. lncRNAs can bind directly to mature transcripts to regulate translation, alternative splicing, and mRNA stability in a sequence-specific manner. miRNAs also participate in posttranscriptional regulation of gene expression through sequence-specific binding of mRNA. Primary miRNA molecules, which are usually greater than 80 bp in length, are sequentially processed to generate the 22-bp active miRNAs. Through direct binding of complementary sequences within the UTRs of mature transcripts, miRNAs can either prevent translation of mature transcripts or facilitate their degradation by targeting them for the RNA-induced silencing complex (RISC). Both piRNAs and snRNAs are classes of short RNA molecules that negatively regulate expression of their gene targets.
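The sequence-specific binding of miRNAs to complementary UTR sequences can be sketched as a simple string search. The example below assumes, as a simplification not stated above, that targeting is driven by a perfect match to the miRNA "seed" (nucleotides 2 to 8); the miRNA and UTR sequences and the function names are invented for illustration.

```python
# Sketch of miRNA target-site search by seed complementarity.
# Assumption: a site is a perfect match to the reverse complement of
# miRNA nucleotides 2-8. All sequences here are hypothetical.

RNA_COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(rna):
    """Return the reverse complement of an RNA sequence."""
    return "".join(RNA_COMPLEMENT[base] for base in reversed(rna))


def seed_sites(mirna, utr):
    """Return 0-based positions in the UTR that pair with the miRNA seed."""
    seed = mirna[1:8]  # nucleotides 2-8 in 1-based numbering
    target = reverse_complement(seed)
    sites = []
    position = utr.find(target)
    while position != -1:
        sites.append(position)
        position = utr.find(target, position + 1)
    return sites


mirna = "ACGUACGGAUCCGAUUAGCUAA"       # invented 22-nucleotide miRNA
utr = "AAUUCCGUACGGGAACCGUACGUU"       # invented 3' UTR with two sites
sites = seed_sites(mirna, utr)
```

Real target prediction also weighs factors such as site context and conservation, which this sketch ignores.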

Gene Regulation

The combination of protein-coding and ncRNA genes is estimated to constitute over 50% of the human genome. However, gene segments that actually encode proteins account for a mere 1.5%, meaning that approximately 98.5% of the human genome is noncoding DNA. This observation led some to posit that noncoding DNA was somehow unimportant or “junk,” an impression that was reinforced by the prevailing protein-centric view of the genome. Despite this provocative hypothesis, it was generally understood that noncoding regions of the genome in close proximity to genes contained regulatory elements that controlled the expression of genes through the recruitment and binding of TFs. For example, promoters are segments of noncoding DNA immediately preceding genes that, upon binding by TFs and RNA polymerase II, facilitate transcriptional initiation. Similarly, enhancers are regions that can engage with TFs and gene promoters to enhance or modify gene expression. However, it was not until the Encyclopedia of DNA Elements (ENCODE) project that the full scope of gene regulation by noncoding DNA became apparent. With a goal of cataloging all noncoding DNA elements in the human genome, ENCODE elucidated specific protein binding sites, chromatin accessibility, and long-range chromatin interactions across the genome in a variety of cell types. It revealed that noncoding regions both proximate (cis) and distant (trans) from a gene can regulate the expression of a gene through their interactions with various regulatory proteins. Strikingly, ENCODE revealed that more than 80% of the genome has an identifiable regulatory function, such as TF binding or acting as a scaffold for histone binding or chromatin remodeling.

Genetic Variation

If the human genome is the blueprint for the assembly of human beings, then genetic variation is the primary source of heritable phenotypic heterogeneity within the species. Genetic variation refers to places in the genome where differences exist between individuals, and the different versions of a particular variant are called alleles. It is estimated that individual human genomes are 99.9% identical, with a typical person’s genome differing from the human reference genome assembly at roughly 4 to 5 million nucleotide positions. Variation in the genome arises from replication errors of mitosis or meiosis that alter the DNA sequence in an individual, and the variant alleles are transmitted to offspring. Variants that produce beneficial effects and confer a survival advantage are favored by natural selection and their frequency will rise in the population over time. On the other hand, the population frequencies of variants that detrimentally affect the fitness of individuals and reduce individual survival would be expected to remain low. On occasions when variants predispose the host to a set of negative consequences or phenotypes, but also afford a context-specific survival advantage, the survival advantage may facilitate expansion of a disease-causing allele into populations. One such example may involve familial Mediterranean fever (FMF), where the carrier frequency of disease-causing MEFV mutations is relatively high in Mediterranean and Middle Eastern countries. It has been demonstrated that FMF-causing variants are protective against certain pathogenic Yersinia species, including Y. pestis, which might explain the greater than expected expansion of FMF-causing mutations in certain populations.

Genetic variants can be classified by their allele frequencies in populations, with common variants defined by allele frequencies greater than 5%, low frequency variants by allele frequencies of 0.5% to 5%, and rare variants by allele frequencies less than 0.5%. The vast majority of genetic variants among human populations are common, indicating that the variants developed many generations ago and have survived the pressure of natural selection as they were transmitted into populations. Rare genetic variants, by contrast, have usually either developed relatively recently or detrimentally alter the fitness of affected individuals.
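The frequency thresholds above translate directly into a small classifier. The sketch below also shows how an allele frequency can be computed from genotype counts at a biallelic site in a diploid population; the function names and the example counts are invented for illustration.

```python
# Classify a variant by allele frequency using the thresholds in the text:
# common > 5%, low frequency 0.5% to 5%, rare < 0.5%.
# Function names and example counts are hypothetical.


def allele_frequency(hom_ref, het, hom_alt):
    """Variant allele frequency from diploid genotype counts."""
    individuals = hom_ref + het + hom_alt
    if individuals == 0:
        raise ValueError("no genotyped individuals")
    # Each heterozygote carries one variant allele, each homozygote two.
    return (het + 2 * hom_alt) / (2 * individuals)


def classify_by_frequency(frequency):
    """Return 'common', 'low frequency', or 'rare' for an allele frequency."""
    if not 0.0 <= frequency <= 1.0:
        raise ValueError("allele frequency must be between 0 and 1")
    if frequency > 0.05:
        return "common"
    if frequency >= 0.005:
        return "low frequency"
    return "rare"


# Example: 100 individuals, 9 heterozygotes and 1 variant homozygote.
frequency = allele_frequency(hom_ref=90, het=9, hom_alt=1)
category = classify_by_frequency(frequency)
```

Note that a frequency estimated from a small sample is imprecise, so in practice such classifications usually rely on large reference populations.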

Genetic variants may involve the substitution, addition (insertion), or subtraction (deletion) of nucleotides, as well as cases of chromosomal rearrangements, such as translocations, duplications, and inversions. The simplest type of genetic variant is the single nucleotide polymorphism (SNP) or single nucleotide variation (SNV), in which one nucleotide base is substituted for another. Regardless of the variant type, the location of the genetic variant is a critical determinant of whether the genetic variant will lead to a functional consequence. Genetic variants can be broadly divided into two groups: variants that alter the amino acid sequence of proteins and variants that do not. Genetic variants that alter the protein sequence are more likely to have a deleterious effect than noncoding variants. The types of variants that change the amino acid sequence of proteins include missense variants, nonsense variants, splice-site variants, and insertions or deletions (indels). Missense variants are exonic SNPs that result in the substitution of one amino acid for another in protein products. Similarly, nonsense variants are SNPs that lead to the replacement of an amino acid with a stop codon, resulting in the premature termination of protein products. Indels may be further divided into in-frame indels, which add or subtract nucleotides in multiples of three and retain the mRNA reading frame, and frameshift indels, in which the nucleotide bases are added or subtracted in quantities that alter the mRNA reading frame after the indel. Finally, splice-site variants may affect the splice site at or near an intron-exon junction, leading to abnormally spliced mature transcripts. Although noncoding genetic variants do not alter the structure or sequence of proteins, they can modify regulatory regions of the genome in ways that ultimately change the level of gene expression. Genetic variants that correlate with the expression level of a gene are called expression quantitative trait loci (eQTLs).
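The distinctions among coding variant types can be made concrete with a short sketch. It uses the standard genetic code to label a single-base substitution within a codon, and the length difference between alleles to label an indel. The function names are invented, the "synonymous" label (a substitution that leaves the amino acid unchanged) is added for completeness, and the sketch assumes the reference codon encodes an amino acid rather than a stop.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_snv(ref_codon, position, alt_base):
    """Label a substitution at a 0-based position within a DNA codon."""
    alt_codon = ref_codon[:position] + alt_base + ref_codon[position + 1:]
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    if alt_aa == ref_aa:
        return "synonymous"
    if alt_aa == "*":
        return "nonsense"  # amino acid replaced by a stop codon
    return "missense"      # one amino acid replaced by another


def classify_indel(ref_allele, alt_allele):
    """Label an indel by whether it preserves the reading frame."""
    length_change = len(alt_allele) - len(ref_allele)
    if length_change == 0:
        raise ValueError("alleles have equal length; not an indel")
    if length_change % 3 == 0:
        return "in-frame indel"
    return "frameshift indel"
```

For example, `classify_snv("GAG", 1, "T")` changes GAG (glutamate) to GTG (valine) and is labeled a missense variant, while `classify_indel("ACT", "A")` removes two bases and is labeled a frameshift indel.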

Since the year 2000, there has been an explosion of population-based studies seeking to identify the scope of human genetic variation. Whereas the Human Genome Project sought to establish the general framework of the human genome through the examination of a small number of individuals, studies like the HapMap Project, the 1000 Genomes (1KG) Project, the Exome Aggregation Consortium (ExAC), and the Genome Aggregation Database (gnomAD) used genomic approaches to characterize genetic variation in reference populations from thousands to hundreds of thousands of subjects. The data sets generated by these projects are accessible through publicly available data browsers, and in some cases the primary data are also publicly available. As of August 2019, sequencing studies of humans have collectively identified over 695 million genetic variants in the human genome. An important limitation of these studies is that they have largely been undertaken in populations of European ancestry, leaving the rich pool of variation in African and Asian populations relatively unexplored. Fortunately, many large-scale studies of genetic variation in African and Asian populations are underway.

Techniques for Genetic and Genomic Investigation

Genetic and genomic investigations, whether for research or clinical purposes, depend on the ability of investigators to examine nucleic acid sequences and identify genetic variation. This requires tools to accurately assess genetic sequence and characterize sequence variation without bias. The methodological repertoire to achieve this task is ever growing and includes approaches that individually examine one or more known genetic markers, such as SNP genotyping and microsatellite markers, as well as a growing number of sequencing-based approaches that can be used to study individual genes but are also scalable to evaluate entire genomes.

SNP Genotyping

SNP genotyping is a method for interrogating known genetic variation at specific genomic positions. Most approaches to SNP genotyping employ pairs of sequence-specific oligonucleotides that specifically hybridize to either wild-type or variant alleles. These approaches are useful for examining individual SNPs, or they can be multiplexed to simultaneously interrogate hundreds of thousands to millions of SNPs across the entire genome. These approaches are also highly scalable to screen large numbers of patients simultaneously using high-throughput automated workflows. SNP genotyping has been widely adopted and used in genomic investigations of both monogenic and genetically complex diseases. In monogenic diseases, SNP genotype data serve as the basis for gene-mapping approaches, such as linkage analysis or homozygosity mapping, which seek to identify specific segments of the genome that harbor disease-causing mutations. SNP genotyping is also valuable for the investigation of genetically complex diseases, where large panels of SNP genotypes from across the genome are used to perform genome-wide association studies (GWAS), the gold standard investigative approach for the study of genetically complex diseases. In addition to their importance for gene-mapping studies, SNP genotyping arrays are also useful for sample identification, tracking, and management in biorepositories. Although SNP genotyping has been and continues to be valuable for examining known genetic variants, its future usefulness is limited by its inability to discover new genetic variants or mutations. To overcome this limitation, one must adopt sequencing-based approaches that “read” through the DNA sequence, one base at a time.
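The idea of paired allele-specific oligonucleotides can be sketched in a few lines. The example below is a deliberately idealized model in which a probe "hybridizes" only when it is perfectly complementary to a stretch of the sample strand; real assays measure graded signal intensities. The sequences, probe designs, and function names are invented for illustration.

```python
# Idealized allele-specific probe genotyping at one known SNP.
# Assumption: hybridization requires perfect complementarity.
# All sequences and names here are hypothetical.

DNA_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def hybridizes(probe, strand):
    """True if the probe is perfectly complementary to part of the strand."""
    return probe.translate(DNA_COMPLEMENT)[::-1] in strand


def call_genotype(ref_probe, alt_probe, chromosome_copies):
    """Call a genotype from which probes bind either chromosome copy."""
    ref_signal = any(hybridizes(ref_probe, copy) for copy in chromosome_copies)
    alt_signal = any(hybridizes(alt_probe, copy) for copy in chromosome_copies)
    if ref_signal and alt_signal:
        return "heterozygous"
    if ref_signal:
        return "homozygous reference"
    if alt_signal:
        return "homozygous variant"
    return "no call"


# A C>T substitution at a known position within a short invented locus.
ref_strand = "GATTACAGGCT"
alt_strand = "GATTATAGGCT"
ref_probe = "CCTGTAA"  # complementary to TTACAGG (reference allele)
alt_probe = "CCTATAA"  # complementary to TTATAGG (variant allele)

genotype = call_genotype(ref_probe, alt_probe, [ref_strand, alt_strand])
```

Because the probes are designed around a known position, this approach cannot detect a variant elsewhere in the locus, which mirrors the limitation noted above.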

Conventional DNA Sequencing

The first DNA sequencing approaches emerged and matured in the second half of the 20th century. These methods served as the basis for the Human Genome Project, which sought for the first time to sequence the entire human genome. In conventional sequencing by the Sanger method, genetic regions of interest are either cloned or amplified from genomic DNA by the polymerase chain reaction (PCR), and the cloning/amplification product is used as the template for dye-terminator sequencing reactions. The dye-terminator reaction products are then separated by capillary electrophoresis and the sequences are detected by a sequencing instrument, producing chromatograms or tracings of nucleotide sequences that reveal sites of sequence variation. Conventional sequencing approaches are still used today to investigate targeted regions of the genome for both clinical and research purposes. Clinically, conventional sequencing is used to identify mutations in subjects suspected to have known monogenic diseases, which might include Marfan syndrome, familial osteochondral dysplasia, or one of the monogenic autoinflammatory syndromes. In the research setting, conventional DNA sequencing can be used to confirm the accuracy of nucleic acid–based reagents, including manipulated DNA primers, constructs, and vectors.
