Clinical genome sequencing


Abstract

Background

Soon after whole genome sequencing was developed, it was validated for use in a clinical setting, where it is now known as clinical genome sequencing or cGS. cGS is used mostly for rare, undiagnosed diseases, where the patient’s symptoms and/or family history are consistent with an inherited disease. Although costs seem high, ending a diagnostic odyssey using cGS may be cost-effective in terms of overall health care costs. While clinical exome sequencing (cES) of all known genes is a less expensive option, genome sequencing has the potential to improve diagnostic yield through more consistent coverage of the exome and through inclusion of nonexome regions containing variations that contribute to disease.

Content

This chapter focuses on the unique aspects of cGS as a diagnostic test. Special considerations of standard molecular genetics laboratory processes that apply to cGS will be presented. Current challenges, limitations and opportunities for future directions will be discussed.

Introduction

The rapidly widening accessibility of massively parallel sequencing to clinical laboratories has resulted in a similarly rapid growth in physician requests for clinical genomic testing, analysis and interpretation. Clinical genomic sequencing (cGS) (also referred to as clinical whole genome sequencing, “WGS,” or whole genome analysis, “WGA”) is potentially the ultimate diagnostic test for inherited diseases, being able to identify sequence variation, copy number variation, and repetitive variation in intragenic coding and noncoding regions as well as in intergenic regions. This includes variation in all known nuclear genes, mitochondrial genes, and the human leukocyte antigen (HLA) region, although each of these regions may require specialized analytical pipelines for specific analyses.

Clinicians often request cGS on pediatric patients, including those in newborn intensive care units. The diagnostic yield is dependent on the indication for testing, ranging from about 33% to nearly 70%. The diagnostic yield and clinical usefulness of cGS were first established for neonatal intensive care unit (NICU) patients, with a diagnostic yield of 43%, leading to changes in management that reduced inpatient costs by $800,000 to $2,000,000. For such indications, cGS can increasingly be considered a first-tier diagnostic test. Similarly, many adults, some of whom have gone years without a diagnosis or with a suspected misdiagnosis, are also being tested and finally diagnosed by cGS. The range of application of cGS spans the gamut of genetic disease, including developmental, neurologic, muscular, gastrointestinal, and immunologic conditions. “Personal health” or “elective” cGS is gaining in popularity among healthy adults interested in learning their genetic risks for disease. Genomic sequencing can also identify pathogens in a patient’s specimen, expanding its diagnostic ability.

Some current limitations remain in technical alignment of difficult regions (such as some repetitive regions and pseudogenes) to reference sequences, and in interpretation of rare single nucleotide or small insertion and/or deletion variants and of variants in regulatory (cis and trans effects) and deep intronic regions. Advances in the technology, chemistry, and interpretive components are moving the field forward at an unprecedented pace to address these types of issues.

There are both similarities and differences between analyses undertaken in traditional genetic testing (see Chapter 68) and in cGS. For reference, the overall process for cGS, encompassing preanalytic, sequencing, bioinformatics, and postanalytic steps, is summarized in Fig. 66.1.

FIGURE 66.1, Process overview of clinical genome sequencing. The typical overall workflow of a clinical genome analysis is shown. As with other types of genetic tests, it commences with a laboratory test request/order and specimen collection and proceeds through “wet lab” sequencing. After sequencing, extensive bioinformatics analysis is performed, both to control and assure quality, as well as to perform preliminary genome annotation and interpretation. Finally, both bioinformatics and professional interpretation and reporting components are involved. Some differences from other types of pathology testing include the requirement for genetic counselling and consent, strict requirements for patient and sample identification, stringent monitoring of both batch and sample quality, and the ability to reanalyze the data from a previous analysis, without actually retesting or resequencing the sample.

An important conceptual difference between standard genetic testing and clinical genomic testing is that standard testing usually requires a working hypothesis of which gene(s) to examine. Traditional genetic testing is usually requested to confirm or rule out a clinical diagnosis based on the clinical symptoms and family history. Examples include testing for targeted variants (such as for the Factor V Leiden variant), single gene sequencing (such as for cystic fibrosis), or multi-gene sequencing panels (such as for hearing loss). In contrast, cGS and cES enable initial simultaneous examination of large numbers of (and in principle, “all”) genes and determine which genes might be affected in the patient. cGS (as well as cES) analysis is guided by the patient’s symptoms and signs (which collectively are referred to as their phenotype) but not necessarily with or by a diagnosis. Instead, the clinician looks to cGS to provide the definitive diagnosis (referred to as a molecular diagnosis) or perhaps to point the direction to a potential diagnosis that could be confirmed by other testing.

Although cGS and cES are similar in testing indications for undiagnosed disease, there are significant differences. Referring to Fig. 66.1, differences between cGS and cES are seen in panel 2 (Sequencing), in that cGS does not require capture or enrichment of target regions and the sample library can be sequenced directly. Although bioinformatics pipelines (panel 3) are similar, the vastly greater volume of data generated from genomes requires more computing and storage capacity. While the postanalytical reporting process is the same (panel 4), cGS interrogates variants in regions not covered by the exome that are less well characterized and understood (regulatory or deep intronic regions), so it may include more variants of uncertain significance (VUS). Intragenic regions of cGS are first analyzed similarly to cES, but analysis can be expanded if none of the variants identified is likely to be causative of the patient’s symptoms. Sometimes, only one pathogenic variant may be identified in a gene related to the patient’s symptoms where two variants would be required to cause disease in a recessively inherited condition, therefore giving only a partial answer for a recessive disease. In these cases, data generated from cGS can be analyzed beyond the coding regions to identify a possible second variant, although it would likely be classified as a VUS.

This chapter will enlarge on the aspects of cGS that are rapidly entering mainstream diagnostic medicine and will also consider some of the potential applications arising from the ability to reanalyze the genome at will. While cES overlaps considerably with cGS, we will comment where they differ. The chapter focuses on germline (constitutional) genomics and does not seek to cover somatic genomics.

Special considerations that apply to clinical genomic sequencing (cGS)

Consent

For individuals and families who have sought a diagnosis for years, their last hope may be that the genome holds the answers. Before commencing genetic analyses, it is important to set appropriate expectations and ensure that the patient has provided informed consent. Of the approximately 20,000 identified human genes, only about a third are currently known to be associated with a recognized disease, with new gene–disease associations being made at the rate of approximately one each day. The patient needs to understand that, due to this rapidly expanding but still very incomplete knowledge of the human genome and its function, a genetic test might result in a diagnostic finding with or without management and/or therapeutic implications, or in an ambiguous finding, or even in no informative finding. When no variants that could explain clinical findings are found, the answers may still be in the genome, but in regions or variants difficult to detect through NGS (e.g., repetitive regions or sequences with high GC content), or in regions that are not clinically interpretable at this time (e.g., regulatory, deep intronic), or in a combination of genetic and environmental factors (e.g., some autoimmune disorders). Genetic findings may also have significance for genetic relatives, and the extent of the patient’s consent needs to be known, including permission to inform relatives, or even permission to approach relatives to elucidate information relevant to the patient. Finally, the laboratory will need to know the patient’s wishes regarding Incidental Findings, and whether consent has been obtained to report any such findings. This is discussed further later in this section.

While most countries have similar approaches recognizing the autonomy and right of the patient to specify such consent, there will be differences between jurisdictions in relation to the age at which legal consent for testing can be given. There may also be regulatory differences in whether this consent can be given orally, or whether it requires written consent, and whether this consent is given to the referring clinician, or to the laboratory itself.

Samples

Most samples for clinical genomics use venous blood, with the DNA extracted from the nucleated white blood cells in the sample’s buffy coat. Newer sampling techniques are less invasive and can make use of DNA from buccal swabs or saliva samples. DNA isolated from saliva and buccal samples typically gives lower yield and concentration, which may affect sequencing coverage. Saliva and buccal specimens will have increased microbial DNA, which will also be included in the sequencing data generated. Microbial DNA will compete with the human DNA, resulting in possibly lower read depths for the human sequences. Quality metrics may need to be adjusted for the different sample types. Although other samples could in theory be used (e.g., hair, skin biopsy), clinical laboratories may not receive sufficient numbers of such samples to undertake validation of multiple sample types unless they operate in specific referral catchments (such as forensic populations). Other sample types may have a higher failure rate due to difficulties and variation in uniform extraction of high molecular weight DNA at sufficiently high yield (i.e., at least 0.5 μg). Note that while the germline (“chromosomal”) genome will be relatively invariant between these sample types, some genetic conditions have variants that are localized to particular cell lines and tissues, and these may be better detected in samples such as muscle biopsies or cells obtained from urine samples (which may reflect mitochondrial heteroplasmy better than peripheral blood samples).

Sequencing

As described in Chapter 65, the currently dominant sequencing platforms are based around massively parallel “short-read” sequencing technologies. These use “libraries” prepared by random fragmentation of the genome, or by random or targeted amplification of the genome, with both approaches sequencing fragments of size ∼100 to 500 nucleotides, followed by computational reassembly into the patient’s presumed full genome. Most clinical laboratories then use these libraries directly for sequencing, although some laboratories still use an intermediary polymerase chain reaction (PCR) amplification step to improve coverage for samples with limited DNA. The use of PCR-free cGS requires high quality and quantity of DNA.

At our current state of knowledge, the vast majority of disease-causing genetic variants have been found to lie within genes that code for expressed proteins; this class of genes (the “exome”) accounts for only ∼2% of the total genome (see Chapter 65). If one assumes that the patient’s disease-causing genetic variants will be found in one of these coding genes, then the sequencing strategy can target only the exome, and the analysis proceeds using cES. If one wishes to examine not only all the coding genes, but also to preserve the option to examine noncoding genes or intergenic regions at a subsequent analysis should no pathogenic coding variant be found, then the sequencing would proceed as cGS. Note that the terms cES and cGS are increasingly used in the United States, while internationally, the corresponding terms are “Exome Sequencing” and “Whole Genome Sequencing.”

cGS provides more analysis options than cES. Because of better coverage uniformity, large copy number variants (CNVs) are reliably detected at low read depths. CNV determination by genomic sequencing allows CNV characterization at a resolution not possible by other existing platforms. This “low pass” sequencing strategy for CNV data is expected to complement or even replace cytogenomic microarrays as a first-tier test for developmental delay and other indications. If “low pass” sequencing does not identify a causative CNV, generating additional sequences for higher read depths provides a way to search for sequence variants or small indels. Further details of bioinformatics for CNV calling are included below.
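The idea behind read-depth CNV detection can be sketched in a few lines: count reads in fixed genomic bins, normalize against a reference profile, and flag bins whose log2 ratio departs from zero. The bin counts, the thresholds, and the function names below are illustrative assumptions only; this is not a validated clinical algorithm, and real callers also correct for GC content and mappability and merge adjacent bins into segments.

```python
import math
import statistics

def log2_ratios(sample_counts, reference_counts):
    """Per-bin log2(sample / reference), each scaled by its own median count."""
    sample_median = statistics.median(sample_counts)
    reference_median = statistics.median(reference_counts)
    if sample_median <= 0 or reference_median <= 0:
        raise ValueError("median bin count must be positive")
    ratios = []
    for s, r in zip(sample_counts, reference_counts):
        # A small pseudocount avoids log(0) for empty bins.
        s_norm = (s + 0.5) / sample_median
        r_norm = (r + 0.5) / reference_median
        ratios.append(math.log2(s_norm / r_norm))
    return ratios

def call_bins(ratios, loss_threshold=-0.6, gain_threshold=0.4):
    """Label each bin; the thresholds are hypothetical, chosen for illustration.

    A heterozygous deletion is expected near log2(1/2) = -1 and a
    single-copy duplication near log2(3/2), roughly 0.58.
    """
    calls = []
    for value in ratios:
        if value <= loss_threshold:
            calls.append("loss")
        elif value >= gain_threshold:
            calls.append("gain")
        else:
            calls.append("neutral")
    return calls

# Toy example: bins 2-3 have half the expected reads, bin 5 has 1.5 times.
sample = [100, 100, 50, 50, 100, 150, 100, 100]
reference = [100] * 8
print(call_bins(log2_ratios(sample, reference)))
```

Because only relative depth per bin matters here, the same comparison works at low overall coverage as long as each bin still collects enough reads, which is the rationale for the "low pass" strategy.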

Using cES for sequencing is currently cheaper than cGS, mainly due to the difference in the target size, and thus the number of nucleotide bases to be sequenced, between the exome and genome. The subsequent data storage and bioinformatics analysis for cES are similarly simpler and cheaper to perform. However, because of inherent method bias in library construction (referred to as “bait capture” or “amplicon” bias) resulting from which exonic coding fragments are selected, it is necessary to sequence the exome with significant redundancy (typically, one sequences every region at a “depth” of ∼60- to 100-fold). In contrast, using cGS is more expensive than cES, but its comparative lack of sequencing bias and its uniformity of genomic coverage means that a sequencing depth of only ∼30-fold is sufficient to achieve adequate diagnostic sensitivity.
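The arithmetic behind this comparison can be made explicit. The sketch below assumes a haploid genome size of about 3.1 Gbp (an approximation not stated in the text) and uses the ∼2% exome fraction and the depths quoted above; it ignores off-target reads, duplicates, and other overhead, so real yields need to be higher.

```python
GENOME_BP = 3_100_000_000         # assumed approximate haploid genome size
EXOME_BP = GENOME_BP * 2 // 100   # "~2% of the total genome" per the text

def bases_required(target_bp, depth):
    """Aligned bases needed to cover target_bp at the given mean depth."""
    return target_bp * depth

def to_gbp(bases):
    """Convert a base count to gigabase pairs."""
    return bases / 1e9

cgs_30x = bases_required(GENOME_BP, 30)
ces_60x = bases_required(EXOME_BP, 60)
ces_100x = bases_required(EXOME_BP, 100)

print(f"cGS at 30x:  {to_gbp(cgs_30x):.1f} Gbp")
print(f"cES at 60x:  {to_gbp(ces_60x):.2f} Gbp")
print(f"cES at 100x: {to_gbp(ces_100x):.2f} Gbp")
print(f"cGS needs {cgs_30x / ces_100x:.0f}x to {cgs_30x / ces_60x:.0f}x more sequence than cES")
```

Under these assumptions, a 30-fold genome needs roughly 93 Gbp of aligned sequence, against about 4 to 6 Gbp for an exome at 60- to 100-fold depth, which is why cES remains cheaper despite its higher depth requirement.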

Bioinformatics

Chapter 65 describes many of the programs that are used in genomic testing. We describe here only those additional considerations particularly relevant to clinical genomics.

The data files required for undertaking clinical genomic analyses are large: it requires ∼200 GB to hold the key files (FASTQ, BAM, and VCF) for cGS (at 30-fold depth of coverage), while cES files are smaller (at ∼30 to 50 GB, depending on depth of coverage). Files of this size may require special handling; computer networks need to have sufficient capacity to transmit them, and special error-detection and correction protocols are needed to confirm complete and accurate copying of files. Backup and storage systems also must have sufficient capacity, especially as regulatory and accreditation agencies may require retention of this data for specified extended periods of time, often measured in many years.
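The text does not specify which error-detection protocol is used. One common approach, shown here as a minimal sketch, is to compare a cryptographic checksum of the source file with that of the copy, reading in chunks so that a 200 GB file never has to fit in memory. The function names and the chunk size are illustrative choices.

```python
import hashlib

def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def copy_is_intact(source_path, copy_path):
    """True if the source and the copy have identical SHA-256 digests."""
    return sha256_of_file(source_path) == sha256_of_file(copy_path)
```

In practice the digest of the original would usually be computed once and stored alongside the file, so that later copies and restores can be checked without rereading the source.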

The most widely used bioinformatics algorithms (based on the Genome Analysis Toolkit/Best Practices workflow [https://gatk.broadinstitute.org/hc/en-us]) are designed to detect single nucleotide variants (SNVs) and small insertions/deletions (indels) up to 20 nucleotides in size, and these pipelines and protocols are relatively mature and stable, with analytical sensitivity routinely exceeding 95% for indels, and 99% for SNVs. For larger structural variants (SVs) and copy number variants (CNVs), different or supplementary algorithms are required, and this newer field is still under active development with no established consensus on the optimal algorithms to be used. Note that cGS will have inherently better detection and resolution for CNV/SV than will cES, because cES lacks much of the genome in its sequence coverage: the limit of detection of CNV by cES is typically one to two exons in size, while the limit of detection of CNV by cGS has been clinically validated down to ∼500 nucleotides with diagnostic sensitivity exceeding 95%, and in research settings can be as low as ∼50 nucleotides but with lower diagnostic sensitivity.

Storage of genomic information also may need additional security strategies implemented to protect data from potential loss or improper access. For example, to lessen or prevent unauthorized access (e.g., by a “hacker”), it is usual practice to store the genomes without any personally identifying information (such as names, dates of birth, health record numbers, etc.), and to store this personally identifying information separately and in a completely different computer system. In this way, an unauthorized breach of either system on its own would not be sufficient to enable matching of a genome with the identity of the patient.
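The separation described above can be sketched as two stores that share only a random token. In this hedged illustration the two dictionaries stand in for two physically separate systems, and `register_sample` and `reidentify` are hypothetical names; a real deployment would add access control, auditing, and encryption.

```python
import secrets

def register_sample(identity_store, genome_store, patient_info, genome_path):
    """File identity and genome under one random token in two separate stores."""
    token = secrets.token_hex(16)  # random, carries no patient information
    identity_store[token] = dict(patient_info)  # names, dates of birth, record numbers
    genome_store[token] = genome_path           # genomic data only, no identifiers
    return token

def reidentify(identity_store, genome_store, token):
    """Matching a genome to a patient requires access to both stores."""
    return identity_store[token], genome_store[token]

# The dictionaries below are placeholders for two independent computer systems.
identity_system = {}
genome_system = {}
token = register_sample(
    identity_system,
    genome_system,
    {"name": "Example Patient", "date_of_birth": "2000-01-01"},
    "genomes/sample_0001.cram",
)
print(genome_system[token])
```

Someone who obtains only the genome store sees tokens and file locations, and someone who obtains only the identity store sees names without any sequence data.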

Storage policies for types of files and time kept may differ between countries, and within a country between clinical laboratories. The final clinical reports and VCF files may be kept indefinitely because of their small file size. BAM and FASTQ files are much larger and clinical laboratories may not have the capacity to store these indefinitely. FASTQ and/or BAM files are typically kept for several years. Because BAM and VCF files can be regenerated from FASTQ files, retaining only the FASTQ files may be a strategy to reduce storage costs, and storing the FASTQ files using a CRAM lossless compression format may permit further savings. Costs for storage are decreasing, and PHI-secure cloud-based computing may allow long term storage of even the large files. With sequencing costs continuing to decrease, an alternative strategy for the future might involve long-term storage of a portion of the patient’s DNA, allowing for future re-sequencing, possibly with long or phased reads, as technology improves.

Quality control

Sequencing data is first analyzed for quality and coverage. Clinical laboratories establish their quality metrics and acceptance criteria during assay development, and these are used during validation to ensure that the test can reproducibly meet or exceed the quality standards. Guidelines for validating NGS data pipelines have been published.

The first steps are to consider batch metrics, reviewing all samples sequenced together to ensure no issue has occurred at the run level that would affect downstream analysis. While some metrics will indicate that a sample may require further testing, other metrics are “monitor only” and are used for monthly, bi-annual, or annual quality control statistics to look for trends in the data. Such trends may indicate issues with reagents, sequencing flow cells or other issues that may be difficult to see when run metrics are analyzed individually.

Sample-specific metrics that are often monitored include sequencing, alignment, and variant statistics. Given the complexity of the data and that some metrics may be population specific, a sample with one or two outliers may have a biological cause and may not change with resequencing a repeat sample. For this reason, individual metrics outside of the expected or average range are not considered automatic failures. However, any sample having multiple metrics outside the expected range should be investigated. Examples of metrics are given in Table 66.1; note that this is not a comprehensive list and clinical laboratories may monitor additional metrics.
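The triage logic in the preceding paragraph (a single outlier is noted, multiple outliers trigger investigation) can be sketched as follows. The metric names and accepted ranges are hypothetical placeholders; each laboratory sets its own values during validation.

```python
# Hypothetical acceptance ranges (low, high); not taken from any guideline.
ACCEPTED_RANGES = {
    "pct_q30": (85.0, 100.0),
    "pct_aligned": (95.0, 100.0),
    "pct_duplicate": (0.0, 15.0),
    "mean_coverage": (28.0, 60.0),
    "ti_tv": (1.9, 2.2),
}

def out_of_range(metrics, ranges=ACCEPTED_RANGES):
    """Return the names of reported metrics that fall outside their range."""
    flagged = []
    for name, (low, high) in ranges.items():
        if name in metrics and not (low <= metrics[name] <= high):
            flagged.append(name)
    return flagged

def triage(metrics, ranges=ACCEPTED_RANGES):
    """No outliers: pass. One outlier: monitor. Two or more: investigate."""
    flagged = out_of_range(metrics, ranges)
    if not flagged:
        return "pass", flagged
    if len(flagged) == 1:
        return "monitor", flagged
    return "investigate", flagged

sample_metrics = {
    "pct_q30": 91.5,
    "pct_aligned": 90.2,   # low, e.g. microbial DNA in a saliva sample
    "pct_duplicate": 9.0,
    "mean_coverage": 24.0, # low
    "ti_tv": 2.05,
}
print(triage(sample_metrics))
```

Treating a lone outlier as "monitor" rather than "fail" mirrors the point that a single unusual value may have a biological cause and may not change on resequencing.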

TABLE 66.1
Examples of Quality Metrics That May Be Monitored

| Metrics | Description | Possible Action |
| --- | --- | --- |
| Sequencing Statistics | | |
| Pass-filter yield | Typically >105 Gbp for 30× coverage | If low, add sequencing reads |
| %≥Q30 | The percentage of bases with a Phred-scale quality greater than or equal to 30 | Consider 2nd library |
| Alignment Statistics | | |
| % Aligned | Low % aligned may indicate contamination or bacterial DNA in the sample (seen in saliva/cheek swab samples) | Consider 2nd library, 2nd extraction of original sample, or 2nd sample |
| % Duplicate | High duplication rate may indicate a clustering issue and result in reduced coverage | 2nd library prep if additional metrics are out |
| Median insert size | Monitor | If low, consider 2nd library prep |
| % Discordant | Percentage of aligned reads in which each fragment aligned to different chromosomes or at inappropriate distances | If high, consider 2nd library prep |
| Mean/median coverage | 28× for a 30× genome | If low, consider additional sequencing; if high, other metrics may be affected |
| % Coverage | Percent coverage at different read depths such as 1×, 20×, or 30× | If low, consider additional sequencing |
| Variant Statistics (SNVs, Deletions, or Duplications Have Different Metrics) | | |
| Ti/Tv ratio | Transition/transversion ratio. May indicate artifactual variants in GC-rich regions; possible increase in false positive calls | If significantly different from the typical range, consider 2nd library prep |
| Het/Hom ratios | May be population specific, or indicate contamination, chimerism, mosaicism | Assess whether the issue occurs genome-wide or is chromosome specific |

SNV, single nucleotide variant.
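The Ti/Tv ratio listed in Table 66.1 is straightforward to compute from a list of SNV calls. The sketch below assumes each call is given as a (reference base, alternate base) pair; the function names are illustrative. Transitions are purine-to-purine (A↔G) or pyrimidine-to-pyrimidine (C↔T) changes, and all other substitutions are transversions.

```python
PURINES = frozenset("AG")
PYRIMIDINES = frozenset("CT")

def is_transition(ref, alt):
    """True for A<->G or C<->T substitutions."""
    pair = {ref.upper(), alt.upper()}
    return len(pair) == 2 and (pair <= PURINES or pair <= PYRIMIDINES)

def ti_tv_ratio(snvs):
    """Transition/transversion ratio for an iterable of (ref, alt) base pairs."""
    transitions = 0
    transversions = 0
    for ref, alt in snvs:
        if ref.upper() == alt.upper():
            continue  # not a substitution; skip
        if is_transition(ref, alt):
            transitions += 1
        else:
            transversions += 1
    if transversions == 0:
        raise ValueError("no transversions observed; ratio is undefined")
    return transitions / transversions

calls = [("A", "G"), ("C", "T"), ("G", "A"), ("T", "C"), ("A", "C"), ("G", "T")]
print(ti_tv_ratio(calls))
```

Random sequencing errors tend to pull this ratio toward the value expected if all substitutions were equally likely (0.5, since each base has one possible transition and two possible transversions), which is why a marked drop can point to artifactual calls.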

Genomic data can also be used to check for the possibility of a sample mix-up, contamination, and relatedness. Although short tandem repeat (STR) testing is an easier and less expensive way to determine identity, contamination and parentage, if trios (proband and parents) are sequenced, genomic data can be used for these purposes without the additional use of STRs. The correlation of Y chromosome specific variants to the reported sex of the patient is a simple first check to ensure sample integrity. If parents and/or siblings of the affected proband are tested to aid interpretation, genomic data can also predict and confirm family relationships and kindred pedigrees (parent-child or siblings). First degree relatives are expected to share 50% of the genome while second degree relatives share 25%. Evaluating parentage is particularly important when reporting de novo variants. Relatedness analysis may reveal regions of homozygosity or other incidental findings of suspected consanguinity. Laboratories should have policies for managing any discrepancies found. Professional societies such as the American College of Medical Genetics and Genomics (ACMG) can help give direction on how to manage these sensitive issues.
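One simple way to use trio data for the parentage check described above is to measure how often the proband's genotype cannot be explained by one allele from each parent. The sketch below is a simplified illustration: it assumes autosomal sites with complete genotype calls, represents each genotype as a pair of allele codes, and uses hypothetical function names. A handful of inconsistent sites are candidate de novo variants or genotyping errors, whereas a high genome-wide rate suggests a sample mix-up or incorrectly stated parentage.

```python
def mendelian_consistent(child, mother, father):
    """True if the child could inherit one allele from each parent."""
    a, b = child
    return (a in mother and b in father) or (b in mother and a in father)

def inconsistent_sites(trio_sites):
    """Indices of sites where the child's genotype violates Mendelian inheritance."""
    return [
        index
        for index, (child, mother, father) in enumerate(trio_sites)
        if not mendelian_consistent(child, mother, father)
    ]

def inconsistency_rate(trio_sites):
    """Fraction of sites that are Mendelian-inconsistent."""
    if not trio_sites:
        raise ValueError("no sites provided")
    return len(inconsistent_sites(trio_sites)) / len(trio_sites)

# Each entry is (child, mother, father); 0 = reference allele, 1 = alternate.
sites = [
    ((0, 1), (0, 0), (0, 1)),  # consistent: alternate allele from father
    ((1, 1), (0, 1), (0, 1)),  # consistent: one alternate allele from each parent
    ((0, 1), (0, 0), (0, 0)),  # inconsistent: candidate de novo variant
    ((0, 0), (0, 1), (1, 1)),  # inconsistent: father cannot pass a reference allele
]
print(inconsistent_sites(sites), inconsistency_rate(sites))
```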
