The importance of enrichment assay choice and optimisation for confident variant detection
Simon Hughes and Daniel Swan
Next generation sequencing (NGS) is now in routine use for a broad range of research and clinical applications. The rapid rate of adoption has been facilitated by falling reagent costs, benchtop instruments, improved chemistries and improved data analysis solutions. However, the cost and complexity of data analysis still remain significant hurdles — particularly for whole genome sequencing. In the majority of cases, targeted approaches, such as whole exome or custom panels, are more cost-effective and generate significantly less, but equally meaningful, data.
Targeted sequencing requires an initial sequence enrichment step, which, if poorly designed, can be a source of bias and error in the downstream sequencing assay [1]. This article discusses the main strategies employed to optimise the enrichment step, depending on the type of assay chosen.
Exome or custom panel?
It is best to first consider the choice of sequencing strategy: whole exome or custom targeted panel? In some instances, the answer will be obvious: if one wishes to include non-coding regions, or to construct a panel that includes only known, actionable variants, a whole exome experiment will not serve these needs economically.
Beyond this, however, the choice is dictated by a number of variables:
- The desired depth of sequencing: In heterogeneous samples, such as tumours, where it is important to be able to detect variants present in only a small proportion of the total reads, it is more cost-effective to reach the required depth with a smaller panel
- The total number of samples to be sequenced (if only one or two, it may be more cost-effective to go with an off-the-shelf exome kit)
- The project budget (a small panel is much cheaper to sequence than a whole exome, as many samples can be multiplexed in a single sequencer run)
- The performance requirements: sensitivity, accuracy and complete coverage of all genes can be achieved more effectively with custom designed panels
- For many inherited disorders, a very high “hit rate” for variants of interest can be achieved more cost effectively with a targeted panel of known or candidate genes
- For some clinical research applications, it is desirable to avoid unsolicited findings by focusing only on genes relevant to the disorder under investigation
- Smaller panels can be run on small, less expensive benchtop instruments such as the Illumina MiSeq™, which are more readily accessible to many labs
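The trade-off between panel size, depth and multiplexing described in the list above can be put into rough numbers. The following is a minimal sketch only: the run output, target sizes, depths and on-target fraction are illustrative assumptions, not figures from any particular instrument or kit.

```python
def samples_per_run(run_output_bases, target_size_bases, mean_depth,
                    on_target_fraction=1.0):
    """Estimate how many samples can share one sequencing run.

    All inputs are user-supplied assumptions:
      run_output_bases   - total bases produced by one run
      target_size_bases  - size of the enriched region
      mean_depth         - desired mean depth over the target
      on_target_fraction - share of sequenced bases that fall on target
    """
    if not 0 < on_target_fraction <= 1:
        raise ValueError("on_target_fraction must be in (0, 1]")
    usable_bases = run_output_bases * on_target_fraction
    bases_needed_per_sample = target_size_bases * mean_depth
    return int(usable_bases // bases_needed_per_sample)


# Hypothetical run producing 16 Gb with 65% of bases on target.
RUN_OUTPUT = 16e9
ON_TARGET = 0.65

# Hypothetical 50 Mb exome at 100x versus a 0.5 Mb panel at 500x.
exome_samples = samples_per_run(RUN_OUTPUT, 50e6, 100, ON_TARGET)
panel_samples = samples_per_run(RUN_OUTPUT, 0.5e6, 500, ON_TARGET)
print(exome_samples, panel_samples)
```

Under these assumed values the small panel reaches five times the depth while still allowing many more samples per run, which is the economic argument made for panels above.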
Which enrichment assay?
Two broad categories of enrichment assays exist: amplicon (PCR) and hybridisation (Figure 1). As a very general rule, hybridisation-based assays, when designed well, offer superior performance. Hybridisation protocols start with random shearing of the DNA, followed by “capture” of the randomly sheared, overlapping fragments with long oligonucleotide (oligo) baits. This allows independent sequencing of a large number of unique fragments. Any duplicates (assay artefacts) can be easily identified and removed, leaving high-quality data for analysis. Because the fragments are randomly sheared, they should not align perfectly with one another; if they do, they are almost certainly duplicates. In addition, enrichment of challenging regions such as GC-rich regions or internal tandem repeats can be optimised by careful positioning and design of baits. Long oligo baits can tolerate sequence variation, so that all alleles of a heterogeneous mix can be captured equally. Amplicon assays require design of primers flanking the region to be amplified. The resulting amplification products are identical, such that duplicates cannot be distinguished from unique products.
Figure 1: Schematic representations of amplicon and hybridisation enrichment approaches. (A) Hybridisation assays begin with random shearing of the genomic DNA, followed by capture using long oligonucleotide baits. Because of this random shearing, fragments captured are overlapping and unique. Baits can be tiled, overlapped and positioned to overcome challenges of repetitive sequences etc. With advanced design, capture can be made very uniform. (B) Amplicon assays provide less flexibility in the positioning and design of primers – primer pairs need to flank the region to be targeted. All fragments generated from a single primer set are identical, with the disadvantage that assay artefacts cannot be distinguished from genuine variation. Primer competition and preferential amplification of some regions over others will lead to non-uniform enrichment.
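The duplicate-removal logic described above can be sketched in a few lines. This is a simplified illustration, assuming each read is represented as a dictionary with hypothetical keys `chrom`, `start`, `end` and `strand`; production tools apply more refined rules (for example, choosing which copy to keep based on base quality).

```python
def remove_duplicates(reads):
    """Keep the first read seen for each alignment signature.

    Randomly sheared fragments rarely share both end coordinates, so
    reads with an identical (chrom, start, end, strand) signature are
    treated as copies of one original template.
    """
    seen = set()
    unique = []
    for read in reads:
        signature = (read["chrom"], read["start"], read["end"], read["strand"])
        if signature not in seen:
            seen.add(signature)
            unique.append(read)
    return unique


# Hybridisation-style reads: staggered coordinates, one exact repeat.
hybrid_reads = [
    {"chrom": "chr1", "start": 100, "end": 250, "strand": "+"},
    {"chrom": "chr1", "start": 130, "end": 290, "strand": "+"},
    {"chrom": "chr1", "start": 100, "end": 250, "strand": "+"},  # duplicate
    {"chrom": "chr1", "start": 160, "end": 310, "strand": "-"},
]

# Amplicon-style reads: every read spans the same primer-defined interval.
amplicon_reads = [
    {"chrom": "chr1", "start": 100, "end": 250, "strand": "+"}
    for _ in range(4)
]

print(len(remove_duplicates(hybrid_reads)))
print(len(remove_duplicates(amplicon_reads)))
```

Applying the same rule to the amplicon-style reads collapses every read into one, which shows why coordinate-based duplicate removal is not informative for amplicon data: genuine independent templates and PCR copies look the same.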
In making the choice between hybridisation and amplicon approaches, there are several factors which are worth considering (Figure 2).
Figure 2: A number of key factors are important in selection of the most appropriate enrichment assay for a given application.
1. Size of the region to be targeted
- Hybridisation-based assays are amenable to any size of target region, from very small to very large (i.e. whole exome).
- Amplicon assays, although ideal for small numbers of well-defined regions, are challenging to multiplex to any great extent. As the degree of multiplexing and/or the number of PCR cycles increases, so does the tendency towards bias and error. Primer competition and non-uniform amplification of target regions caused by varied GC content or amplicon length contribute to variation in amplification efficiency. Some platforms attempt to overcome this by performing hundreds to thousands of singleplex reactions which are then combined prior to sequencing.
2. Required turnaround time
- Hybridisation assay protocols are more time-consuming than amplicon approaches. Protocols can take two to three days, and require a number of manual steps. They are readily automated (e.g. Agilent SureSelect™), but require some optimisation for non-standard samples.
- For speed and simplicity, PCR-based approaches are excellent. Protocols are normally fast, only a few hours in duration, with minimal steps involved and use only standard laboratory equipment. It is important to note, however, that PCR itself is the most common source of bias and error in any enrichment assay, so this advantage of speed may need to be balanced with the requirement for high-quality data.
3. Ability to optimise for challenging regions
- Some genes such as FLT3 contain internal tandem duplications that are challenging to target using amplicon approaches because they are by nature repetitive and can be very long. Hybridisation assays offer more scope to position baits and optimise enrichment in these areas.
- Other challenges arise for amplicon assays when novel variants are present within the primer site, as these can result in strand or allelic bias, or even drop out of that region altogether. Hybridisation assays, on the other hand, are less restricted by variant position and can still enrich all strands and alleles equally even in the presence of novel variants.
- Similarly, with regions that are GC-rich, hybridisation baits can still be designed to capture efficiently giving much more uniform coverage than amplicon assays.
4. Robustness with challenging samples
Samples vary in quality and quantity so any assay must be able to deal with a wide range of input DNA types and input quantities:
- FFPE samples: Both hybridisation and amplicon assays can be optimised to perform well with FFPE. The hybridisation-based SureSeq™ Solid Tumour Panel, for example, had a <2% failure rate in a study of >200 clinical samples, so long as the average DNA fragment size was >1000 bp. While PCR methods can be susceptible to contaminants found in FFPE material, amplicon-based assays also tend to work well with FFPE, even with samples where the DNA is heavily fragmented.
- Starting quantity of DNA: Amplicon assays offer a slight advantage in being able to work with smaller quantities of input DNA, often down to <10 ng. Hybridisation assays generally require more input DNA (typically >3 µg), although well-designed hybridisation assays can use significantly less (e.g. the SureSeq™ Solid Tumour Panel requires >100 ng).
- Duplicates: It is important to note that with small starting quantities, the actual number of templates available is relatively low, so duplication rates increase significantly. With hybridisation assays, these can be removed to leave clean, high-quality data. With amplicon assays, this is not straightforward, and the resulting data may be skewed by the over-amplification of a small number of fragments.
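The point about limited template numbers at low input can be made concrete with a back-of-the-envelope calculation. The sketch below assumes a haploid human genome mass of roughly 3.3 pg, which is an approximation, and ignores losses during library preparation and the effect of fragmentation.

```python
# Approximate mass of one haploid human genome in picograms (assumption).
HAPLOID_GENOME_PG = 3.3


def genome_copies(input_ng):
    """Approximate number of haploid genome copies in a DNA input."""
    return input_ng * 1000 / HAPLOID_GENOME_PG


def expected_variant_copies(input_ng, allele_fraction):
    """Expected number of template copies carrying a given variant."""
    return genome_copies(input_ng) * allele_fraction


for ng in (10, 100):
    print(ng, round(genome_copies(ng)), round(expected_variant_copies(ng, 0.01)))
```

On these assumptions, a 10 ng input holds only a few thousand copies of any locus, and a variant present at 1% is carried by a few dozen of them. Sequencing far deeper than the number of available templates therefore mostly re-reads the same molecules, which is why duplication rates rise at low input.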
5. Requirement for accuracy and sensitivity
Performance should be a key requirement for all applications: the confidence that the assay will detect all variants present in any region of interest, while avoiding false negatives and false positives. Hybridisation assays offer a number of key benefits which enhance performance:
- Reduced false negatives: The most common reason for false negatives in a targeted sequencing assay is poor coverage at the locus. Hybridisation assays, when well designed, can deliver superior uniformity of enrichment and excellent coverage of all loci, thereby reducing the incidence of false negatives.
- Reduced false positives: The most common cause of false positives is artefacts introduced by PCR polymerases, even when using proofreading enzymes. Hybridisation assays use very few PCR cycles in comparison to amplicon assays, and therefore the data is less “noisy”.
- Higher detection sensitivity: The reduced “noise” in the hybridisation assay also delivers higher detection sensitivity of variants present at low frequency in the sample.
- Broader scope for discovery of novel as well as known variants: It can be challenging to validate an assay for the discovery of novel variants across large numbers of genes. However, because the ability to detect variants with sensitivity and accuracy largely relies on the depth of coverage at the locus, as well as quality and accuracy of the data, hybridisation-based assays can offer greater confidence in detection of all variants present.
6. Price
- Price is always an important factor. For larger regions, hybridisation-based panels are very cost-effective. For smaller regions, amplicon assays can be more cost-effective because the cost of a small number of primers is low. Price must always be considered in parallel with the performance requirements of the application of interest and with the cost of any additional sequencing required downstream.
The importance of optimising the enrichment assay
As NGS moves towards a clinical setting, the aspiration is clear: confident calling of all variants present with no false negatives and no false positives. The most common reason for missing variants (i.e. false negatives) is lack of coverage at the variant locus due to non-uniform enrichment. The most common source of false positives is the PCR artefacts that are created in the enrichment step. Therefore it is clear that optimisation of the enrichment assay is important.
The Spotlight (below) provides a summary of all of the potential sources of error and bias. As a very broad rule, hybridisation-based assays offer greater opportunity for optimisation through probe design and placement, and they can offer better uniformity of coverage, fewer false positives, and superior variant detection due to fewer PCR cycles. Hybridisation-based assays also offer greater scope in terms of the number of genes and regions that can be targeted.
Uniformity of enrichment as a key metric of performance
The ultimate goal of any sequencing assay is to discover all variants present. Uniformity of enrichment means that all regions are represented more equally, and that variants present in any region will be called. It also allows much lower average sequencing depths to be used, enabling larger numbers of samples to be multiplexed in a run, and significant cost savings.
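One common way to quantify uniformity is the fraction of target bases covered to at least some fraction of the mean depth; the 0.2x threshold used below mirrors the metric quoted in Figure 4. The sketch assumes per-base depths are already available as a list of numbers, and the example depth profiles are invented for illustration.

```python
def fraction_above_threshold(depths, fraction_of_mean=0.2):
    """Fraction of positions with depth >= fraction_of_mean * mean depth."""
    if not depths:
        raise ValueError("depths must not be empty")
    mean_depth = sum(depths) / len(depths)
    threshold = fraction_of_mean * mean_depth
    covered = sum(1 for depth in depths if depth >= threshold)
    return covered / len(depths)


# Two invented depth profiles with the same mean depth (100x).
uniform_depths = [90, 100, 110, 95, 105]
uneven_depths = [400, 60, 30, 5, 5]

print(fraction_above_threshold(uniform_depths))
print(fraction_above_threshold(uneven_depths))
```

Both profiles consume the same amount of sequencing, but in the uneven one two of the five positions fall below the threshold. Raising the average depth to rescue those positions would mean over-sequencing the already well-covered ones.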
Uniformity is particularly important when looking at heterogeneous samples, for example tumour mixed with normal tissue, or somatic variants present only within a single clone in a heterogeneous tumour sample, where it is essential to have enough reads to confidently call a variant at any given position. It is also important in other high-sensitivity applications such as pathogen genome enrichment from host genomic DNA. Figure 3 gives an example of poor coverage due to non-uniform enrichment.
Figure 3: The importance of uniformity for confident detection of variants. This illustration shows a genomic region containing 6 variants (snv1-6). Sequencing reads are shown in grey. In this example, enrichment is not uniform, so some regions are sequenced to greater depth than others. For example, snv2 is not covered by any reads at all, while snv1, snv4 and snv5 are only covered by a small number of reads. Fewer reads make it challenging to call variants confidently, which is particularly limiting when there are low frequency (e.g. novel somatic) variants present, as illustrated with the “G” allele in snv1. Increasing the average sequencing depth through additional sequencing will help improve coverage at snv1, snv4 and snv5 but will not improve coverage at snv2, which is not enriched at all.
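The effect of local depth on the detection of low-frequency variants can be illustrated with a simple binomial sampling model. This is a sketch under stated assumptions: reads are treated as independent draws, sequencing and PCR errors are ignored, and the requirement of at least three supporting reads is an arbitrary illustrative calling threshold.

```python
from math import comb


def prob_at_least_k_variant_reads(depth, allele_fraction, min_reads):
    """Probability of observing at least min_reads variant-supporting
    reads at a locus, under a binomial sampling model."""
    prob_fewer = sum(
        comb(depth, i)
        * allele_fraction ** i
        * (1 - allele_fraction) ** (depth - i)
        for i in range(min_reads)
    )
    return 1 - prob_fewer


# A variant at 5% allele fraction, requiring >= 3 supporting reads.
p_low_depth = prob_at_least_k_variant_reads(20, 0.05, 3)
p_high_depth = prob_at_least_k_variant_reads(500, 0.05, 3)
print(p_low_depth, p_high_depth)
```

Under this model the variant is very unlikely to reach the threshold at 20x (well under a 10% chance) and almost certain to reach it at 500x. A locus with zero depth, like snv2 in Figure 3, has no chance of detection however the rest of the region is sequenced.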
Spotlight: Potential sources of bias and error in sequence enrichment assays
Even proofreading PCR polymerases introduce errors. The likelihood of artefacts (as well as the rate of duplication) increases with increasing PCR cycles. Amplicon-based assays are reliant entirely on PCR, and are potentially more susceptible to artefacts than hybridisation-based assays, which aim to minimise the number of PCR cycles. Hybrid assays such as Agilent HaloPlex™, which combine PCR and hybridisation approaches, may also benefit from a reduced overall number of PCR cycles.
Duplicate reads from a single template
Duplicates are amplification artefacts, normally arising during the library preparation stage. It is highly desirable to remove them prior to data analysis, otherwise some regions may be massively over-represented. In a hybridisation-based assay, duplicate reads can be identified easily and removed. In a PCR assay, however, duplicates cannot be removed during the analysis step, and there are generally more of them because of the enrichment step itself. Duplicates are a particular problem when input quantities of DNA are limiting, because higher numbers of PCR cycles need to be used.
Bias in PCR amplification — PCR drift and PCR selection
PCR is difficult to multiplex where consistent and uniform amplification of each region is desired. There are thought to be two main processes that introduce amplification bias during PCR: PCR selection and PCR drift [2]. In PCR selection, some amplicons are favoured and therefore over-represented due to intrinsic properties of the target sequence, flanking sequences or genome composition. Key contributors to this type of variation include preferential denaturation and amplification of low GC content templates, higher binding efficiency of GC-rich primers (particularly when using degenerate primers) and direct correlation between amplification efficiency and gene copy numbers. PCR drift is assumed to be caused by random interaction of the components of the mix early on in the amplification when the original genomic material is still the main source of template. This type of variation is variable between reactions and more difficult to control. There are ways to optimise PCR that will reduce bias [1, 2], including increasing the amount of template, reducing the number of cycles, optimising the instrumentation and performing the multiplex in a number of discrete, lower-plex reactions which are then pooled after amplification.
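The reason that reducing the number of cycles reduces bias can be seen from the standard exponential model of PCR, in which each cycle multiplies the product by (1 + efficiency). The sketch below uses this idealised model with two hypothetical per-cycle efficiencies; real reactions plateau and efficiencies change over the course of the run.

```python
def fold_amplification(efficiency, cycles):
    """Idealised fold amplification after a number of PCR cycles.

    efficiency is the per-cycle efficiency between 0 and 1
    (1.0 means perfect doubling every cycle).
    """
    return (1 + efficiency) ** cycles


def representation_ratio(efficiency_a, efficiency_b, cycles):
    """How over-represented amplicon A is relative to amplicon B,
    assuming equal starting amounts."""
    return (fold_amplification(efficiency_a, cycles)
            / fold_amplification(efficiency_b, cycles))


# Hypothetical amplicons with 95% and 85% per-cycle efficiency.
for cycles in (10, 20, 30):
    print(cycles, round(representation_ratio(0.95, 0.85, cycles), 2))
```

Even a modest difference in assumed per-cycle efficiency compounds with every cycle, so the skew between the two amplicons grows steadily as cycles are added. This is consistent with the advice above to use more template and fewer cycles.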
Variant positional bias in amplicon assays
Variant positional bias is a particularly important consideration for amplicon assays where there is some constraint in choice of position for primer sites. If a SNP or variant happens to fall within the primer site itself, it will not be detected. This is less likely to be an issue with hybridisation-based methods that use long oligonucleotides that can tolerate target sequence variation and can be tiled across the region to be enriched.
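A simple design-time check for this problem is to screen known variant positions against the primer binding sites. The following is a minimal sketch with invented coordinates and primer names; it assumes all positions are on the same chromosome and that primer sites are given as inclusive start and end coordinates.

```python
def variants_in_primer_sites(variant_positions, primer_sites):
    """Return (position, primer_name) pairs for every variant that
    falls inside a primer binding site.

    primer_sites is a list of (name, start, end) tuples with
    inclusive coordinates.
    """
    hits = []
    for position in variant_positions:
        for name, start, end in primer_sites:
            if start <= position <= end:
                hits.append((position, name))
    return hits


# Hypothetical primer pair flanking a 200 bp target region.
primers = [
    ("amplicon1_fwd", 1000, 1020),
    ("amplicon1_rev", 1221, 1240),
]
known_variants = [1010, 1100, 1230, 1300]

print(variants_in_primer_sites(known_variants, primers))
```

A check of this kind only catches variants that are already catalogued. Novel variants under a primer cannot be screened for in advance, which is where the tolerance of long hybridisation baits to sequence variation becomes relevant.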
Repeat regions and pseudogenes
Repeat regions and pseudogenes represent significant challenges for all enrichment technologies. Depending on the size of the region, hybridisation-based assays may allow better targeting of these regions, by allowing design of baits to flanking regions.
Bias in hybridisation-based assays
Like PCR amplification, enrichment by hybridisation is prone to bias. This is largely due to the same issues as for PCR amplification, namely GC content and sequence context. It can mean that some regions are not enriched at all, and any variants present in those regions will be missed. Hybridisation-based assays nevertheless offer an advantage here: there is more freedom in the choice of location for hybridisation probes, and because these probes can be tiled across regions, it is possible to optimise the hybridisation process more easily than for multiplex PCR, maximising the uniformity of enrichment.
The choice of enrichment assay for targeted sequencing assays is an important consideration. No enrichment technology is ideal for every application. The choice will depend largely on the size of region to be targeted, cost of sequencing and the required performance of the assay. Optimisation is also important and hybridisation-based assays offer more scope for superior performance through optimisation of bait design.
Oxford Gene Technology (OGT) offers Genefficiency™ SureSeq and SureSeq Custom NGS Services, a tailored approach to targeted sequencing, providing the most appropriate combination of design, implementation and analysis to deliver high-quality results, whatever your panel size or content. OGT’s Expert Bait Design workflow significantly improves the uniformity of target sequence enrichment (Figure 4), and thereby the sensitivity and accuracy of variant detection across the entire region.
Figure 4: OGT’s Expert Bait Design delivers improved efficiency and uniformity of target sequence capture. (A) A 25-gene custom panel was designed using standard software (COM1-8) and the OGT optimised design algorithm (OGT1-8). With OGT expert design, the percentage of bases sequenced to at least 0.2x mean target coverage (MTC) increased from 73% to >91%. (B) A further design optimisation subsequently improved this to >97.5%.
OGT’s comprehensive NGS services take you from project concept through to high-quality, user-filterable variant files. OGT has validated a number of analysis pipelines, tailored to specific applications, from standard germline variant detection to high sensitivity somatic variant detection in a heterogeneous background. OGT’s unique Genefficiency NGS Variant Analysis Report gives you the freedom to explore and retrospectively interrogate your own data with additional or new selection criteria, without the need for local bioinformatics resource (Figure 5).
Figure 5: The OGT NGS Variant Analysis Report. Data is delivered in an interactive HTML format that allows researchers to filter through all variants, prioritise according to user-defined criteria (e.g. SIFT and Polyphen scores, nonsynonymous mutations, allele frequencies etc) and then link out for more information to tools and databases such as IGV and Ensembl for additional context.
1. Aird, D. et al. (2011) Analyzing and minimizing PCR amplification bias in Illumina sequencing libraries. Genome Biology 12:R18. doi:10.1186/gb-2011-12-2-r18
2. Wagner, A. et al. (1994) Surveys of gene families using polymerase chain reaction: PCR selection and PCR drift. Syst Biol 43, 250–261
Genefficiency™ NGS browser: For Research Use Only; Not for Diagnostic Procedures.