Background
Accurate variant calling in high-throughput sequencing data is a critical step before any interpretation process. A wide variety of bioinformatic software and approaches exists for detecting genetic variants, along with different strategies for analysing clinical samples.
The quality of AION's results depends on the quality of its input. In this guideline we discuss the best practices for variant calling in clinical sequencing studies, with a particular emphasis on trio sequencing for inherited disorders. We do not aim to conclude which tools are best, but to identify the strategies that are recommended regardless of the bioinformatic tool selected.
AION takes as input Variant Call Format (VCF) files generated by different variant callers (e.g. by different customers or users), so we assume that these files were generated following these best practices. We also assume that the format is supported by AION (see more details here).
Best practices for germline variant calling
There are dozens of bioinformatic tools to detect SNVs and INDELs, and countless more have been developed by researchers for internal use. Some of them are specialised in Whole Exome Sequencing (WES), others in Whole Genome Sequencing (WGS), and many others in customised gene panels. Each variant caller is recommended for specific uses. Many benchmarking studies have been published (Krishnan et al 2021, Zhao et al 2020) comparing different variant callers on the gold-standard WGS trios available from the Genome In A Bottle (GIAB) consortium.
Individual versus joint variant calling
All variant callers can be applied to individual samples after alignment and preprocessing are complete. It should be noted that VCF files typically only contain entries for positions that differ from the reference in a particular sample. That is, when a variant is detected in some samples but not in others, it is not clear whether the other samples are wild type at that position (GT == 0/0) or simply did not achieve sufficient coverage or meet other quality control criteria for the variant caller to make a call.
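The ambiguity described above can be illustrated with a minimal Python sketch. It assumes genotypes are available as plain GT strings keyed by a position label; the function names and the `sample_calls` dictionary are hypothetical and are not part of AION or of any variant caller.

```python
from typing import Dict


def classify_genotype(gt: str) -> str:
    """Classify a VCF GT string such as '0/0', '0|1', './.' or '1/1'."""
    # Phased ('|') and unphased ('/') separators are treated alike here.
    alleles = gt.replace("|", "/").split("/")
    if any(allele == "." for allele in alleles):
        return "no_call"   # the caller could not genotype this position
    if all(allele == "0" for allele in alleles):
        return "hom_ref"   # confidently matches the reference
    if len(set(alleles)) == 1:
        return "hom_alt"
    return "het"


def genotype_status(sample_calls: Dict[str, str], position: str) -> str:
    """Return the status of one position in a single-sample call set."""
    gt = sample_calls.get(position)
    if gt is None:
        # No record at all: wild type or insufficient coverage, we cannot tell.
        return "absent_ambiguous"
    return classify_genotype(gt)


# Hypothetical single-sample call set with one recorded variant.
calls = {"chr1:1000": "0/1"}
print(genotype_status(calls, "chr1:1000"))  # het
print(genotype_status(calls, "chr1:2000"))  # absent_ambiguous
```

Only an explicit record with GT 0/0 or ./. separates the two cases; a missing record carries no information either way.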
Joint variant calling, in which all samples are considered simultaneously during variant calling, offers several key advantages:
- First, it produces called genotypes for every sample at all variant positions, not just the ones that were detected in a single individual. This makes it possible to differentiate between a position that matches the reference sequence with high probability and a position in which the sample did not achieve sufficient coverage.
- Second, in the case of trio sequencing, joint calling enables direct inference of phase information to establish, for example, whether two heterozygous variants in a proband are in cis or in trans.
- Third, it mitigates the issue of variant representation differences which might otherwise be problematic, particularly for complex variants.
- Finally, joint analysis allows a variant caller to use information from one sample to infer the most likely genotype in another, which has been shown to increase the sensitivity of variant calling in low-coverage regions.
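As an illustration of the phasing advantage listed above, the following Python sketch infers whether two heterozygous variants in a proband are in cis or in trans from the parents' jointly called genotypes. It is a simplified sketch under stated assumptions: the proband is heterozygous at both sites, each variant is represented as a hypothetical (mother GT, father GT) pair, and the function names are illustrative only. Note that it relies on the parents having explicit genotypes, including 0/0, at both sites, which is what joint calling provides.

```python
from typing import Optional, Tuple


def carries_alt(gt: str) -> Optional[bool]:
    """True if the genotype has a non-reference allele, None if not called."""
    alleles = gt.replace("|", "/").split("/")
    if "." in alleles:
        return None
    return any(allele != "0" for allele in alleles)


def parent_of_origin(mother_gt: str, father_gt: str) -> Optional[str]:
    """Assign a proband allele to one parent when only that parent carries it."""
    mother = carries_alt(mother_gt)
    father = carries_alt(father_gt)
    if mother is None or father is None:
        return None  # a missing parental genotype prevents inference
    if mother and not father:
        return "mother"
    if father and not mother:
        return "father"
    return None      # both or neither parent carries it (e.g. de novo)


def phase_pair(variant_a: Tuple[str, str], variant_b: Tuple[str, str]) -> str:
    """Each variant is a (mother GT, father GT) pair; proband is het at both."""
    origin_a = parent_of_origin(*variant_a)
    origin_b = parent_of_origin(*variant_b)
    if origin_a is None or origin_b is None:
        return "undetermined"
    # Same parent means same inherited haplotype (cis); otherwise trans.
    return "cis" if origin_a == origin_b else "trans"


print(phase_pair(("0/1", "0/0"), ("0/0", "0/1")))  # trans
print(phase_pair(("0/1", "0/0"), ("0/1", "0/0")))  # cis
```

When both parents carry a variant, or a parental genotype is missing, the sketch returns "undetermined" rather than guessing; resolving such cases would need read-backed or statistical phasing.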
CNV variant calling
Copy Number Variation (CNV) variant calling using short-read sequencing technology, particularly in targeted panels or Whole Exome Sequencing (WES), presents several challenges.
The primary difficulty arises from the limited read length, which can complicate the accurate detection of CNVs, especially those spanning repetitive regions or having complex breakpoints. Short reads often fail to map uniquely to the reference genome in these areas, leading to ambiguous alignments and potential misidentification of CNV boundaries. Additionally, the depth of coverage is variable, further complicating the differentiation between true CNVs and sequencing noise.
To enhance the quality of VCF files and reduce artifacts and false positives, several strategies can be employed:
- Ensure high and uniform coverage across the target regions.
- Use multiple CNV detection algorithms to cross-validate calls.
- Leverage a normalisation cohort from the same run, or prepared with the same library prep, to reduce common artifacts.
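The idea behind a normalisation cohort can be sketched in Python as follows. This is a minimal illustration, not the method of any specific CNV caller: it assumes per-target mean read depths are already available as lists, and the function name `log2_ratios` and the input layout are hypothetical.

```python
import math
from statistics import median
from typing import List, Optional


def _scale_by_median(depths: List[float]) -> List[float]:
    """Divide each target depth by the sample median to remove library size."""
    centre = median(depths)
    if centre <= 0:
        raise ValueError("median depth must be positive")
    return [depth / centre for depth in depths]


def log2_ratios(sample: List[float],
                cohort: List[List[float]]) -> List[Optional[float]]:
    """Per-target log2 ratio of one sample against the cohort median."""
    sample_scaled = _scale_by_median(sample)
    cohort_scaled = [_scale_by_median(depths) for depths in cohort]
    ratios: List[Optional[float]] = []
    for index, value in enumerate(sample_scaled):
        # The cohort median captures target-specific capture and GC biases.
        reference = median(scaled[index] for scaled in cohort_scaled)
        if value <= 0 or reference <= 0:
            ratios.append(None)  # no usable coverage at this target
        else:
            ratios.append(math.log2(value / reference))
    return ratios


# Four targets; the second one has half the expected depth in the sample.
sample_depths = [100.0, 50.0, 100.0, 100.0]
cohort_depths = [[100.0] * 4, [200.0] * 4, [50.0] * 4]
print(log2_ratios(sample_depths, cohort_depths))  # [0.0, -1.0, 0.0, 0.0]
```

In a diploid region, a log2 ratio near -1 is consistent with a heterozygous deletion and a ratio near +0.58 with a single-copy duplication. Because the cohort samples share the same run or library prep, systematic per-target biases tend to cancel out in the ratio.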
Ultimately, the quality of the CNV calls depends on the secondary analysis step. AION helps in interpreting CNVs, but its efficacy depends on the quality of that secondary analysis.
AION assumptions
AION's pipeline to annotate, classify, and prioritise variants from any VCF file was developed on the assumption that the VCF files follow bioinformatics best practices, which allows better sensitivity in variant detection.
References
[1] Koboldt 2020 (Genome Medicine)