During secondary analysis, multiple QC tools are used to assess the quality of both the raw data and the secondary analysis process. The metrics from these tools are collected in a MultiQC (1) HTML report, which is available for download from the case card in the secondary analysis user interface once the case has finished processing. The report is organised into the following sections:
General statistics
Raw sequence data quality
To assess the quality of the raw sequence data, we use FastQC (2), a widely used quality control tool for evaluating raw read data from NGS experiments. It produces a comprehensive report on the quality of FASTQ files, with visualisations and metrics that help identify potential issues in the sequencing results before downstream analysis begins.
Raw reads quality data (FastQC) metrics:
- Sequence Counts
- Sequence Quality Histograms
- Per Sequence Quality Scores
- Per Base Sequence Content
- Per Sequence GC Content
- Per Base N Content
- Sequence Length Distribution
- Sequence Duplication Levels
- Overrepresented sequences by sample
- Top overrepresented sequences
- Adapter Content
- Status Checks
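To illustrate the kind of quantity behind the Sequence Quality Histograms, the sketch below computes the mean Phred quality at each read position from FASTQ text. It is a simplified example and not FastQC's own implementation: the function name `mean_quality_per_position`, the two inline example reads, and the assumptions of four-line FASTQ records with Phred+33 encoding are all illustrative.

```python
from io import StringIO


def mean_quality_per_position(fastq_handle, phred_offset=33):
    """Return the mean Phred quality score at each read position.

    Assumes plain four-line FASTQ records (no wrapped sequence lines)
    and Phred+33 quality encoding unless another offset is given.
    """
    totals = []  # sum of quality scores seen at each position
    counts = []  # number of reads that reach each position
    for line_number, line in enumerate(fastq_handle):
        if line_number % 4 != 3:
            continue  # only the 4th line of each record holds qualities
        quality_string = line.rstrip("\n")
        for position, char in enumerate(quality_string):
            if position == len(totals):
                totals.append(0)
                counts.append(0)
            totals[position] += ord(char) - phred_offset
            counts[position] += 1
    return [total / count for total, count in zip(totals, counts)]


# Hypothetical reads: 'I' encodes Q40 and '!' encodes Q0 in Phred+33.
example_fastq = StringIO(
    "@read1\nACGT\n+\nIIII\n"
    "@read2\nACG\n+\n!!I\n"
)
per_position = mean_quality_per_position(example_fastq)
print(per_position)
```

Positions covered by fewer reads are averaged over only the reads that reach them, which is why reads of different lengths can be mixed in this sketch.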
Alignment quality (Sentieon / Qualimap)
To assess the quality of the alignment and the resulting BAM files, the report includes metrics from both Sentieon (3), the alignment tool, and Qualimap (4), a dedicated BAM QC tool. This section of the QC report lets the user review basic statistics on read mapping and coverage, as well as GC content, read duplication and more.
Alignment quality data (Qualimap BamQC) metrics:
- Coverage histogram
- Cumulative genome coverage
- Insert size histogram
- GC content distribution
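The relationship between a coverage histogram and cumulative genome coverage can be sketched as follows. This is a minimal illustration, not Qualimap's implementation: the function name `cumulative_coverage` and the example histogram values are hypothetical, and the input is assumed to map each depth to the number of bases observed at exactly that depth.

```python
def cumulative_coverage(depth_histogram):
    """Map each depth d to the fraction of bases covered at depth >= d.

    depth_histogram: dict of {depth: number of bases at exactly that depth}.
    """
    total_bases = sum(depth_histogram.values())
    fractions = {}
    bases_at_or_above = 0
    # Walk from the deepest bin down, accumulating the base count.
    for depth in sorted(depth_histogram, reverse=True):
        bases_at_or_above += depth_histogram[depth]
        fractions[depth] = bases_at_or_above / total_bases
    # Return the result ordered by increasing depth for readability.
    return dict(sorted(fractions.items()))


# Hypothetical histogram: 10 bases at 0x, 20 at 10x, 50 at 20x, 20 at 30x.
histogram = {0: 10, 10: 20, 20: 50, 30: 20}
fractions = cumulative_coverage(histogram)
print(fractions)
```

Read this way, the cumulative curve answers questions such as "what fraction of the genome is covered at 20x or more", which the raw histogram does not show directly.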
Alignment quality data (Sentieon):
- Alignment Summary
- Mean read length
- GC Coverage Bias
- Mark Duplicates
- Mean Base Quality by Cycle
- Base Quality Distribution
VCF quality
Bcftools is used to assess the quality of the variants. Bcftools (5) is a toolset for variant calling and for manipulating Variant Call Format (VCF) and Binary Call Format (BCF) files. These formats store information about variants found in genomic data, including single nucleotide polymorphisms (SNPs) and insertions/deletions (indels). Bcftools can filter variants on quality metrics, merge multiple VCF or BCF files, and generate summary statistics describing the quality and distribution of variants across samples. With Bcftools, researchers can manage and analyse their variant data so that only high-quality variants are considered in downstream analyses.
Small variants quality and stats (Bcftools stats):
- Variant Substitution Types
- Variant Quality
- Indel Distribution
- Variant depths
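One common summary derived from substitution-type counts is the transition/transversion (Ts/Tv) ratio. The sketch below shows how it could be computed from counts keyed as REF>ALT. The function name `ts_tv_ratio` and the example counts are hypothetical, and the code is an illustration rather than the way Bcftools or the report computes the value.

```python
# Transitions swap purine for purine (A<->G) or pyrimidine for pyrimidine (C<->T).
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def ts_tv_ratio(substitution_counts):
    """Return transitions / transversions, or None if there are no transversions.

    substitution_counts: dict of {"REF>ALT": count}, for example {"A>G": 30}.
    """
    transitions = 0
    transversions = 0
    for substitution, count in substitution_counts.items():
        ref, alt = substitution.split(">")
        if (ref, alt) in TRANSITIONS:
            transitions += count
        else:
            transversions += count
    if transversions == 0:
        return None  # ratio is undefined without transversions
    return transitions / transversions


# Hypothetical counts: 60 transitions (A>G, C>T) and 30 transversions (A>C, G>T).
counts = {"A>G": 30, "C>T": 30, "A>C": 10, "G>T": 20}
ratio = ts_tv_ratio(counts)
print(ratio)
```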
The report also includes a section on CNV quality and statistics, which summarises the main CNV metrics by variant type and size.
Exporting Data
Data from the quality report can be exported either as a full report or as partial downloads of selected statistics. The user can choose which information/tracks to download and in which format (plots or tables).
QC Report Examples
References:
1) Ewels, P., Magnusson, M., Lundin, S., & Käller, M. (2016). MultiQC: Summarize analysis results for multiple tools and samples in a single report. Bioinformatics, 32(19), 3047-3048. https://doi.org/10.1093/bioinformatics/btw354
2) Andrews, S. (2010). FastQC: A quality control tool for high throughput sequence data. Available online at: http://www.bioinformatics.babraham.ac.uk/projects/fastqc (Accessed: 2024/06/25).
3) Kendig, K. I., Baheti, S., Bockol, M. A., Drucker, T. M., Hart, S. N., Heldenbrand, J. R., Hernaez, M., Hudson, M. E., Kalmbach, M. T., Klee, E. W., Mattson, N. R., Ross, C. A., Taschuk, M., Wieben, E. D., Wiepert, M., Wildman, D. E., & Mainzer, L. S. (2019). Sentieon DNASeq variant calling workflow demonstrates strong computational performance and accuracy. Frontiers in Genetics, 10, 736. https://doi.org/10.3389/fgene.2019.00736
4) García-Alcalde, F., Okonechnikov, K., Carbonell, J., Cruz, L. M., Götz, S., Tarazona, S., Dopazo, J., Meyer, T. F., & Conesa, A. (2012). Qualimap: evaluating next-generation sequencing alignment data. Bioinformatics, 28(20), 2678-2679. https://doi.org/10.1093/bioinformatics/bts503
5) Danecek, P., Bonfield, J. K., Liddle, J., Marshall, J., Ohan, V., Pollard, M. O., Whitwham, A., Keane, T., McCarthy, S. A., & Davies, R. M. (2021). Twelve years of SAMtools and BCFtools. GigaScience, 10(2), giab008. https://doi.org/10.1093/gigascience/giab008