The SNV statistics does not match with bcftools stats
I see that maftools is inaccurately estimating the number of T>C and C>T and other statistics in SNV Class plot. I am also attaching the VCF file and MAF file for your reference.
Command Please post your commands and the output (errors or any unexpected output) BCFtools output:
bcftools stats temp-snpeff.vcf
[W::bcf_hrec_check] Invalid tag name: "B)QB"
# This file was produced by bcftools stats (1.17+htslib-1.17) and can be plotted using plot-vcfstats.
# The command line was: bcftools stats temp-snpeff.vcf
#
# Definition of sets:
# ID [2]id [3]tab-separated file names
ID 0 temp-snpeff.vcf
# SN, Summary numbers:
# number of records .. number of data rows in the VCF
# number of no-ALTs .. reference-only sites, ALT is either "." or identical to REF
# number of SNPs .. number of rows with a SNP
# number of MNPs .. number of rows with a MNP, such as CC>TT
# number of indels .. number of rows with an indel
# number of others .. number of rows with other type, for example a symbolic allele or
# a complex substitution, such as ACT>TCGA
# number of multiallelic sites .. number of rows with multiple alternate alleles
# number of multiallelic SNP sites .. number of rows with multiple alternate alleles, all SNPs
#
# Note that rows containing multiple types will be counted multiple times, in each
# counter. For example, a row with a SNP and an indel increments both the SNP and
# the indel counter.
#
# SN [2]id [3]key [4]value
SN 0 number of samples: 1
SN 0 number of records: 736
SN 0 number of no-ALTs: 0
SN 0 number of SNPs: 690
SN 0 number of MNPs: 46
SN 0 number of indels: 0
SN 0 number of others: 0
SN 0 number of multiallelic sites: 0
SN 0 number of multiallelic SNP sites: 0
# TSTV, transitions/transversions:
# TSTV [2]id [3]ts [4]tv [5]ts/tv [6]ts (1st ALT) [7]tv (1st ALT) [8]ts/tv (1st ALT)
TSTV 0 610 80 7.62 610 80 7.62
# SiS, Singleton stats:
# SiS [2]id [3]allele count [4]number of SNPs [5]number of transitions [6]number of transversions [7]number of indels [8]repeat-consistent [9]repeat-inconsistent [10]not applicable
SiS 0 1 690 610 80 0 0 0 0
# AF, Stats by non-reference allele frequency:
# AF [2]id [3]allele frequency [4]number of SNPs [5]number of transitions [6]number of transversions [7]number of indels [8]repeat-consistent [9]repeat-inconsistent [10]not applicable
AF 0 0.000000 690 610 80 0 0 0 0
# QUAL, Stats by quality
# QUAL [2]id [3]Quality [4]number of SNPs [5]number of transitions (1st ALT) [6]number of transversions (1st ALT) [7]number of indels
QUAL 0 0.0 690 610 80 0
# IDD, InDel distribution:
# IDD [2]id [3]length (deletions negative) [4]number of sites [5]number of genotypes [6]mean VAF
# ST, Substitution types:
# ST [2]id [3]type [4]count
ST 0 A>C 12
ST 0 A>G 132
ST 0 A>T 14
ST 0 C>A 14
ST 0 C>G 7
ST 0 C>T 170
ST 0 G>A 135
ST 0 G>C 4
ST 0 G>T 8
ST 0 T>A 17
ST 0 T>C 173
ST 0 T>G 4
# DP, Depth distribution
# DP [2]id [3]bin [4]number of genotypes [5]fraction of genotypes (%) [6]number of sites [7]fraction of sites (%)
But the maftools plotmafSummary function says that T>C and C>T mutations are 305 in number but they are not (I also subset MAF file manually and check the number of rows). Now of variants in the samples are also reported less.
library(maftools)
mafs = list.files(path = ".", pattern = "*.maf$", full.names = TRUE)
maf_dat = lapply(mafs, data.table::fread, sep = "\t", skip = "Hugo_Symbol")
names(maf_dat) = unlist(data.table::tstrsplit(basename(path = mafs), spli = "\\.", keep = 1))
maf = data.table::rbindlist(l = maf_dat, use.names = TRUE, fill = TRUE, idcol = "maf_file")
maf = maftools::read.maf(maf, removeDuplicatedVariants = FALSE)
plotmafSummary(maf = maf, rmOutlier = FALSE, addStat = 'median', dashboard = TRUE, titvRaw = FALSE)
Session info
Run sessionInfo() and post the output below
> sessionInfo()
R version 4.5.0 (2025-04-11)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 22.04.5 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: Asia/Riyadh
tzcode source: system (glibc)
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] maftools_2.24.0 dplyr_1.1.4 readr_2.1.5
loaded via a namespace (and not attached):
[1] tidyr_1.3.1 generics_0.1.3 lattice_0.22-5 stringi_1.8.7 hms_1.1.3 digest_0.6.37
[7] magrittr_2.0.3 grid_4.5.0 RColorBrewer_1.1-3 pkgload_1.4.0 fastmap_1.2.0 Matrix_1.7-3
[13] plyr_1.8.9 cellranger_1.1.0 processx_3.8.6 pkgbuild_1.4.7 sessioninfo_1.2.3 survival_3.8-3
[19] urlchecker_1.0.1 ps_1.9.1 promises_1.3.2 BiocManager_1.30.25 purrr_1.0.4 cli_3.6.4
[25] shiny_1.10.0 rlang_1.1.6 crayon_1.5.3 splines_4.5.0 ellipsis_0.3.2 bit64_4.6.0-1
[31] remotes_2.5.0 withr_3.0.2 cachem_1.1.0 DNAcopy_1.82.0 devtools_2.4.5 tools_4.5.0
[37] parallel_4.5.0 tzdb_0.5.0 memoise_2.0.1 httpuv_1.6.16 BiocGenerics_0.54.0 curl_6.2.2
[43] vctrs_0.6.5 R6_2.6.1 mime_0.13 stats4_4.5.0 lifecycle_1.0.4 stringr_1.5.1
[49] S4Vectors_0.46.0 fs_1.6.6 htmlwidgets_1.6.4 bit_4.6.0 usethis_3.1.0 vroom_1.6.5
[55] miniUI_0.1.2 pkgconfig_2.0.3 desc_1.4.3 callr_3.7.6 pillar_1.10.2 later_1.4.2
[61] data.table_1.17.0 glue_1.8.0 profvis_0.4.0 Rcpp_1.0.14 tibble_3.2.1 tidyselect_1.2.1
[67] rstudioapi_0.17.1 xtable_1.8-4 htmltools_0.5.8.1 mgsub_1.7.3 compiler_4.5.0 readxl_1.4.5
@hpages any input would be appreciated.
Hi,
Sorry for the delay.
-
maftools classifies single-nucleotide variants into Transitions and Transversions It's not the same as how bcftools reports for individual conversions.
-
Please note the section for Transitions and Transversions in the bcftools output
# TSTV, transitions/transversions:
# TSTV [2]id [3]ts [4]tv [5]ts/tv [6]ts (1st ALT) [7]tv (1st ALT) [8]ts/tv (1st ALT)
TSTV 0 610 80 7.62 610 80 7.62
These numbers do match what you see in maftools output (T>C + C>T = 610 transitions), and the rest transversion events add up to 80.
I hope this helps.
@PoisonAlien But the third graph (SNV Class) represents only individual substitutions correct? It appears that MAFtools sums up all A>G and T>C i.e. 132+173=305 and plot it as T>C and similarly it sums up G>A and C>t and plot it as C>T. Correct? And I a guessing that it was done to reduce the plot size? If that's the case maybe it's worth mentioning in the documentation since it got me puzzled up.
This issue is stale because it has been open for 60 days with no activity.
This issue was closed because it has been inactive for 14 days since being marked as stale.