VariantAnnotation
VariantAnnotation copied to clipboard
readVcf is Slow if ScanVcfParam which Regions is Lengthy
Recently, U.S. Food and Drug Administration, National Institute of Standards and Technology and Illumina researchers defined highly reproducible regions (H.R.R.s) of the human genome and made available BED files to define a set of regions in which variant callers consistently call variants on technical replicate samples, effectively defining a while-list for whole genome sequencing data. readVcf
takes a long time if which
of ScanVcfParam
is specified. Importing the V.C.F. takes about five minutes if which
not specified but I terminated it after one hour when which
was specified.
library(rtracklayer)
goldStandards <- list.files("HRR/", "bed", full.names = TRUE)
goldStandards <- lapply(goldStandards, import.bed)
goldStandards <- unlist(goldStandards)
goldStandards <- reduce(goldStandards)
> goldStandards
GRanges object with 2988875 ranges and 0 metadata columns:
... ...
variants <- readVcf("DRAGENgermline.vcf.gz", param = ScanVcfParam(which = goldStandards)) # Stopped after one hour.
Importing the whole V.C.F. file into the session and then using subsetByOverlaps
seems a reasonable and fast workaround. To best help other users, would a documentation change or code change be better to make this user experience nicer?
Interesting observation @DarioS . This topic (using the HRR resource) could merit a workflow package, which could then be referenced by VariantAnnotation documentation. Would you be up for this? A public VCF could be used to demonstrate the approach. Or a PR to the VariantAnnotation doc could be a rapid resolution of your concern. Did you consider breaking up goldStandards and performing the read on sorted, modest-sized, single-chromosome chunks?
Does the filterVcf vignette address this use case?
Indeed, Section 5.2 of filterVcf cignette recommends readVcf
followed by findOverlaps
. But Section 7 Appendix seems to advocate for using ScanVcfParam(which = theRegions)
. It's unclear which approach is better and for which scenarios.