VariantAnnotation icon indicating copy to clipboard operation
VariantAnnotation copied to clipboard

readVcf is Slow if ScanVcfParam which Regions is Lengthy

Open DarioS opened this issue 1 year ago • 3 comments

Recently, U.S. Food and Drug Administration, National Institute of Standards and Technology and Illumina researchers defined highly reproducible regions (H.R.R.s) of the human genome and made available BED files to define a set of regions in which variant callers consistently call variants on technical replicate samples, effectively defining a while-list for whole genome sequencing data. readVcf takes a long time if which of ScanVcfParam is specified. Importing the V.C.F. takes about five minutes if which not specified but I terminated it after one hour when which was specified.

library(rtracklayer)
goldStandards <- list.files("HRR/", "bed", full.names = TRUE)
goldStandards <- lapply(goldStandards, import.bed)
goldStandards <- unlist(goldStandards)
goldStandards <- reduce(goldStandards)
> goldStandards
GRanges object with 2988875 ranges and 0 metadata columns:
    ...        ...
variants <- readVcf("DRAGENgermline.vcf.gz", param = ScanVcfParam(which = goldStandards)) # Stopped after one hour.

Importing the whole V.C.F. file into the session and then using subsetByOverlaps seems a reasonable and fast workaround. To best help other users, would a documentation change or code change be better to make this user experience nicer?

DarioS avatar Mar 27 '23 10:03 DarioS

Interesting observation @DarioS . This topic (using the HRR resource) could merit a workflow package, which could then be referenced by VariantAnnotation documentation. Would you be up for this? A public VCF could be used to demonstrate the approach. Or a PR to the VariantAnnotation doc could be a rapid resolution of your concern. Did you consider breaking up goldStandards and performing the read on sorted, modest-sized, single-chromosome chunks?

vjcitn avatar Mar 27 '23 10:03 vjcitn

Does the filterVcf vignette address this use case?

mtmorgan avatar Mar 27 '23 20:03 mtmorgan

Indeed, Section 5.2 of filterVcf cignette recommends readVcf followed by findOverlaps. But Section 7 Appendix seems to advocate for using ScanVcfParam(which = theRegions). It's unclear which approach is better and for which scenarios.

DarioS avatar Mar 28 '23 02:03 DarioS