FilterVcf.R fails with NAs in BQ field
Describe the issue
When NAs are found in the BQ field .readAndCheckVcf crashes when trying to log a warning. The following if-statement is the culprit, since n is never defined:
if (is.null(bq) || all(is.na(bq))) {
vcf <- .addBqField(vcf, error)
} else if (any(is.na(bq))) {
flog.warn("BQ FORMAT field contains NAs. Removing %i variants.", n - length(vcf))
}
Furthermore, when using an ensembled VCF (eg. from bcbio.variation.recall) the BQ of a variant can be found in different places depending on what caller called the variant, but .getBQFromVcfwill only use the first one it finds and consider the BQ of all other callers as NA. I suppose this could be "fixed" by changing the VCF such that all variants use the same field for BQ, but it would be nice if this was somehow handled internally.
To Reproduce
Rscript $PURECN/PureCN.R \
--sampleid=$SAMPLE \
--seg-file=$SEGMENTS \
--log-ratio-file=$LOGRATIO \
--mapping-bias-file=$MAPPINGBIAS \
--tumor=$TUMOR \
--vcf=$ENSEMBLED_VCF \
--genome=hg19
--out=$OUTPUT/$SAMPLE
Where $ENSEMBLED_VCF contains variants called by multiple callers (eg. freebayes and mutect2)
Expected behavior PureCN functions normally and uses variants from all callers.
Log file
INFO [2022-08-12 11:14:41] Loading PureCN 2.2.0...
INFO [2022-08-12 11:14:42] Reading cnv_sv/gatk_cnv_denoise_read_counts/C021_T.clean.denoisedCR.tsv...
INFO [2022-08-12 11:14:42] Reading cnv_sv/gatk_cnv_collect_read_counts/C021_T.counts.hdf5...
INFO [2022-08-12 11:15:02] ------------------------------------------------------------
INFO [2022-08-12 11:15:02] PureCN 2.2.0
INFO [2022-08-12 11:15:02] ------------------------------------------------------------
INFO [2022-08-12 11:15:02] Arguments: -normal.coverage.file -seg.file cnv_sv/gatk_cnv_model_segments/C021_T.clean.modelFinal.seg -vcf.file output/snv_indels/bcbio_variation_recall_ensemble/C021_T.ensembled.vcf.gz -normalDB -genome hg19 -sex ? -args.setPriorVcf 6 -args.setMappingBiasVcf /home/xdemer/data/normaldb/purecn_normaldb/mapping_bias_twist_hg19.rds -args.filterIntervals 100,0.05 -args.segmentation 0.005,NULL, -sampleid C021_T -min.ploidy 1.4 -max.ploidy 6 -max.non.clonal 0.2 -max.homozygous.loss 0.05,1e+07 -log.ratio.calibration 0.1 -model.homozygous FALSE -error 0.001 -interval.file -min.logr.sdev 0.15 -max.segments 300 -plot.cnv TRUE -vcf.field.prefix PureCN. -cosmic.vcf.file -DB.info.flag DB -POPAF.info.field POP_AF -Cosmic.CNT.info.field Cosmic.CNT -model beta -post.optimize TRUE -BPPARAM -log.file cnv_sv/purecn/C021_T.log -tumor.coverage.file <data> -log.ratio <data> -args.filterVcf <data> -fun.segmentation <data> -test.num.copy <data> -test.purity <data> -speedup.heuristics <data>
INFO [2022-08-12 11:15:02] Loading coverage files...
WARN [2022-08-12 11:15:02] Allosome coverage missing, cannot determine sex.
WARN [2022-08-12 11:15:02] Allosome coverage missing, cannot determine sex.
INFO [2022-08-12 11:15:03] Using 10958 intervals (10958 on-target, 0 off-target).
INFO [2022-08-12 11:15:03] No off-target intervals. If this is hybrid-capture data, consider adding them.
INFO [2022-08-12 11:15:03] Loading VCF...
INFO [2022-08-12 11:15:08] Found 43585 variants in VCF file.
INFO [2022-08-12 11:15:08] Maximum of POPAP INFO is > 1, assuming -log10 scaled values
WARN [2022-08-12 11:15:08] vcf.file has no DB info field for membership in germline databases. Found and used valid population allele frequency > 0.001000 instead.
WARN [2022-08-12 11:15:09] DB INFO flag contains NAs
INFO [2022-08-12 11:15:09] 722 (1.7%) variants annotated as likely germline (DB INFO flag).
Error in layout(level, msg, ...) : object 'n' not found
Calls: runAbsoluteCN ... .readAndCheckVcf -> flog.warn -> .log_level -> layout
In addition: Warning message:
In for (i in seq_len(n)) { :
closing unused connection 3 (cnv_sv/gatk_cnv_denoise_read_counts/C021_T.clean.denoisedCR.tsv)
Execution halted
Thanks Erik, will take care of this soon!
I think I fixed the crash in devel. Supporting multiple BQ fields is probably not something I will add anytime soon. The VCF standardization code is already quite a mess supporting all the different VCF formats. BQ is not super critical information for the likelihood model, especially if you filter by it upstream.