VariantAnnotation icon indicating copy to clipboard operation
VariantAnnotation copied to clipboard

`readVcf` doesn't handle quoted `=` in `hapmap_exome_chr22.vcf.gz` correctly

Open LTLA opened this issue 1 year ago • 8 comments

The header line for GATKCommandLine causes problems:

##GATKCommandLine=<ID=ApplyRecalibration,Version=3.3-0-g37228af,Date="Thu May 07 02:55:12 EDT 2015",Epoch=1430981712637,CommandLineOptions="analysis_type=ApplyRecalibration input_file=[] showFullBamList=false read_buffer_size=null phone_home
=NO_ET gatk_key=/isilon/sequencing/CIDRSeqSuiteSoftware/gatk/GATK_2/lee.watkins_jhmi.edu.key tag=NA read_filter=[] intervals=null excludeIntervals=null interval_set_rule=UNION interval_merging=ALL interval_padding=0 reference_sequence=/isilo
n/sequencing/GATK_resource_bundle/1.5/b37/human_g1k_v37_decoy.fasta nonDeterministicRandomSeed=false disableDithering=false maxRuntime=-1 maxRuntimeUnits=MINUTES downsampling_type=BY_SAMPLE downsample_to_fraction=null downsample_to_coverage=
1000 baq=OFF baqGapOpenPenalty=40.0 refactor_NDN_cigar_string=false fix_misencoded_quality_scores=false allow_potentially_misencoded_quality_scores=false useOriginalQualities=false defaultBaseQualities=-1 performanceLog=null BQSR=null quanti
ze_quals=0 disable_indel_quals=false emit_original_quals=false preserve_qscores_less_than=6 globalQScorePrior=-1.0 validation_strictness=SILENT remove_program_records=false keep_program_records=false sample_rename_mapping_file=null unsafe=nu
ll disable_auto_index_creation_and_locking_when_reading_rods=true no_cmdline_in_header=false sites_only=false never_trim_vcf_format_field=false bcf=false bam_compression=null simplifyBAM=false disable_bam_indexing=false generate_md5=false nu
m_threads=1 num_cpu_threads_per_data_thread=1 num_io_threads=0 monitorThreadEfficiency=false num_bam_file_handles=null read_group_black_list=null pedigree=[] pedigreeString=[] pedigreeValidationType=STRICT allow_intervals_with_unindexed_bam=
false generateShadowBCF=false variant_index_type=DYNAMIC_SEEK variant_index_parameter=-1 logging_level=INFO log_to_file=null help=false version=false input=[(RodBinding name=input source=/isilon/sequencing/Seq_Proj//Amos_LungCa_SeqWholeExome
Plus_040413_1/MULTI_SAMPLE/Amos_LungCa_SeqWholeExomePlus_040413_1.GATK-3.3-0.2050samples.raw.HC.INDEL.vcf)] recal_file=(RodBinding name=recal_file source=/isilon/sequencing/Seq_Proj//Amos_LungCa_SeqWholeExomePlus_040413_1/MULTI_SAMPLE/Amos_L
ungCa_SeqWholeExomePlus_040413_1.GATK-3.3-0.2050samples.HC.INDEL.recal) tranches_file=/isilon/sequencing/Seq_Proj/Amos_LungCa_SeqWholeExomePlus_040413_1/MULTI_SAMPLE/Amos_LungCa_SeqWholeExomePlus_040413_1.GATK-3.3-0.2050samples.HC.INDEL.tran
ches out=org.broadinstitute.gatk.engine.io.stubs.VariantContextWriterStub ts_filter_level=99.0 lodCutoff=null ignore_filter=null ignore_all_filters=false excludeFiltered=false mode=INDEL filter_reads_with_N_cigar=false filter_mismatching_bas
e_and_quals=false filter_bases_not_stored=false">

Calling readVcf gives me:

fl <- system.file("extdata", "hapmap_exome_chr22.vcf.gz", package="VariantAnnotation")
library(VariantAnnotation)
first <- readVcf(fl)
meta(header(first))$GATKCommandLine$CommandLineOptions
## [1] "\"analysis_type"

Oops, look like it just cut off after the = inside the quote. This causes no end of problems during a write round-trip:

writeVcf(first, out)
roundtrip <- readVcf(out, row.names=FALSE)
## [W::bcf_hdr_parse_line] Incomplete header line, trying to proceed anyway:
## 	[##GATKCommandLine=<ID=ApplyRecalibration,Version=3.3-0-g37228af,Date="Thu May 07 02:55:12 EDT 2015",Epoch=1430981712637,CommandLineOptions="analysis_type>
Session information
R Under development (unstable) (2023-11-29 r85646)
Platform: aarch64-apple-darwin22.5.0
Running under: macOS Ventura 13.6.1

Matrix products: default
BLAS:   /Users/luna/Software/R/trunk/lib/libRblas.dylib
LAPACK: /Users/luna/Software/R/trunk/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
 [1] VariantAnnotation_1.49.2    Rsamtools_2.19.2
 [3] Biostrings_2.71.1           XVector_0.43.0
 [5] SummarizedExperiment_1.33.1 Biobase_2.63.0
 [7] GenomicRanges_1.55.1        GenomeInfoDb_1.39.5
 [9] IRanges_2.37.0              S4Vectors_0.41.3
[11] MatrixGenerics_1.15.0       matrixStats_1.2.0
[13] BiocGenerics_0.49.1

loaded via a namespace (and not attached):
 [1] KEGGREST_1.43.0          rjson_0.2.21             lattice_0.22-5
 [4] vctrs_0.6.5              tools_4.4.0              bitops_1.0-7
 [7] generics_0.1.3           curl_5.2.0               parallel_4.4.0
[10] tibble_3.2.1             fansi_1.0.6              AnnotationDbi_1.65.2
[13] RSQLite_2.3.4            blob_1.2.4               pkgconfig_2.0.3
[16] Matrix_1.6-4             BSgenome_1.71.1          dbplyr_2.4.0
[19] lifecycle_1.0.4          GenomeInfoDbData_1.2.11  compiler_4.4.0
[22] stringr_1.5.1            progress_1.2.3           codetools_0.2-19
[25] yaml_2.3.8               RCurl_1.98-1.13          pillar_1.9.0
[28] crayon_1.5.2             BiocParallel_1.37.0      DelayedArray_0.29.0
[31] cachem_1.0.8             abind_1.4-5              tidyselect_1.2.0
[34] digest_0.6.33            stringi_1.8.3            restfulr_0.0.15
[37] dplyr_1.1.4              biomaRt_2.59.0           fastmap_1.1.1
[40] grid_4.4.0               cli_3.6.2                SparseArray_1.3.2
[43] magrittr_2.0.3           S4Arrays_1.3.1           GenomicFeatures_1.55.1
[46] XML_3.99-0.16            utf8_1.2.4               rappdirs_0.3.3
[49] filelock_1.0.3           prettyunits_1.2.0        bit64_4.0.5
[52] httr_1.4.7               bit_4.0.5                png_0.1-8
[55] hms_1.1.3                memoise_2.0.1            BiocIO_1.13.0
[58] BiocFileCache_2.11.1     rtracklayer_1.63.0       rlang_1.1.2
[61] glue_1.6.2               DBI_1.2.0                xml2_1.3.6
[64] R6_2.5.1                 GenomicAlignments_1.39.0 zlibbioc_1.49.0

LTLA avatar Jan 02 '24 21:01 LTLA

Not sure if this helps. If we use TabixFile(fl) for readVcf, the round trip will work. Is the embedded "=" possible in a valid VCF?

vjcitn avatar Jan 02 '24 21:01 vjcitn

The problem seems to be in Rsamtools:::.bcfHeaderAsSimpleList. There is a regular expression check that excludes a header line containing a period: header line beginning GATKCommandLine.SelectVariants is excluded, IMHO erroneously. Pressing on...

vjcitn avatar Jan 04 '24 11:01 vjcitn

Somewhere in here, in the !idx lines, there is mishandling of embedded '='

    rex <- "^([[:alnum:]]+)=<(.*)>"
    lines <- grep(rex, h, value = TRUE)
    if (length(lines) > 0) {
        if (any(duplicated(lines))) {
            warning("duplicate INFO and FORMAT header lines will be ignored")
            lines <- lines[duplicated(lines) == FALSE]
        }
    }
    tags <- sub(rex, "\\1", lines)
    keyval0 <- sub(rex, "\\2", lines)
    keyval1 <- rep(NA_character_, length(keyval0))
    keyval <- list()
    idx <- tags %in% c("INFO", "FORMAT", "FILTER", "ALT")
    keyval1[idx] <- strsplit(keyval0[idx], ",(?=(ID|Number|Type)=[[:alnum:]]*)|,(?=Description=\".*?\")", 
        perl = TRUE)
    keyval[idx] <- lapply(which(idx), function(i, keyval1) strsplit(keyval1[[i]], 
        "(?<=[ID|Number|Type|Description])=", perl = TRUE), keyval1)
    keyval1[!idx] <- strsplit(keyval0[!idx], ",(?=[[:alnum:]]+=)", 
        perl = TRUE)
    keyval[!idx] <- lapply(which(!idx), function(i, keyval1) {
        strsplit(keyval1[[i]], "(?<=[[:alnum:]])=", perl = TRUE)
    }, keyval1)

vjcitn avatar Jan 04 '24 11:01 vjcitn

The issue comes down to parsing this record (or any like it) into a nice DataFrame ... the code didn't anticipate a record of this type and it clearly fails to process it correctly, truncating the info after analysis_type. I do not have time right now to fix it.

##GATKCommandLine.SelectVariants=<ID=SelectVariants,Version=3.4-46-gbc02625,Date="Wed Sep 23 12:04:45 GMT-08:00 2015",Epoch=1443038685033,CommandLineOptions="analysis_type=SelectVariants input_file=[] showFullBamList=false read_buffer_size=null phone_home=AWS gatk_key=null tag=NA read_filter=[] disable_read_filter=[] intervals=null excludeIntervals=null interval_set_rule=UNION interval_merging=ALL interval_padding=0 reference_sequence=/projects/cidr/Amos/amos_cidr/Analysis_Pipeline_Files/human_g1k_v37_decoy.fasta nonDeterministicRandomSeed=false disableDithering=false maxRuntime=-1 maxRuntimeUnits=MINUTES downsampling_type=BY_SAMPLE downsample_to_fraction=null downsample_to_coverage=1000 baq=OFF baqGapOpenPenalty=40.0 refactor_NDN_cigar_string=false fix_misencoded_quality_scores=false allow_potentially_misencoded_quality_scores=false useOriginalQualities=false defaultBaseQualities=-1 performanceLog=null BQSR=null quantize_quals=0 disable_indel_quals=false emit_original_quals=false preserve_qscores_less_than=6 globalQScorePrior=-1.0 validation_strictness=SILENT remove_program_records=false keep_program_records=false sample_rename_mapping_file=null unsafe=null disable_auto_index_creation_and_locking_when_reading_rods=false no_cmdline_in_header=false sites_only=false never_trim_vcf_format_field=false bcf=false bam_compression=null simplifyBAM=false disable_bam_indexing=false generate_md5=false num_threads=1 num_cpu_threads_per_data_thread=1 num_io_threads=0 monitorThreadEfficiency=false num_bam_file_handles=null read_group_black_list=null pedigree=[] pedigreeString=[] pedigreeValidationType=STRICT allow_intervals_with_unindexed_bam=false generateShadowBCF=false variant_index_type=DYNAMIC_SEEK variant_index_parameter=-1 logging_level=INFO log_to_file=null help=false version=false variant=(RodBinding name=variant source=/projects/cidr/Amos/amos_cidr/MultiSampleVCF/withGenotypeRefinement_Redone/Amos_LungCa_SeqWholeExomePlus_040413_1.GATK-3.3-0.2047samples.BEDsuperset.VQSR.1KG.ExAC3.REFINED.06192015.vcf.gz) discordance=(RodBinding name= source=UNBOUND) concordance=(RodBinding name= source=UNBOUND) out=/projects/geneva/gcc-fs2/stephanie/Code/Bioconductor/amos_example_data/hapmap_exome_chr22.vcf.gz sample_name=[] sample_expressions=null sample_file=[amos_hapmaps.txt] exclude_sample_name=[] exclude_sample_file=[] exclude_sample_expressions=[] selectexpressions=[] invertselect=false excludeNonVariants=true excludeFiltered=false preserveAlleles=true removeUnusedAlternates=false restrictAllelesTo=ALL keepOriginalAC=false keepOriginalDP=false mendelianViolation=false invertMendelianViolation=false mendelianViolationQualThreshold=0.0 select_random_fraction=0.0 remove_fraction_genotypes=0.0 selectTypeToInclude=[] selectTypeToExclude=[] keepIDs=amos_chr22_subset.txt excludeIDs=null fullyDecode=false justRead=false maxIndelSize=2147483647 minIndelSize=0 maxFilteredGenotypes=2147483647 minFilteredGenotypes=0 maxFractionFilteredGenotypes=1.0 minFractionFilteredGenotypes=0.0 setFilteredGtToNocall=false ALLOW_NONOVERLAPPING_COMMAND_LINE_SAMPLES=false filter_reads_with_N_cigar=false filter_mismatching_base_and_quals=false filter_bases_not_stored=false">

vjcitn avatar Jan 04 '24 18:01 vjcitn

I consulted the standard, section 1.4, where it says

The values of optional fields must be written as quoted strings, even for numeric values.

I wrote three test cases that seem to be consistent with the standard

h = c(
    'key1=<ID=foo1,clo="a=1,i=[],s=false",h=baz>',
    'key2=<ID=foo2,clo="a=1 i=[] s=false",h=baz>',
    'key3=<ID=foo3, date="Thu May 07", clo="a=1,i=[],s=false",h=baz>'
)

the tag / key-value pairs are extracted like in the code @vjcitn notes

rex <- "^([[:alnum:]]+)=<(.*)>"
tags <- sub(rex, "\\1", h)
keyval0 <- sub(rex, "\\2", h)

Within keyval0, it seems we want to find a 'tag' [[:alnum:]]+, an equal sign =, and EITHER an alphanumeric value [[:alnum:]]+ OR a quoted (with double quotes?) string "[^"]+". So extract tags and values

rex0 <- '[[:alnum:]]+=([[:alnum:]]+|"[^"]+")'
m <- gregexpr('[[:alnum:]]+=([[:alnum:]]+|"[^"]+")', keyval0)
keyval1 <- regmatches(keyval0, m)

This gives us the correct (?) key/value pairs for each test case

> keyval1
[[1]]
[1] "ID=foo1"                  "clo=\"a=1,i=[],s=false\""
[3] "h=baz"                   

[[2]]
[1] "ID=foo2"                  "clo=\"a=1 i=[] s=false\""
[3] "h=baz"                   

[[3]]
[1] "ID=foo3"                  "date=\"Thu May 07\""     
[3] "clo=\"a=1,i=[],s=false\"" "h=baz"                   

and the individual tags / values can be extracted with relatively simple regular expressions

rex1 <- '^([[:alnum:]]+)=(.*)'
## nested tags
lapply(keyval1, sub, pattern = rex1, replacement = "\\1")
## nested keys
lapply(keyval1, sub, pattern = rex1, replacement = "\\2")

The example @LTLA provides seems to violate the standard, since Version=3.3-0-g37228af is an optional string, and it is not quoted. I added a test case.

h = c(
    'key1=<ID=foo1,clo="a=1,i=[],s=false",h=baz>',
    'key2=<ID=foo2,clo="a=1 i=[] s=false",h=baz>',
    'key3=<ID=foo3, date="Thu May 07", clo="a=1,i=[],s=false",h=baz>',
    'key4=<ID=foo3, version=3.14-15, date="Thu May 07", clo="a=1,i=[],s=false",h=baz>'
)

We could relax the pattern for finding nested values, e.g., by matching EITHER any string not containing a comma [^,]+ OR a quoted string "[^"]+"

rex0 <- '[[:alnum:]]+=([^,]+|"[^"]+")'

mtmorgan avatar Jan 04 '24 20:01 mtmorgan

Thanks for looking into this. If you have time to introduce this enhancement, that would be great. Otherwise, eventually I will study and attempt to do it.

vjcitn avatar Jan 04 '24 20:01 vjcitn

Another simple fix would be to just update hapmap_exome_chr22.vcf.gz so that it doesn't have this non-compliant header. I only stumbled across this because I was rummaging around VariantAnnotation's VCFs to find some test files.

IMO if the file isn't standards-compliant, then perhaps we shouldn't work too hard to accommodate it.

LTLA avatar Jan 05 '24 20:01 LTLA

I am going to take the step of removing offending records from hapmap_exome_chr22.vcf.gz now.

vjcitn avatar Jan 10 '24 11:01 vjcitn