VariantAnnotation icon indicating copy to clipboard operation
VariantAnnotation copied to clipboard

Convert `Number=A` to `Number=1` when creating an `ExpandedVCF`

Open LTLA opened this issue 6 months ago • 5 comments

It seems reasonable that per-allele ##INFO fields should be converted from Number=A to Number=1 when expand() is called, given that the 1:many relationship between rows and allelic values is now flattened into a 1:1 mapping.

This has practical consequences, too, as we can see:

fl <- system.file("extdata", "ex2.vcf", package="VariantAnnotation")

library(VariantAnnotation)
out <- tempfile()
first <- readVcf(fl)
first <- expand(first)

writeVcf(first, out)
roundtrip <- readVcf(out, row.names=FALSE)
roundtrip <- expand(roundtrip)

all.equal(first, roundtrip)
## [1] "Attributes: < Component “assays”: Attributes: < Component “data”: Attributes: < Component “listData”: Component “GT”: Attributes: < Length mismatch: comparison on first 1 components > > > >"
## [2] "Attributes: < Component “assays”: Attributes: < Component “data”: Attributes: < Component “listData”: Component “GQ”: Attributes: < Length mismatch: comparison on first 1 components > > > >"
## [3] "Attributes: < Component “assays”: Attributes: < Component “data”: Attributes: < Component “listData”: Component “DP”: Attributes: < Length mismatch: comparison on first 1 components > > > >"
## [4] "Attributes: < Component “assays”: Attributes: < Component “data”: Attributes: < Component “listData”: Component “HQ”: Attributes: < Length mismatch: comparison on first 1 components > > > >"
## [5] "Attributes: < Component “info”: Attributes: < Component “listData”: Component “AF”: Modes: numeric, S4 > >"
## [6] "Attributes: < Component “info”: Attributes: < Component “listData”: Component “AF”: Attributes: < target is NULL, current is list > > >"
## [7] "Attributes: < Component “info”: Attributes: < Component “listData”: Component “AF”: target is numeric, current is CompressedNumericList > >"
## [8] "Attributes: < Component “metadata”: Component “header”: Attributes: < Component “header”: Attributes: < Component “listData”: Component “fileDate”: Attributes: < Component “listData”: Component “Value”: 1 string mismatch > > > >"

We can ignore mismatches 1-4, as these are inconsequential to most end-users (albeit annoying to developers, see #78). We can also ignore mismatch 8, which is discussed in #78. The interesting discrepancies are that of 5-7, where AF becomes a NumericList after a roundtrip through the VCF file. This is because its ##info is still registering Number=A but should really be Number=1 to match the fact that it's already been flattened by the expand() call to generate first.

Session information
R Under development (unstable) (2023-11-29 r85646)
Platform: aarch64-apple-darwin22.5.0
Running under: macOS Ventura 13.6.1

Matrix products: default
BLAS:   /Users/luna/Software/R/trunk/lib/libRblas.dylib
LAPACK: /Users/luna/Software/R/trunk/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
 [1] VariantAnnotation_1.49.2    Rsamtools_2.19.2
 [3] Biostrings_2.71.1           XVector_0.43.0
 [5] SummarizedExperiment_1.33.1 Biobase_2.63.0
 [7] GenomicRanges_1.55.1        GenomeInfoDb_1.39.5
 [9] IRanges_2.37.0              S4Vectors_0.41.3
[11] MatrixGenerics_1.15.0       matrixStats_1.2.0
[13] BiocGenerics_0.49.1

loaded via a namespace (and not attached):
 [1] KEGGREST_1.43.0          rjson_0.2.21             lattice_0.22-5
 [4] vctrs_0.6.5              tools_4.4.0              bitops_1.0-7
 [7] generics_0.1.3           curl_5.2.0               parallel_4.4.0
[10] tibble_3.2.1             fansi_1.0.6              AnnotationDbi_1.65.2
[13] RSQLite_2.3.4            blob_1.2.4               pkgconfig_2.0.3
[16] Matrix_1.6-4             BSgenome_1.71.1          dbplyr_2.4.0
[19] lifecycle_1.0.4          GenomeInfoDbData_1.2.11  compiler_4.4.0
[22] stringr_1.5.1            progress_1.2.3           codetools_0.2-19
[25] yaml_2.3.8               RCurl_1.98-1.13          pillar_1.9.0
[28] crayon_1.5.2             BiocParallel_1.37.0      DelayedArray_0.29.0
[31] cachem_1.0.8             abind_1.4-5              tidyselect_1.2.0
[34] digest_0.6.33            stringi_1.8.3            restfulr_0.0.15
[37] dplyr_1.1.4              biomaRt_2.59.0           fastmap_1.1.1
[40] grid_4.4.0               cli_3.6.2                SparseArray_1.3.2
[43] magrittr_2.0.3           S4Arrays_1.3.1           GenomicFeatures_1.55.1
[46] XML_3.99-0.16            utf8_1.2.4               rappdirs_0.3.3
[49] filelock_1.0.3           prettyunits_1.2.0        bit64_4.0.5
[52] httr_1.4.7               bit_4.0.5                png_0.1-8
[55] hms_1.1.3                memoise_2.0.1            BiocIO_1.13.0
[58] BiocFileCache_2.11.1     rtracklayer_1.63.0       rlang_1.1.2
[61] glue_1.6.2               DBI_1.2.0                xml2_1.3.6
[64] R6_2.5.1                 GenomicAlignments_1.39.0 zlibbioc_1.49.0

LTLA avatar Jan 02 '24 21:01 LTLA

@hpages are you familiar with expand()?

vjcitn avatar Jan 03 '24 00:01 vjcitn

I have some changes that address this, am testing them now.

vjcitn avatar Jan 04 '24 20:01 vjcitn

Addressed in git.bioconductor.org devel branch with e3103a

vjcitn avatar Jan 10 '24 13:01 vjcitn

also in github ... presumably this change should be ported back to RELEASE_3_18? @mtmorgan ?

vjcitn avatar Jan 10 '24 13:01 vjcitn

personally I'd leave it in devel to provide a chance for any ramifications to materialize... I think this is the commit

mtmorgan avatar Jan 10 '24 14:01 mtmorgan