SeqArray
SeqArray copied to clipboard
seqGDS2VCF error after using seqMerge
After having merged some gds files I am getting an error with seqGDS2VCF due to seqSummary() This stems from the .summary_filter() function were the attributes of the merged gds seems to somehow get twisted.
traceback()
4: stop(gettextf("arguments imply differing number of rows: %s",
paste(unique(nrows), collapse = ", ")), domain = NA)
3: data.frame(ID = id, Description = dp, stringsAsFactors = FALSE)
2: .summary_filter(gdsfile, check, verbose)
1: seqSummary(gdsmerged, check = "none", verbose = FALSE)
Filter attribute in one of the original gds files
$R.class
[1] "factor"
$R.levels
[1] "PASS" "MONOALLELIC"
$Description
[1] "All filters passed"
[2] "Site represents one ALT allele in a region with multiple variants that could not be unified into non-overlapping multi-allelic sites"
$md5
[1] "dde0f98a641ade2203cb5ab84d037576"
Filter attribute in the merged gds
$Description
[1] "All filters passed"
[2] "Site represents one ALT allele in a region with multiple variants that could not be unified into non-overlapping multi-allelic sites"
[3] "All filters passed"
[4] "Site represents one ALT allele in a region with multiple variants that could not be unified into non-overlapping multi-allelic sites"
[5] "All filters passed"
[6] "Site represents one ALT allele in a region with multiple variants that could not be unified into non-overlapping multi-allelic sites"
[7] "All filters passed"
[8] "Site represents one ALT allele in a region with multiple variants that could not be unified into non-overlapping multi-allelic sites"
[9] "All filters passed"
[10] "Site represents one ALT allele in a region with multiple variants that could not be unified into non-overlapping multi-allelic sites"
[11] "All filters passed"
[12] "Site represents one ALT allele in a region with multiple variants that could not be unified into non-overlapping multi-allelic sites"
[13] "All filters passed"
[14] "Site represents one ALT allele in a region with multiple variants that could not be unified into non-overlapping multi-allelic sites"
[15] "All filters passed"
[16] "Site represents one ALT allele in a region with multiple variants that could not be unified into non-overlapping multi-allelic sites"
[17] "All filters passed"
[18] "Site represents one ALT allele in a region with multiple variants that could not be unified into non-overlapping multi-allelic sites"
[19] "All filters passed"
[20] "Site represents one ALT allele in a region with multiple variants that could not be unified into non-overlapping multi-allelic sites"
[21] "All filters passed"
[22] "Site represents one ALT allele in a region with multiple variants that could not be unified into non-overlapping multi-allelic sites"
$md5
[1] "754cf550034a5297723ac2e7ac510f9c"
sessionInfo()
R version 4.0.0 (2020-04-24)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)
Matrix products: default
BLAS/LAPACK: /services/tools/intel/perflibs/2019/compilers_and_libraries_2019.0.117/linux/mkl/lib/intel64_lin/libmkl_gf_lp64.so
locale:
[1] LC_CTYPE=en_US.utf-8 LC_NUMERIC=C
[3] LC_TIME=en_US.utf-8 LC_COLLATE=en_US.utf-8
[5] LC_MONETARY=en_US.utf-8 LC_MESSAGES=en_US.utf-8
[7] LC_PAPER=en_US.utf-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.utf-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] GMMAT_1.3.1 SeqVarTools_1.28.0 SeqArray_1.30.0 gdsfmt_1.26.0
loaded via a namespace (and not attached):
[1] Rcpp_1.0.5 pillar_1.4.7 compiler_4.0.0
[4] bigparallelr_0.3.0 GenomeInfoDb_1.26.0 XVector_0.30.0
[7] bitops_1.0-6 iterators_1.0.13 tools_4.0.0
[10] zlibbioc_1.36.0 tibble_3.0.4 lifecycle_0.2.0
[13] nlme_3.1-147 lattice_0.20-41 mgcv_1.8-31
[16] logistf_1.24 pkgconfig_2.0.3 rlang_0.4.9
[19] Matrix_1.2-18 foreach_1.5.1 flock_0.7
[22] parallel_4.0.0 RhpcBLASctl_0.20-137 CompQuadForm_1.4.3
[25] GenomeInfoDbData_1.2.4 bigassertr_0.1.3 dplyr_1.0.2
[28] generics_0.1.0 vctrs_0.3.5 Biostrings_2.58.0
[31] S4Vectors_0.28.0 IRanges_2.24.0 tidyselect_1.1.0
[34] stats4_4.0.0 grid_4.0.0 data.table_1.13.4
[37] glue_1.4.2 mice_3.12.0 Biobase_2.50.0
[40] R6_2.5.0 GWASExactHW_1.01 BiocParallel_1.24.1
[43] tidyr_1.1.2 purrr_0.3.4 magrittr_2.0.1
[46] Rsamtools_2.6.0 backports_1.2.0 ellipsis_0.3.1
[49] codetools_0.2-16 operator.tools_1.6.3 formula.tools_1.7.1
[52] BiocGenerics_0.36.0 GenomicRanges_1.42.0 splines_4.0.0
[55] RCurl_1.98-1.2 doParallel_1.0.16 broom_0.7.2
[58] crayon_1.3.4
Hi Zheng Will you be addressing this anytime soon? Just need to know cause otherwise I will make a quick fix myself so I can move on.
Bw, Gustav
A quick fix could be:
f <- seqOpen("your_gds_file.gds", readonly=F)
node <- index.gdsn(f, "annotation/filter")
(s <- get.attr.gdsn(node)$Description)
put.attr.gdsn(node, "Description", unique(s))
seqClose(f)
Ah yes. Thank you!
In your case, did you merge multiple gds files with different variants? or multiple gds files with different samples?
different variants. Actually merging the 200K UKB exome vcfs (gds).
seqGDS2VCF is also printing "." filter as NA which cause some downstream error since NA is not declared in the header. And I don't know how to rewrite the header in the gds without messing something up. Also the contigs are converted to 'chrX' -> 'X' (which I prefer) but that is not changed in contigs listed in the header, which also cause some downstream errors. I tried to fix using write.gdsn() for "description/vcf.contig/ID" but it doesn't work.
Sorry, solved the last by just using add.gdsn instead of write.gdsn. And I added NA as a filterlevel in filter attribute which seem to work
"seqGDS2VCF is also printing "." filter as NA." It is fixed in the development version: http://www.bioconductor.org/packages/devel/bioc/news/SeqArray/NEWS