slivar
slivar copied to clipboard
make-gnotate dbSNP change `RS` from `Integer` to `Float`
Since dbSNP vcf is huge (15G) from https://ftp.ncbi.nih.gov/snp/latest_release/VCF/ .
I want to use slivar to annotate RS
##INFO=<ID=RS,Number=1,Type=Integer,Description="dbSNP ID (i.e. rs number)">
use origin file GCF_000001405.25.gz
, I got error message:input should be decomposed and normalized
$ ./slivar make-gnotate --prefix dbSNP154.GRCh37.p13 --field RS:RS ../bgi_anno/test/GCF_000001405.25.gz
> slivar version: 0.1.12 97859c00a4387c869719330521c36d0b12e92247
[slivar] using type int for "RS"
input should be decomposed and normalized
After 'decomposed and normalized' and tabix with
$ bcftools norm -f GRCh37.p13.fasta.gz -m- -w 10000 -O v GCF_000001405.25.gz |bgzip > GCF_000001405.25.norm.GRCh37.p13.vcf.gz
Lines total/split/realigned/skipped:717580549/66535480/11802776/0
$ time tabix -p vcf GCF_000001405.25.norm.GRCh37.p13.vcf.gz
real 28m14.788s
user 28m4.648s
sys 0m8.694s
make-gnotate
got error with SIGSEGV: Illegal storage access. (Attempt to read from nil?)
$ ./slivar make-gnotate --prefix dbSNP154.GRCh37.p13 --field RS:RS ../bgi_anno/test/GCF_000001405.25.norm.GRCh37.p13.vcf.gz
> slivar version: 0.1.12 97859c00a4387c869719330521c36d0b12e92247
SIGSEGV: Illegal storage access. (Attempt to read from nil?)
I guess this may be cause by long chromsome name like 'NC_000001.10', I'll convert to '1' and try again. If this is not the solution, I'll need your help.
Did you show all the output? I just tried:
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/Homo_sapiens/all_assembly_versions/GCF_000001405.25_GRCh37.p13/GCF_000001405.25_GRCh37.p13_genomic.fna.gz
zcat GCF_000001405.25_GRCh37.p13_genomic.fna.gz > g.fasta && samtools faidx g.fasta
wget -O - https://ftp.ncbi.nih.gov/snp/latest_release/VCF/GCF_000001405.25.gz | zcat - | head -2000000 | bcftools norm -f g.fasta -m - -w 10000 -O z -o 1kg.norm.head.vcf.gz -
../slivar make-gnotate --prefix g --field RS:RS 1kg.norm.head.vcf.gz
and that succeeds with last command showing:
> slivar version: 0.1.12 97859c00a4387c869719330521c36d0b12e92247
[W::vcf_parse] Contig 'NC_000001.10' is not defined in the header. (Quick workaround: index the file with tabix.)
[slivar] using type int for "RS"
[slivar] 500000 variants completed. at: NC_000001.10:1803630. exact: 500000 long: 7105 in 1kg.norm.head.vcf.gz
[slivar] 1000000 variants completed. at: NC_000001.10:3268786. exact: 1000000 long: 14883 in 1kg.norm.head.vcf.gz
[slivar] 1500000 variants completed. at: NC_000001.10:5091569. exact: 1500000 long: 21141 in 1kg.norm.head.vcf.gz
[slivar] 2000000 variants completed. at: NC_000001.10:6780808. exact: 2000000 long: 26861 in 1kg.norm.head.vcf.gz
[slivar] kvs.len for NC_000001.10: 2253001 after 1kg.norm.head.vcf.gz
[slivar] writing 2253001 encoded and 29735 long values for chromosome NC_000001.10
[slivar] removed 41 duplicated positions by using the value and chromosome: NC_000001.10
[slivar] wrote g.zip
Note that's not full VCF, but enough to show that it doesn't fail immediately
It may be some io/disk error. I can reproduct in one node tj-software-install, but run without error in other node like tj-login-24-2
By the way, does slivar
or vcfanno
can smartly handle chromosome name like '1' or 'chr1', or even 'NC_000001.10'?
I test farther and got more strange results.
- I use
rsync
to copyGCF_000001405.25.norm.GRCh37.p13.vcf.gz
to current workdir - I use
ln -sf
to create link ofGCF_000001405.25.norm.GRCh37.p13.vcf.gz
totest.vcf.gz
- I use
cp
to copyGCF_000001405.25.norm.GRCh37.p13.vcf.gz
to current workdir and rename to `tt.vcf.gz'
It seems that:
- In nomal node tj-login-24-2. only comand
./slivar make-gnotate --prefix dbSNP154.GRCh37.p13 --field RS:RS GCF_000001405.25.norm.GRCh37.p13.vcf.gz
got error while./slivar make-gnotate --prefix dbSNP154.GRCh37.p13 --field RS:RS ./GCF_000001405.25.norm.GRCh37.p13.vcf.gz
and other comands run well. - In strange node tj-software-install:
only comand
./slivar make-gnotate --prefix dbSNP154.GRCh37.p13 --field RS:RS ./tt.vcf.gz
runs well
since dbSNP has 765,506,328 variants (count from log of make-gnotate
), the zip file has size of 4.3G.
dbSNP RS annoted result like 9.03331e+08
. It seems I should use vcfanno
instead.
#CHROM POS ID REF ALT QUAL FILTER INFO
1 865595 CM1613956 A G . . CLASS=DM?;MUT=ALT;GENE=SAMD11;STRAND=+;DNA=NM_152486.2:c.133A>G;PROT=NP_689699.2:p.K45E;DB=rs903331232;PHEN="Retinitis_pigmentosa";RANKSCORE=0.21;PMID=27734943;ACCID=CM1613956;GnomAD_AF=1.00217e-05;GnomAD_EAS_AF=0;GnomAD_Hom=0;GnomAD_EAS_Hom=0;1000G_AF=-1;1000G_EAS_AF=-1;ExAC_AF=-1;ExAC_EAS_AF=-1;ExAC_Hom=-1;ExAC_EAS_Hom=-1;RS=9.03331e+08
1 874491 CM1613954 C T . . CLASS=DM?;MUT=ALT;GENE=SAMD11;STRAND=+;DNA=NM_152486.2:c.502C>T;PROT=NP_689699.2:p.R168*;DB=rs1441881282;PHEN="Retinitis_pigmentosa";RANKSCORE=0.99;PMID=27734943;ACCID=CM1613954;GnomAD_AF=4.56696e-06;GnomAD_AF_filter;GnomAD_EAS_AF=0;GnomAD_EAS_AF_filter;GnomAD_Hom=0;GnomAD_Hom_filter;GnomAD_EAS_Hom=0;GnomAD_EAS_Hom_filter;1000G_AF=-1;1000G_EAS_AF=-1;ExAC_AF=-1;ExAC_EAS_AF=-1;ExAC_Hom=-1;ExAC_EAS_Hom=-1;RS=1.44188e+09
1 877523 CM1511864 C G . . CLASS=DM?;MUT=ALT;GENE=SAMD11;STRAND=+;DNA=NM_152486.2:c.877C>G;PROT=NP_689699.2:p.P293A;DB=rs200195897;PHEN="Autism_spectrum_disorder";RANKSCORE=0.1;PMID=26204995|30276537;ACCID=CM1511864;GnomAD_AF=0.000358429;GnomAD_EAS_AF=0;GnomAD_Hom=0;GnomAD_EAS_Hom=0;1000G_AF=-1;1000G_EAS_AF=-1;ExAC_AF=0.0012822;ExAC_EAS_AF=0;ExAC_Hom=0;ExAC_EAS_Hom=0;RS=2.00196e+08
1 879286 CS1613955 A C . . CLASS=DM?;MUT=ALT;GENE=SAMD11;STRAND=+;DNA=NM_152486.2:c.1801-2A>C;DB=rs770278990;PHEN="Retinitis_pigmentosa";RANKSCORE=0.16;PMID=27734943;ACCID=CS1613955;GnomAD_AF=4.06782e-06;GnomAD_EAS_AF=0;GnomAD_Hom=0;GnomAD_EAS_Hom=0;1000G_AF=-1;1000G_EAS_AF=-1;ExAC_AF=8.58e-06;ExAC_EAS_AF=0;ExAC_Hom=0;ExAC_EAS_Hom=0;RS=7.70279e+08
1 879375 CM1613953 C T . . CLASS=DM;MUT=ALT;GENE=SAMD11;STRAND=+;DNA=NM_152486.2:c.1888C>T;PROT=NP_689699.2:p.R630*;DB=rs761448939;PHEN="Retinitis_pigmentosa";RANKSCORE=0.49;PMID=27734943|28991257;ACCID=CM1613953;GnomAD_AF=4.43706e-05;GnomAD_EAS_AF=5.47585e-05;GnomAD_Hom=0;GnomAD_EAS_Hom=0;1000G_AF=-1;1000G_EAS_AF=-1;ExAC_AF=1.7153e-05;ExAC_EAS_AF=0;ExAC_Hom=0;ExAC_EAS_Hom=0;RS=7.61449e+08
I'm not sure if you got this working, or how to solve the issue, but one thing to note is that slivar gnotate is limited to 32 bit integers. I think the RS numbers may go past that. I have a plan to improve this, but it is a ways out.
slivar
can speed up 10 time for dbSNP RS
, but I think vcfanno
is fast enough with parallel.
$ time ./slivar expr -v clinvar_20200728.vcf.gz -g dbSNP154.zip > clinvar_20200728.vcf.gz.RS.vcf
> slivar version: 0.1.12 97859c00a4387c869719330521c36d0b12e92247
[slivar] 0 samples matched in VCF and PED to be evaluated
[slivar] message for dbSNP154.zip:
> created on:2020-09-14
[slivar] 10000 1:21896828 evaluated 10000 variants in 7.2 seconds (1390.5/second)
[slivar] 20000 1:53675751 evaluated 10000 variants in 0.1 seconds (151302.4/second)
[slivar] 100000 2:166153562 evaluated 100000 variants in 7.1 seconds (14096.3/second)
[slivar] 200000 4:169799311 evaluated 100000 variants in 9.7 seconds (10263.4/second)
[slivar] 300000 7:124465318 evaluated 100000 variants in 11.7 seconds (8533.3/second)
[slivar] 400000 11:2593262 evaluated 100000 variants in 13.0 seconds (7704.7/second)
[slivar] 500000 13:110850841 evaluated 100000 variants in 6.7 seconds (14961.7/second)
[slivar] Finished. evaluated 771098 total variants and wrote 771098 variants that passed your slivar expressions.
real 1m16.338s
user 1m9.289s
sys 0m5.862s
$ time vcfanno -p 6 dbSNP.mini.toml clinvar_20200728.vcf.gz > clinvar_20200728.RS.vcf
=============================================
vcfanno version 0.3.2 [built with go1.14.4]
see: https://github.com/brentp/vcfanno
=============================================
vcfanno.go:115: found 1 sources from 1 files
vcfanno.go:145: using 2 worker threads to decompress bgzip file
api.go:811: WARNING: using op 'self' when with Number='1' for 'RS' from 'dbSNP154.vcf.gz.RS.vcf.gz' can result in out-of-order values when the query is multi-allelic
api.go:812: : this is not an issue if the query has been decomposed.
^[[Bvcfanno.go:248: annotated 771098 variants in 285.07 seconds (2704.9 / second)
real 4m45.591s
user 11m54.235s
sys 0m29.055s
dbSNP RS annoted result like
9.03331e+08
.
This could be a bug. what does the header show for RS? Integer? or Float?
from dbSNP, it's Integer
in header
##INFO=<ID=RS,Number=1,Type=Integer,Description="dbSNP ID (i.e. rs number)">
in annotated vcf, it changed to Float
##INFO=<ID=RS,Number=1,Type=Float,Description="field from from gnotate VCF">
when make-gnotate, It say using type int for "RS"
$ time ./slivar make-gnotate --prefix dbSNP154.debug --field RS:RS ../bgi_anno/test/dbSNP154.lite.vcf.gz
> slivar version: 0.1.12 97859c00a4387c869719330521c36d0b12e92247
[slivar] using type int for "RS"
[slivar] 500000 variants completed. at: 1:1803630. exact: 500000 long: 7105 in ../bgi_anno/test/dbSNP154.lite.vcf.gz
[slivar] 1000000 variants completed. at: 1:3268786. exact: 1000000 long: 14883 in ../bgi_anno/test/dbSNP154.lite.vcf.gz