slivar icon indicating copy to clipboard operation
slivar copied to clipboard

make-gnotate dbSNP change `RS` from `Integer` to `Float`

Open liserjrqlxue opened this issue 3 years ago • 10 comments

Since dbSNP vcf is huge (15G) from https://ftp.ncbi.nih.gov/snp/latest_release/VCF/ . I want to use slivar to annotate RS

##INFO=<ID=RS,Number=1,Type=Integer,Description="dbSNP ID (i.e. rs number)">

use origin file GCF_000001405.25.gz, I got error message:input should be decomposed and normalized

$ ./slivar make-gnotate --prefix dbSNP154.GRCh37.p13 --field RS:RS ../bgi_anno/test/GCF_000001405.25.gz
> slivar version: 0.1.12 97859c00a4387c869719330521c36d0b12e92247
[slivar] using type int for "RS"
input should be decomposed and normalized

After 'decomposed and normalized' and tabix with

$ bcftools norm -f GRCh37.p13.fasta.gz -m- -w 10000 -O v GCF_000001405.25.gz |bgzip > GCF_000001405.25.norm.GRCh37.p13.vcf.gz
Lines   total/split/realigned/skipped:717580549/66535480/11802776/0

$ time tabix -p vcf GCF_000001405.25.norm.GRCh37.p13.vcf.gz

real	28m14.788s
user	28m4.648s
sys	0m8.694s

make-gnotate got error with SIGSEGV: Illegal storage access. (Attempt to read from nil?)

$ ./slivar make-gnotate --prefix dbSNP154.GRCh37.p13 --field RS:RS ../bgi_anno/test/GCF_000001405.25.norm.GRCh37.p13.vcf.gz
> slivar version: 0.1.12 97859c00a4387c869719330521c36d0b12e92247
SIGSEGV: Illegal storage access. (Attempt to read from nil?)

I guess this may be cause by long chromsome name like 'NC_000001.10', I'll convert to '1' and try again. If this is not the solution, I'll need your help.

liserjrqlxue avatar Sep 14 '20 02:09 liserjrqlxue

Did you show all the output? I just tried:

wget ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/Homo_sapiens/all_assembly_versions/GCF_000001405.25_GRCh37.p13/GCF_000001405.25_GRCh37.p13_genomic.fna.gz
zcat GCF_000001405.25_GRCh37.p13_genomic.fna.gz > g.fasta && samtools faidx g.fasta

wget -O - https://ftp.ncbi.nih.gov/snp/latest_release/VCF/GCF_000001405.25.gz | zcat - | head -2000000 | bcftools norm -f g.fasta -m - -w 10000 -O z -o 1kg.norm.head.vcf.gz -

../slivar make-gnotate --prefix g --field RS:RS 1kg.norm.head.vcf.gz

and that succeeds with last command showing:

> slivar version: 0.1.12 97859c00a4387c869719330521c36d0b12e92247
[W::vcf_parse] Contig 'NC_000001.10' is not defined in the header. (Quick workaround: index the file with tabix.)
[slivar] using type int for "RS"
[slivar] 500000 variants completed. at: NC_000001.10:1803630. exact: 500000 long: 7105 in 1kg.norm.head.vcf.gz
[slivar] 1000000 variants completed. at: NC_000001.10:3268786. exact: 1000000 long: 14883 in 1kg.norm.head.vcf.gz
[slivar] 1500000 variants completed. at: NC_000001.10:5091569. exact: 1500000 long: 21141 in 1kg.norm.head.vcf.gz
[slivar] 2000000 variants completed. at: NC_000001.10:6780808. exact: 2000000 long: 26861 in 1kg.norm.head.vcf.gz
[slivar] kvs.len for NC_000001.10: 2253001 after 1kg.norm.head.vcf.gz
[slivar] writing 2253001 encoded and 29735 long values for chromosome NC_000001.10
[slivar] removed 41 duplicated positions by using the value and chromosome: NC_000001.10
[slivar] wrote g.zip

brentp avatar Sep 14 '20 03:09 brentp

Note that's not full VCF, but enough to show that it doesn't fail immediately

brentp avatar Sep 14 '20 03:09 brentp

It may be some io/disk error. I can reproduct in one node tj-software-install, but run without error in other node like tj-login-24-2 捕获

By the way, does slivar or vcfanno can smartly handle chromosome name like '1' or 'chr1', or even 'NC_000001.10'?

liserjrqlxue avatar Sep 14 '20 08:09 liserjrqlxue

捕获

I test farther and got more strange results.

  1. I use rsync to copy GCF_000001405.25.norm.GRCh37.p13.vcf.gz to current workdir
  2. I use ln -sf to create link of GCF_000001405.25.norm.GRCh37.p13.vcf.gz to test.vcf.gz
  3. I use cp to copy GCF_000001405.25.norm.GRCh37.p13.vcf.gz to current workdir and rename to `tt.vcf.gz'

It seems that:

  • In nomal node tj-login-24-2. only comand ./slivar make-gnotate --prefix dbSNP154.GRCh37.p13 --field RS:RS GCF_000001405.25.norm.GRCh37.p13.vcf.gz got error while ./slivar make-gnotate --prefix dbSNP154.GRCh37.p13 --field RS:RS ./GCF_000001405.25.norm.GRCh37.p13.vcf.gz and other comands run well.
  • In strange node tj-software-install: only comand ./slivar make-gnotate --prefix dbSNP154.GRCh37.p13 --field RS:RS ./tt.vcf.gz runs well

liserjrqlxue avatar Sep 14 '20 08:09 liserjrqlxue

since dbSNP has 765,506,328 variants (count from log of make-gnotate), the zip file has size of 4.3G.

liserjrqlxue avatar Sep 14 '20 09:09 liserjrqlxue

dbSNP RS annoted result like 9.03331e+08. It seems I should use vcfanno instead.

#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO
1	865595	CM1613956	A	G	.	.	CLASS=DM?;MUT=ALT;GENE=SAMD11;STRAND=+;DNA=NM_152486.2:c.133A>G;PROT=NP_689699.2:p.K45E;DB=rs903331232;PHEN="Retinitis_pigmentosa";RANKSCORE=0.21;PMID=27734943;ACCID=CM1613956;GnomAD_AF=1.00217e-05;GnomAD_EAS_AF=0;GnomAD_Hom=0;GnomAD_EAS_Hom=0;1000G_AF=-1;1000G_EAS_AF=-1;ExAC_AF=-1;ExAC_EAS_AF=-1;ExAC_Hom=-1;ExAC_EAS_Hom=-1;RS=9.03331e+08
1	874491	CM1613954	C	T	.	.	CLASS=DM?;MUT=ALT;GENE=SAMD11;STRAND=+;DNA=NM_152486.2:c.502C>T;PROT=NP_689699.2:p.R168*;DB=rs1441881282;PHEN="Retinitis_pigmentosa";RANKSCORE=0.99;PMID=27734943;ACCID=CM1613954;GnomAD_AF=4.56696e-06;GnomAD_AF_filter;GnomAD_EAS_AF=0;GnomAD_EAS_AF_filter;GnomAD_Hom=0;GnomAD_Hom_filter;GnomAD_EAS_Hom=0;GnomAD_EAS_Hom_filter;1000G_AF=-1;1000G_EAS_AF=-1;ExAC_AF=-1;ExAC_EAS_AF=-1;ExAC_Hom=-1;ExAC_EAS_Hom=-1;RS=1.44188e+09
1	877523	CM1511864	C	G	.	.	CLASS=DM?;MUT=ALT;GENE=SAMD11;STRAND=+;DNA=NM_152486.2:c.877C>G;PROT=NP_689699.2:p.P293A;DB=rs200195897;PHEN="Autism_spectrum_disorder";RANKSCORE=0.1;PMID=26204995|30276537;ACCID=CM1511864;GnomAD_AF=0.000358429;GnomAD_EAS_AF=0;GnomAD_Hom=0;GnomAD_EAS_Hom=0;1000G_AF=-1;1000G_EAS_AF=-1;ExAC_AF=0.0012822;ExAC_EAS_AF=0;ExAC_Hom=0;ExAC_EAS_Hom=0;RS=2.00196e+08
1	879286	CS1613955	A	C	.	.	CLASS=DM?;MUT=ALT;GENE=SAMD11;STRAND=+;DNA=NM_152486.2:c.1801-2A>C;DB=rs770278990;PHEN="Retinitis_pigmentosa";RANKSCORE=0.16;PMID=27734943;ACCID=CS1613955;GnomAD_AF=4.06782e-06;GnomAD_EAS_AF=0;GnomAD_Hom=0;GnomAD_EAS_Hom=0;1000G_AF=-1;1000G_EAS_AF=-1;ExAC_AF=8.58e-06;ExAC_EAS_AF=0;ExAC_Hom=0;ExAC_EAS_Hom=0;RS=7.70279e+08
1	879375	CM1613953	C	T	.	.	CLASS=DM;MUT=ALT;GENE=SAMD11;STRAND=+;DNA=NM_152486.2:c.1888C>T;PROT=NP_689699.2:p.R630*;DB=rs761448939;PHEN="Retinitis_pigmentosa";RANKSCORE=0.49;PMID=27734943|28991257;ACCID=CM1613953;GnomAD_AF=4.43706e-05;GnomAD_EAS_AF=5.47585e-05;GnomAD_Hom=0;GnomAD_EAS_Hom=0;1000G_AF=-1;1000G_EAS_AF=-1;ExAC_AF=1.7153e-05;ExAC_EAS_AF=0;ExAC_Hom=0;ExAC_EAS_Hom=0;RS=7.61449e+08

liserjrqlxue avatar Sep 14 '20 09:09 liserjrqlxue

I'm not sure if you got this working, or how to solve the issue, but one thing to note is that slivar gnotate is limited to 32 bit integers. I think the RS numbers may go past that. I have a plan to improve this, but it is a ways out.

brentp avatar Sep 14 '20 14:09 brentp

slivar can speed up 10 time for dbSNP RS, but I think vcfanno is fast enough with parallel.

$ time ./slivar expr -v clinvar_20200728.vcf.gz -g dbSNP154.zip > clinvar_20200728.vcf.gz.RS.vcf
> slivar version: 0.1.12 97859c00a4387c869719330521c36d0b12e92247
[slivar] 0 samples matched in VCF and PED to be evaluated
[slivar] message for dbSNP154.zip:
   > created on:2020-09-14
[slivar] 10000 1:21896828 evaluated 10000 variants in 7.2 seconds (1390.5/second)
[slivar] 20000 1:53675751 evaluated 10000 variants in 0.1 seconds (151302.4/second)
[slivar] 100000 2:166153562 evaluated 100000 variants in 7.1 seconds (14096.3/second)
[slivar] 200000 4:169799311 evaluated 100000 variants in 9.7 seconds (10263.4/second)
[slivar] 300000 7:124465318 evaluated 100000 variants in 11.7 seconds (8533.3/second)
[slivar] 400000 11:2593262 evaluated 100000 variants in 13.0 seconds (7704.7/second)
[slivar] 500000 13:110850841 evaluated 100000 variants in 6.7 seconds (14961.7/second)
[slivar] Finished. evaluated 771098 total variants and wrote 771098 variants that passed your slivar expressions.

real	1m16.338s
user	1m9.289s
sys	0m5.862s
$ time vcfanno -p 6 dbSNP.mini.toml clinvar_20200728.vcf.gz > clinvar_20200728.RS.vcf

=============================================
vcfanno version 0.3.2 [built with go1.14.4]

see: https://github.com/brentp/vcfanno
=============================================
vcfanno.go:115: found 1 sources from 1 files
vcfanno.go:145: using 2 worker threads to decompress bgzip file
api.go:811: WARNING: using op 'self' when with Number='1' for 'RS' from 'dbSNP154.vcf.gz.RS.vcf.gz' can result in out-of-order values when the query is multi-allelic
api.go:812:        : this is not an issue if the query has been decomposed.
^[[Bvcfanno.go:248: annotated 771098 variants in 285.07 seconds (2704.9 / second)

real	4m45.591s
user	11m54.235s
sys	0m29.055s

liserjrqlxue avatar Sep 15 '20 02:09 liserjrqlxue

dbSNP RS annoted result like 9.03331e+08.

This could be a bug. what does the header show for RS? Integer? or Float?

brentp avatar Sep 15 '20 15:09 brentp

from dbSNP, it's Integer in header

##INFO=<ID=RS,Number=1,Type=Integer,Description="dbSNP ID (i.e. rs number)">

in annotated vcf, it changed to Float

##INFO=<ID=RS,Number=1,Type=Float,Description="field from from gnotate VCF">

when make-gnotate, It say using type int for "RS"

$ time ./slivar make-gnotate --prefix dbSNP154.debug --field RS:RS ../bgi_anno/test/dbSNP154.lite.vcf.gz
> slivar version: 0.1.12 97859c00a4387c869719330521c36d0b12e92247
[slivar] using type int for "RS"
[slivar] 500000 variants completed. at: 1:1803630. exact: 500000 long: 7105 in ../bgi_anno/test/dbSNP154.lite.vcf.gz
[slivar] 1000000 variants completed. at: 1:3268786. exact: 1000000 long: 14883 in ../bgi_anno/test/dbSNP154.lite.vcf.gz

liserjrqlxue avatar Sep 16 '20 03:09 liserjrqlxue