slivar
gnotate inconsistent zip (on large files)
I made some gnotate files for CADD using slivar 0.2.1.
I first converted the CADD TSV files to VCF (using your old cadd2vcf.py script from vcfanno), then ran
slivar make-gnotate --prefix gnotate/cadd-1.4-SNVs-GRCh37 --field phred:cadd_phred cadd.v1.4.SNVs.hg19.vcf.gz
to create the gnotate file.
When I try to use it in slivar through the --gnotate option, I get the following error:
zipfiles.nim(54) open
Error: unhandled exception: Zip archive inconsistent
error opening references/gnotate/cadd-1.4-SNVs-GRCh37.zip [IOError]
I then checked the zip file using unzip -t references/gnotate/cadd-1.4-SNVs-GRCh37.zip
and it reported some errors. The full output of this command is in the attached file, [gunzip-test-cadd.txt](https://github.com/brentp/slivar/files/6296056/gunzip-test-cadd.txt). The offending lines are:
file #34: bad zipfile offset (local header sig): 1461945258
file #46: bad zipfile offset (local header sig): 1195495634
file #49: bad zipfile offset (local header sig): 1151176738
Those lines correspond to these files:
sli.var/2/gnotate-variant.bin
sli.var/3/gnotate-variant.bin
sli.var/4/gnotate-variant.bin
At first I thought this was simple file corruption. However, I made gnotate files for CADD 1.4 and 1.6, each in GRCh37 and GRCh38 versions, and all 4 zip files return the same error. The unzip -t output (all of it is in the attached file) reports problems for the gnotate-variant files of chromosomes 2, 3 and 4 for GRCh37, and of chromosomes 2, 3, 4 and 5 for GRCh38.
I was wondering if you have any idea why this is occurring. Is this a size issue for the chromosomes? But then why are there no problems on chr1? (The gnotate zip files are 25-26G each.)
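One way to probe the size question, offered only as a sketch: the three offsets reported by unzip -t are all below 2^32, and the classic zip format stores offsets and sizes in 32-bit fields (larger values need the Zip64 extension). Whether that is actually the cause here is an assumption that this thread does not confirm. The Python snippet below uses only the standard zipfile module to list each entry's header offset and size and to run an integrity check. The helper names (list_entries, first_bad_entry) and the tiny in-memory demo archive are hypothetical and only there to show the calls.

```python
import io
import zipfile

# Largest value a classic (non-Zip64) 32-bit zip field can hold.
ZIP32_LIMIT = 0xFFFFFFFF


def list_entries(archive):
    """Return (name, header_offset, compressed_size, beyond_32bit) per entry.

    `archive` is a path or a binary file object.
    """
    rows = []
    with zipfile.ZipFile(archive) as zf:
        for info in zf.infolist():
            beyond = (
                info.header_offset > ZIP32_LIMIT
                or info.compress_size > ZIP32_LIMIT
                or info.file_size > ZIP32_LIMIT
            )
            rows.append(
                (info.filename, info.header_offset, info.compress_size, beyond)
            )
    return rows


def first_bad_entry(archive):
    """Return the name of the first entry that fails its check, or None."""
    with zipfile.ZipFile(archive) as zf:
        return zf.testzip()


# Tiny in-memory demo archive (hypothetical contents, not real gnotate data).
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as zf:
    zf.writestr("sli.var/1/gnotate-variant.bin", b"\x00" * 16)
    zf.writestr("sli.var/2/gnotate-variant.bin", b"\x01" * 16)

entries = list_entries(demo)
bad = first_bad_entry(demo)
```

On a real archive, list_entries("references/gnotate/cadd-1.4-SNVs-GRCh37.zip") would show whether the entries flagged by unzip -t are the ones that start or extend past the 32-bit limit. Note that testzip() reads every entry, so it would be slow on a 25-26G file, and Python may itself refuse to open an archive that is inconsistent.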
gunzip-test-cadd.txt: sorry, I forgot the attached file.
Hi, would you try the attached binary? I don't know how I didn't see this problem before but I think it's fixed. slivar.gz
Great! I'm running make-gnotate with it. I will let you know when it's done, probably in 1 or 2 days. Thanks a lot.
gunzip-test-cadd-v2.txt slurm-43677843.txt
It did not change anything. I get the same error when trying to annotate with it, and the unzip -t outputs are exactly the same as before for both GRCh37 and GRCh38.
I have attached the log for the creation of one of the gnotate files along with its unzip -t output.
Thank you for testing and reporting the result.
Unfortunately, I don't have a solution now. You're sure you ran with the updated binary?
There's also the problem that with something as dense as CADD, slivar will require a lot of memory (249 million * (64 + n*32) bits). I would like to modify the gnotate format to fix these issues, but that's dev time that I don't have at the moment.
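To put a number on that estimate, here is a small Python sketch of the arithmetic. It assumes that n is the number of annotated fields (the thread does not define n) and simply converts the formula above from bits to bytes. The function name is hypothetical.

```python
def gnotate_memory_bytes(n_variants, n_fields):
    """Memory estimate from the formula above, converted to bytes.

    n_variants * (64 + n_fields * 32) bits, where n_fields is ASSUMED
    to be the number of annotated fields.
    """
    bits = n_variants * (64 + n_fields * 32)
    return bits // 8


# 249 million entries with a single field (such as cadd_phred).
one_field = gnotate_memory_bytes(249_000_000, 1)
two_fields = gnotate_memory_bytes(249_000_000, 2)

print(f"1 field:  {one_field / 1e9:.2f} GB")
print(f"2 fields: {two_fields / 1e9:.2f} GB")
```

With one field this comes to roughly 3 GB for 249 million entries, and each additional field adds about 1 GB under the same assumption.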
Yes, I'm sure (I deleted all the gnotate zips for CADD beforehand and double-checked the creation log of the new one for the slivar version).
Yes, I can imagine that it would cause other potential problems, such as with memory, and I totally understand that you don't have time for it at the moment. I will keep using vcfanno for CADD whenever I need it.
Thanks a lot!