
gnotate inconsistent zip (on large files)

Open fcliquet opened this issue 3 years ago • 6 comments

I made some gnotate files for CADD using slivar 0.2.1.

I first converted the CADD TSV files to VCF (using your old cadd2vcf.py script from vcfanno), then ran slivar make-gnotate --prefix gnotate/cadd-1.4-SNVs-GRCh37 --field phred:cadd_phred cadd.v1.4.SNVs.hg19.vcf.gz to create the gnotate file. Attached file: [gunzip-test-cadd.txt](https://github.com/brentp/slivar/files/6296056/gunzip-test-cadd.txt)

When I try to use it in slivar through the --gnotate option, I get the following error:

zipfiles.nim(54)         open
Error: unhandled exception: Zip archive inconsistent
error opening references/gnotate/cadd-1.4-SNVs-GRCh37.zip [IOError]

I then checked the zip file with unzip -t references/gnotate/cadd-1.4-SNVs-GRCh37.zip, which reported several errors. The full output of this command is in the attached file, but the offending lines are:

file #34:  bad zipfile offset (local header sig):  1461945258
file #46:  bad zipfile offset (local header sig):  1195495634
file #49:  bad zipfile offset (local header sig):  1151176738

Those lines correspond to the files:

sli.var/2/gnotate-variant.bin
sli.var/3/gnotate-variant.bin
sli.var/4/gnotate-variant.bin
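The mapping above from unzip -t entry numbers to member names can also be cross-checked programmatically. The following is a minimal sketch using Python's standard zipfile module, under the assumption that the archive's central directory is still readable. The function name find_unreadable_members is hypothetical and not part of slivar. It only checks that each member's local header can be opened; it does not verify CRCs or diagnose the underlying cause.

```python
import zipfile


def find_unreadable_members(archive):
    """Return (name, header_offset, error) for members whose local header cannot be read.

    `archive` may be a path or a seekable file-like object.
    """
    bad = []
    with zipfile.ZipFile(archive) as zf:
        for info in zf.infolist():
            try:
                # Opening a member makes zipfile seek to header_offset and
                # validate the local file header signature found there.
                with zf.open(info) as member:
                    member.read(1)
            except (zipfile.BadZipFile, OSError) as exc:
                bad.append((info.filename, info.header_offset, str(exc)))
    return bad
```

Calling find_unreadable_members("references/gnotate/cadd-1.4-SNVs-GRCh37.zip") would list each member that fails to open together with the offset recorded for it in the central directory, which can then be compared with the offsets that unzip -t reports.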

At first I thought this was simple file corruption. However, I made gnotate files for CADD 1.4 and 1.6, each in GRCh37 and GRCh38 versions, and all 4 zip files return the same error. The unzip -t output (all of it is in the attached file) reports problems for the gnotate-variant files of chromosomes 2, 3 and 4 in the case of GRCh37, and of chromosomes 2, 3, 4 and 5 for GRCh38.

I was wondering if you have any idea why this is occurring. Is it a size issue for these chromosomes (the gnotate zip files are 25-26G each)? But then why is there no problem on chr1?

fcliquet avatar Apr 12 '21 11:04 fcliquet

gunzip-test-cadd.txt: sorry, I forgot the attached file.

fcliquet avatar Apr 12 '21 11:04 fcliquet

Hi, would you try the attached binary? I don't know how I didn't see this problem before but I think it's fixed. slivar.gz

brentp avatar Apr 12 '21 13:04 brentp

Great! I'm running make-gnotate with it. I will let you know when it's done, probably in 1 or 2 days. Thanks a lot.

fcliquet avatar Apr 12 '21 14:04 fcliquet

gunzip-test-cadd-v2.txt slurm-43677843.txt

It did not change anything. I get the same error when trying to annotate with it, and the unzip -t outputs are exactly the same as before for both GRCh37 and GRCh38.

I have attached the log from the creation of one of the gnotate files along with its unzip -t output.

fcliquet avatar Apr 13 '21 07:04 fcliquet

Thank you for testing and reporting the result.

Unfortunately, I don't have a solution now. You're sure you ran with the updated binary?

There's also the problem that with something as dense as CADD, slivar will require a lot of memory: 249 million * (64 + n*32) bits. I would like to modify the gnotate format to fix these issues, but that's dev time that I don't have at the moment.
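As a rough illustration of the memory formula above, this Python sketch evaluates it for a few values of n. Reading n as the number of annotation fields stored per variant is an assumption (the thread does not define it), and the function name is hypothetical rather than anything provided by slivar.

```python
def gnotate_memory_bytes(n_variants, n_fields):
    """Estimate memory from the formula in this thread: n_variants * (64 + n * 32) bits.

    Treating n as the number of annotation fields is an assumption.
    """
    bits_per_variant = 64 + n_fields * 32
    return n_variants * bits_per_variant // 8  # convert bits to bytes


if __name__ == "__main__":
    for n in (1, 2, 4):
        gb = gnotate_memory_bytes(249_000_000, n) / 1e9
        print(f"n={n}: {gb:.2f} GB")
```

With a single field (n = 1), as in the --field phred:cadd_phred command above, the formula gives 249 million * 96 bits, or about 3 GB.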

brentp avatar Apr 13 '21 07:04 brentp

Yes, I'm sure (I deleted all the CADD gnotate zip files beforehand and double-checked the creation log of the new one for the slivar version).

Yes, I can imagine that it would cause other problems, such as with memory, and I totally understand that you don't have time for it at the moment. I will keep using vcfanno for CADD whenever I need it.

Thanks a lot!

fcliquet avatar Apr 13 '21 09:04 fcliquet