biokepi

URL for B37 decoy has trailing bytes that annoy Gunzip

Open smondet opened this issue 9 years ago • 6 comments

Gunzip succeeds but displays "decompression OK, trailing garbage ignored" and returns exit status 2.

-q silences the warning (see http://www.gzip.org/#faq8) but does not make gunzip return 0.

(URL: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz)
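For illustration, here is a minimal Python sketch of the behaviour in question: decompress the gzip stream, keep any trailing bytes separate, and still fail on real corruption. The function name `gunzip_tolerating_trailing_bytes` is hypothetical and this is not how biokepi currently handles the file; it only handles the first gzip member and loads everything in memory, so it is a sketch rather than a drop-in replacement for gunzip.

```python
import zlib
from typing import Tuple


def gunzip_tolerating_trailing_bytes(data: bytes) -> Tuple[bytes, bytes]:
    """Decompress the first gzip member of `data`.

    Returns (payload, trailing_bytes). Bytes found after the end of the
    gzip member are returned instead of being treated as an error.
    Corrupt data (bad header, bad CRC) raises zlib.error, and a stream
    that ends early raises ValueError.
    """
    # Adding 16 to wbits tells zlib to expect gzip framing (header + CRC trailer).
    decomp = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)
    payload = decomp.decompress(data)
    payload += decomp.flush()
    if not decomp.eof:
        # The gzip trailer was never reached: the input is truncated.
        raise ValueError("truncated gzip stream")
    # Whatever follows the first member ends up in unused_data.
    return payload, decomp.unused_data
```

A caller could log the length of the second return value as a warning while still treating the decompression as successful.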

smondet avatar Jan 29 '16 23:01 smondet

Did #118 fix this?

arahuja avatar Feb 10 '16 17:02 arahuja

@arahuja no, I found no clean way to tell gunzip to ignore that warning without also ignoring other potential errors.

The options are:

  • parse the output of gzip to white-list that error (dirty and error prone)
  • host a well-formed fasta.gz somewhere else (easy/fast work-around)
  • ignore all gzip errors, and at the Ketrew level use a better condition on the workflow node that produces the FASTA, like computing an MD5 sum and checking it (the nicest option but longer to implement; we could do that for all downloads)
  • ?
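To make the MD5 option concrete, here is a rough sketch in Python (not Ketrew or biokepi code) of what such a condition would check. The names `md5_of_file` and `download_is_valid` are hypothetical, and the expected checksum would have to come from a trusted source; none is assumed here.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of the file at `path`, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so a multi-gigabyte FASTA is never held in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def download_is_valid(path: str, expected_md5: str) -> bool:
    """True if the file's MD5 matches `expected_md5` (case-insensitive)."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

With a check like this as the node's success condition, the gunzip exit status could be ignored, since a truncated or corrupt FASTA would fail the checksum instead.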

What do you think?

smondet avatar Feb 10 '16 18:02 smondet

Hm, self-hosting seems like a solution that will have its own issues eventually, unless we just put it in this repo?

Ignoring gzip errors and computing sums is nice, but not sure how that is manageable for all downloads.

Looks like 1000genomes acknowledges this issue with the file as well, in ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/README_human_reference_20110707:

> This file is compressed by razip from the samtools package for random access. Gzip may complain "decompression OK, trailing garbage ignored", but this does not affect the correctness of the decompressed file.

I think just putting that file here or in GitHub LFS is the easiest for now.

arahuja avatar Feb 10 '16 18:02 arahuja

@smondet what is this blocked on?

hammer avatar Dec 15 '16 16:12 hammer

@hammer it's blocked on either 1000genomes providing a proper gz file or us making a decision on how to bypass the problem :) (I'd like to implement the MD5 solution one day, but self-hosting seems to me like the fastest route)

smondet avatar Dec 15 '16 16:12 smondet

@smondet sounds like we're not blocked then, we should implement the self-hosting workaround.

hammer avatar Dec 17 '16 18:12 hammer