
Can't process gzipped fastq

Open ohthetrees opened this issue 6 years ago • 12 comments

Hi, I'm just getting started with Nonpareil, thanks for your work.

I'm unable to process my gzipped fastq. If I first uncompress the file, it processes as expected. The error:

$ nonpareil -s ETNP_120m_R2.name.fastq.gz -t 4 -T kmer -f fastq -b ETNP_120m_R2.nonpareil.k
Nonpareil v3.301
Fatal error:
The file provided does not have the proper fastq format
 [      0.0] Fatal error: The file provided does not have the proper fastq format

ohthetrees avatar May 17 '18 22:05 ohthetrees

Sorry for the loooong delay, I'm back now at tending to the issues.

I believe this is an issue with the kmer kernel, which doesn't allow gzipped input because of the random access function it uses (@gunturus please comment if I'm wrong).

Unfortunately, I don't think this can be easily resolved. I'll leave this issue open until I add a corresponding comment to the documentation, but you'll have to unzip the fastq file prior to using nonpareil.

lmrodriguezr avatar Aug 28 '19 16:08 lmrodriguezr

I'm starting to investigate nonpareil, and also had the same issue.

Gzipped input support would be very useful, because I have >100 sequencing files, all in the >1GB file-size range, so having to decompress each one would be a bit nasty when trying to parallelise processing all the files at once.

So I would like to give support to this, if a solution is feasible (even if there is an internal temporary decompression)!

jfy133 avatar Nov 06 '20 10:11 jfy133

@gunturus Do you have an update on this issue? I know you were looking into it. Thanks!

lmrodriguezr avatar Nov 06 '20 18:11 lmrodriguezr

@gunturus do you have any more news? I'm interested in potentially adding nonpareil to the nf-core/eager pipeline, but the lack of gzip support is unfortunately a deal breaker...

jfy133 avatar Feb 25 '21 11:02 jfy133

@jfy133 unfortunately gzip is not supported. @lmrodriguezr do you have any suggestions to provide gzip support? I have no idea.

gunturus avatar Feb 25 '21 12:02 gunturus

Do you think this is in any way on a roadmap @lmrodriguezr? Just to know if I should look for different solutions instead.

jfy133 avatar Jun 09 '21 11:06 jfy133

I would also like to add that having support for compressed FASTQ files would be good.

VGalata avatar Aug 24 '21 08:08 VGalata

Hello. We're finally back at this issue, and it's top of the roadmap. An initial not-so-clean solution would be to unzip the files into a temporary directory, launch nonpareil, and then remove the directory. Would this work as a temporary solution? If yes, I can implement it into a bash wrapper so you could use it out of the box.
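For anyone who needs this before a wrapper ships, the unzip, run, clean up idea could be sketched roughly as below. This is a Python sketch, not the bash wrapper mentioned above. The names `decompressed_copy` and `run_nonpareil` are hypothetical helpers, and the nonpareil flags are simply the ones from the command in the original report.

```python
import contextlib
import gzip
import shutil
import subprocess
import tempfile
from pathlib import Path


@contextlib.contextmanager
def decompressed_copy(gz_path):
    """Yield the path of a temporary, uncompressed copy of gz_path.

    The temporary directory (and the copy inside it) is deleted on exit,
    even if the body of the with-block raises.
    """
    gz_path = Path(gz_path)
    with tempfile.TemporaryDirectory(prefix="nonpareil_") as tmp_dir:
        # Assumes the input name ends in ".gz"; dropping that suffix
        # gives the name of the plain file.
        plain_path = Path(tmp_dir) / gz_path.with_suffix("").name
        with gzip.open(gz_path, "rb") as src, open(plain_path, "wb") as dst:
            shutil.copyfileobj(src, dst)  # copies in chunks, low memory use
        yield plain_path


def run_nonpareil(gz_path, out_prefix, threads=4):
    """Hypothetical helper: run nonpareil on a temporary uncompressed copy."""
    with decompressed_copy(gz_path) as fastq:
        subprocess.run(
            ["nonpareil", "-s", str(fastq), "-T", "kmer", "-f", "fastq",
             "-t", str(threads), "-b", out_prefix],
            check=True,  # raise if nonpareil exits with an error
        )
```

Note that this still needs enough scratch space for one uncompressed file per job running in parallel.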

A more robust solution is to read directly from the zipped file, but this will take some heavy lifting because we will need to replace a random file access with another method. It's also doable, but it'll take us a bit longer, so hopefully the first option works in the meantime?
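To illustrate what replacing random access can look like in general: a gzip stream can only be decoded front to back, so picking reads at random has to happen in a single sequential pass. One generic technique for that is reservoir sampling (Algorithm R). The sketch below is only an illustration of that technique, not a description of how Nonpareil's kernels work or of the planned fix. All function names are made up for the example, and it assumes strict four-line FASTQ records.

```python
import gzip
import random
from itertools import islice


def iter_fastq(handle):
    """Yield (header, sequence, plus, quality) tuples from a text handle.

    Assumes strict four-line FASTQ records (no wrapped sequences).
    """
    while True:
        record = tuple(line.rstrip("\n") for line in islice(handle, 4))
        if not record:
            return  # end of file
        if len(record) != 4 or not record[0].startswith("@"):
            raise ValueError("malformed FASTQ record")
        yield record


def reservoir_sample(records, k, seed=None):
    """Return up to k records drawn uniformly at random in one pass."""
    rng = random.Random(seed)
    sample = []
    for i, record in enumerate(records):
        if i < k:
            sample.append(record)  # fill the reservoir first
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = record  # replace with probability k / (i + 1)
    return sample


def sample_gzipped_fastq(gz_path, k, seed=None):
    """Sample k reads from a gzipped FASTQ without seeking in the file."""
    with gzip.open(gz_path, "rt") as handle:
        return reservoir_sample(iter_fastq(handle), k, seed)
```

The trade-off is that every sample costs a full pass over the file, whereas random access on an uncompressed file can jump straight to a read.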

lmrodriguezr avatar Feb 09 '22 16:02 lmrodriguezr

Dear @lmrodriguezr,

Thank you very much for looking into this!

For our purpose, having the second option implemented would be better. We use nonpareil in a snakemake workflow where we want to move away from using unzipped FASTQ files, and we would like to avoid unnecessary unzipping if possible. And, as you say yourself, that would also be a more robust solution, and I think it would be worth waiting for.

VGalata avatar Feb 10 '22 06:02 VGalata

@lmrodriguezr we are in the same situation as @VGalata as we would like to add it to a nextflow pipeline ;).

However, I think unzipping to a /tmp location with automatic cleanup afterwards might be an OK temporary workaround, as then at least we don't have to deal with the unzipping ourselves. On the other hand, this depends on the implementation, and whether you rely on an internal unzipping library within the bash script, or on a tool already installed on the user's machine (which is much more flaky, unfortunately, as this is often frustratingly not very portable).

But depending on the time it takes for the more robust solution, I guess I would prefer to wait a bit longer so that the time investment goes into an 'inbuilt' solution.

jfy133 avatar Feb 10 '22 07:02 jfy133

@lmrodriguezr just another thought... would it be easier to refactor input to allow stdin?

then one could simply do zcat <fastq>.gz | nonpareil <additional params>

Just sayin', as that would also be fine with me as a way of accepting gzipped input, in terms of usability.
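For workflow code that is already in Python, the same piping idea could look roughly like the sketch below. It is generic: `pipe_gz_to_stdin` is a made-up name, and `command` stands for any program that reads its input from stdin, which nonpareil does not do as of this thread. One caveat: a pipe cannot be seeked, so a kernel that relies on random file access would presumably need the same refactoring as for direct gzip reading.

```python
import gzip
import shutil
import subprocess


def pipe_gz_to_stdin(gz_path, command):
    """Decompress gz_path on the fly and feed the bytes to command's stdin.

    Returns the exit code of the child process.
    """
    proc = subprocess.Popen(command, stdin=subprocess.PIPE)
    try:
        with gzip.open(gz_path, "rb") as src:
            # Stream in chunks so the whole file never sits in memory.
            shutil.copyfileobj(src, proc.stdin)
    finally:
        # Closing stdin signals end-of-input to the child.
        proc.stdin.close()
    return proc.wait()
```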

jfy133 avatar Mar 14 '22 09:03 jfy133