Reading .bz2 files fails to decompress or segfaults
This was tested using the 1.0.0 conda build (is that one just the wrapped static build?) as well as with several different 'static' and containerized versions from 0.9 to 1.0.1.
In all cases, loading the data failed at the same step, but depending on the version and how it was compiled, one of two errors was seen:
...
[Wed 31-07-2019 11:24:23] Line 13: Created & opened temporary file /tmp/preprocessed.singles...fq12609-4.gz
/.singularity.d/runscript: line 3: 12609 Segmentation fault (core dumped) ngless "$@"
and
...
[Wed 31-07-2019 11:23:51] Line 13: Created & opened temporary file /tmp/preprocessed.singles...fq8945-4.gz
Exiting after internal error. If you can reproduce this issue, please run your script with the --trace flag and report a bug at http://github.com/ngless-toolkit/ngless/issues
user error (BZ2_bzDecompress: -1)
We didn't try the Docker containers, but those also make use of the static builds, so they should be equally affected.
I also tried using the same binary on the bz2 files in our test suite and those all worked fine, which hints at some buffer- or file-size-related issue.
I'm currently in the process of creating a bz2 file that is big enough to trigger the error locally. If it isn't too big, I'll add it to the test suite.
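In case it helps anyone reproduce this, here is a small sketch (the file name, read layout, and target size are made up for illustration, not the actual data) for generating a FastQ file of roughly a chosen uncompressed size, so the failing size can be narrowed down:

```python
# Sketch: write a synthetic FastQ file of (roughly) a chosen uncompressed size,
# so the failing size can be bisected. Read layout, file name, and target size
# are illustrative only.
import itertools

TARGET_BYTES = 950_000  # adjust while bisecting

with open("synthetic.fastq", "w") as out:
    written = 0
    for i in itertools.count():
        record = (
            f"@synthetic.read.{i}\n"
            + "ACGT" * 25 + "\n"   # 100 bp of dummy sequence
            + "+\n"
            + "I" * 100 + "\n"     # matching dummy quality string
        )
        out.write(record)
        written += len(record)
        if written >= TARGET_BYTES:
            break
```

Compressing the result with bzip2 and with pbzip2 at different TARGET_BYTES values should make it easier to pin down where the failure starts.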
Credits to @jakob-wirbel for finding this bug.
Some interesting findings.
If using pbzip2 (the parallel version of bzip2) to create the files, ngless is able to consume them only up to a certain size. In the test case I set up locally, a FastQ file with 9724 lines (266413 bytes compressed, 900170 bytes uncompressed) causes ngless to fail with BZ2_bzDecompress: -1. Regular Unix bzip2 is able to decompress the same file without problems.
On the other hand, when using regular bzip2, I tried as many as 90000 lines and ngless is still able to consume the files without error.
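The structural difference between the two kinds of file can also be imitated in pure Python, without pbzip2 installed. This is only an illustration with made-up dummy data, not the actual test files: pbzip2 compresses fixed-size pieces of the input independently and concatenates the streams (see the man-page excerpt quoted below), whereas regular bzip2 produces a single stream.

```python
# Sketch: imitate pbzip2-style output by compressing fixed-size pieces of the
# input independently and concatenating the resulting bzip2 streams.
# The dummy input and the 900k split size are illustrative.
import bz2

data = b"@read\nACGTACGT\n+\nIIIIIIII\n" * 50_000  # ~1.3 MB of dummy FastQ
BLOCK = 900_000                                     # pbzip2's default 900k split

# Regular bzip2: a single stream covering the whole input.
single_stream = bz2.compress(data)

# pbzip2-like: one independent stream per piece, concatenated.
multi_stream = b"".join(
    bz2.compress(data[i:i + BLOCK]) for i in range(0, len(data), BLOCK)
)

# Both are valid .bz2 files for multi-stream-aware tools (bunzip2, Python's bz2):
assert bz2.decompress(single_stream) == data
assert bz2.decompress(multi_stream) == data
```

If this is the right explanation, writing multi_stream out as a .fq.bz2 and feeding it to ngless should trigger the same BZ2_bzDecompress: -1 without needing pbzip2 at all.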
From the pbzip2 manual page:
Files that are compressed with pbzip2 are broken up into pieces and each individual piece is compressed. This is how pbzip2 runs faster on multiple CPUs since the pieces can be compressed simultaneously. The final .bz2 file may be slightly larger than if it was compressed with the regular bzip2 program due to this file splitting (usually less than 0.2% larger). Files that are compressed with pbzip2 will also gain considerable speedup when decompressed using pbzip2.
Files that were compressed using bzip2 will not see speedup since bzip2 packages the data into a single chunk that cannot be split between processors.
This might be what is causing the problem.
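If that is indeed the cause, the failure mode would be a decoder that only handles one bzip2 stream per file. The following is not the NGLess/bzlib-conduit code, just a small Python illustration of that general behaviour: a lone BZ2Decompressor stops at the end of the first stream and ignores everything after it, unless the caller explicitly restarts a fresh decompressor on the leftover bytes.

```python
# Sketch: how a single-stream decoder misbehaves on pbzip2-style input.
# The two tiny payloads stand in for the pieces pbzip2 compresses separately.
import bz2

multi = bz2.compress(b"first piece\n") + bz2.compress(b"second piece\n")

dec = bz2.BZ2Decompressor()
print(dec.decompress(multi))      # b'first piece\n' -- the second piece is missing
print(dec.eof)                    # True: the decoder considers itself finished
print(len(dec.unused_data) > 0)   # True: the second stream was never decoded

# A multi-stream-aware reader restarts a decompressor for each stream:
def decompress_all(blob: bytes) -> bytes:
    out = b""
    while blob:
        d = bz2.BZ2Decompressor()
        out += d.decompress(blob)
        blob = d.unused_data
    return out

assert decompress_all(multi) == b"first piece\nsecond piece\n"
```

A reader built the single-stream way either errors out or silently truncates at the first stream boundary, which would be consistent with the BZ2_bzDecompress: -1 above.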
Also:
% file *
DRR171944_1.fastq.bz2: bzip2 compressed data, block size = 900k
DRR171944_2.fastq.bz2: bzip2 compressed data, block size = 900k
DRR171944.singles.fastq.bz2: bzip2 compressed data, block size = 900k
% pbzip2 --help
...
-1 .. -9 set BWT block size to 100k .. 900k (default 900k)
-b# Block size in 100k steps (default 9 = 900k)
...
That block size value lines up with the 900170-byte uncompressed size above: the file is just over one 900k block.
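As a quick sanity check on that (assuming the bzip2 convention where 900k means 900000 bytes):

```python
# Sketch: expected number of pbzip2 streams for the failing test file,
# assuming "900k" means the bzip2 convention of 900 * 1000 bytes.
import math

block_size = 9 * 100_000    # pbzip2 default: -b9 = 900k
uncompressed = 900_170      # uncompressed size of the failing FastQ file
print(math.ceil(uncompressed / block_size))   # -> 2: it just spills into a second block
```

So the pbzip2-compressed test file would consist of two concatenated bzip2 streams, while the bzip2-compressed files are always a single stream, which would explain why only the pbzip2 files fail.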
Commit c214d2213607bf09258a1d594fcdaecbf9ee9780 adds a compressed bz2 file that shows these symptoms.
One of the tests was also modified to use this file and it currently fails.
The test fails locally both with the latest 1.0.1 static build and with a build compiled from master.
Thanks! Fortunately, this fails on Travis too, so we have a test.
Some other tests are now wrong because they all shared the same expected.fq file, but arguably they should not have been set up like this in the first place.
Oops, I'll fix that.
Actually, I was fixing it on my side, so give me a few minutes.
ok
The other tests are fixed by restoring them to what they were before and moving this issue to a new test.
For efficiency, it's good to have tests that cover a bunch of issues simultaneously, but this was the simplest way to handle it.
This is an upstream issue; I've reported it there.
This has been merged upstream (https://github.com/snoyberg/bzlib-conduit/pull/7). Once there is a new release and it makes it into the Stackage LTS, we can just bump the version that NGLess uses and close this issue.
Hi, I see this is an old thread, but this happens to me using ngless 1.5 too. What is the fix? Cheers, Ulrike
Can you perhaps share one such file?
yes, I will send you a link