Fails to read a bz2 file
Tested with the master branch of this repo, on Linux (Arch Linux).
I got an error reading an osc.xml.bz2 file produced by Osmosis. The bz2 file looks fine: I can decompress it without problems using libbz2.so.1.0.
I tested decompression with the bzcat/bzip2 command-line binaries. libosmium is linked against the same system libbz2.so.1.0.
This is not the first time I have hit this issue, but it is rare: most of the time bz2 files from Osmosis are read without problems.
$ ./examples/osmium_count /tmp/diff.osc.xml.bz2
bzip2 error: read failed: -7
$ echo $?
1
$ bzcat /tmp/diff.osc.xml.bz2
<?xml version='1.0' encoding='UTF-8'?>
<osmChange version="0.6" generator="Osmosis 0.48.3">
<delete>
...
</delete>
</osmChange>
$ echo $?
0
If I recompress the bz2 file, it is read fine.
I tried to debug and found that, at the line linked below, feof(m_file.file()) is 0 for the original file but 1 for the recompressed file.
https://github.com/osmcode/libosmium/blob/master/include/osmium/io/bzip2_compression.hpp#L299
The original file is attached inside a zip (GitHub does not allow bz2 files).
When I try with your file I get a BZ_UNEXPECTED_EOF error. So it looks like the file is incomplete. But the bzip2 program doesn't complain. Even running with -vvv doesn't show any problems. The result is the same for the original file as for the recompressed one. I have no idea what the problem is here.
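For readers matching the `-7` in the error message to the name `BZ_UNEXPECTED_EOF`, here is a small lookup sketch. The numeric values are the error constants as defined in `bzlib.h`; the table name `BZ_ERROR_NAMES` and the helper `describe_bz_error` are hypothetical and exist only for this illustration, they are not part of libosmium or libbz2.

```python
# Error codes as defined in bzlib.h (negative return values only).
# BZ_ERROR_NAMES and describe_bz_error are illustrative names, not a real API.
BZ_ERROR_NAMES = {
    -1: "BZ_SEQUENCE_ERROR",
    -2: "BZ_PARAM_ERROR",
    -3: "BZ_MEM_ERROR",
    -4: "BZ_DATA_ERROR",
    -5: "BZ_DATA_ERROR_MAGIC",
    -6: "BZ_IO_ERROR",
    -7: "BZ_UNEXPECTED_EOF",
    -8: "BZ_OUTBUFF_FULL",
    -9: "BZ_CONFIG_ERROR",
}


def describe_bz_error(code: int) -> str:
    """Return the symbolic name for a libbz2 error code, if known."""
    return BZ_ERROR_NAMES.get(code, f"unknown bzip2 error ({code})")


if __name__ == "__main__":
    # The code reported by osmium_count in this issue.
    print(describe_bz_error(-7))
```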
We had the same problem with osc.bz2 files created by osmium derive-changes in very rare cases (less than 1 in 1000):
- osmium fails to read these files with a segfault.
- Files can be decompressed with the cmdline tool bzip2.
- Compressed with bzip2 again, the file is larger than before.
- This new file can be processed by osmium without segfault.
We try to provide reproducible data sets ...
The problem seems to be solved by the latest version of libosmium:
- osmium-tool 1.16 with libosmium 2.19 -> error
- osmium-tool 1.16 with libosmium 2.20 -> OK
> The problem seems to be solved by the latest version of libosmium:
That's strange. The original reporter reported this problem for the master branch in June, which is after 2.20.0 was released. And I can't see any changes between 2.19.0 and 2.20.0 which would even remotely explain this.
So I have to assume that this is a more elusive problem that only shows up sometimes and might come back again.
At the moment we cannot find a dataset to reproduce the problem. In our usage, the problem may be related to this newly reported bug in the osmium tool, which produces incomplete osc.bz2 files.
I don't think that bug has anything to do with this one.
I believe I have found and fixed the bug. bzip2 can store data in blocks, each one encoded like a full bzip2 file, which is useful for parallelization. When reading we have to take this into account, and we did: when there was data left in the file after reading a bzip2 block, we read the next one. But there is a corner case where there seems to be more data and then there isn't, and that case wasn't handled properly. The docs don't mention this case, so I didn't expect it, but apparently it can happen. I don't know why. At least the example file that @frodrigo provided works now.
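To illustrate the corner case described above, here is a minimal sketch in Python using the standard `bz2` module. It is not the libosmium fix, which is C++ and uses libbz2 directly. The function name `decompress_multistream` and its parameters are made up for this example. The idea: when a bzip2 stream ends exactly at the end of the data read so far, the reader cannot know whether another stream follows until it tries to read again, and an empty read at that point has to be treated as a clean end of file rather than as an unexpected EOF.

```python
import bz2
import io


def decompress_multistream(fileobj, chunk_size=64 * 1024):
    """Decompress a bzip2 file made of one or more concatenated streams.

    Returns (decompressed_bytes, number_of_streams).
    Hypothetical helper for illustration only.
    """
    chunks = []
    streams = 0
    decomp = bz2.BZ2Decompressor()
    mid_stream = False  # True once the current stream has received input

    data = fileobj.read(chunk_size)
    while data:
        chunks.append(decomp.decompress(data))
        mid_stream = True
        if decomp.eof:
            # One complete stream was decoded; prepare for a possible next one.
            streams += 1
            data = decomp.unused_data  # bytes after the end of this stream
            decomp = bz2.BZ2Decompressor()
            mid_stream = False
            if data:
                continue  # leftover bytes are the start of the next stream
            # Corner case: the stream ended exactly at the end of the buffer.
            # Only the next read tells us whether more data exists.
        data = fileobj.read(chunk_size)

    if mid_stream:
        # Input ran out in the middle of a stream: a real truncation.
        raise EOFError("compressed data ended before the end of the stream")
    return b"".join(chunks), streams


if __name__ == "__main__":
    first = bz2.compress(b"hello ")
    second = bz2.compress(b"world")
    # Reading in chunks of exactly len(first) forces the corner case.
    result = decompress_multistream(io.BytesIO(first + second), len(first))
    print(result)
```

The design choice worth noting is the `mid_stream` flag: an empty read is only an error if the current stream has already consumed input without reaching its end marker.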