libosmium

Fails to read a bz2 file

Open frodrigo opened this issue 1 year ago • 1 comments

Test with the master branch of this repo, on Linux (Archlinux).

I get an error when reading an osc.xml.bz2 file that was produced by Osmosis. The bz2 file itself looks fine: I can decompress it without problems using libbz2.so.1.0. I tested decompression with the bzcat/bzip2 command-line binaries, and libosmium is linked against the same system libbz2.so.1.0. This is not the first time I have hit this issue, but it is rare. Most of the time, bz2 files from Osmosis read fine.

$ ./examples/osmium_count /tmp/diff.osc.xml.bz2
bzip2 error: read failed: -7
$ echo $?
1
$ bzcat /tmp/diff.osc.xml.bz2
<?xml version='1.0' encoding='UTF-8'?>
<osmChange version="0.6" generator="Osmosis 0.48.3">
  <delete>
...
  </delete>
</osmChange>
$ echo $?
0

If I recompress the file with bzip2, it reads fine.

I tried to debug this and found that here feof(m_file.file()) is 0 for the original file, but 1 for the recompressed file.

https://github.com/osmcode/libosmium/blob/master/include/osmium/io/bzip2_compression.hpp#L299
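One thing that may be relevant to the feof() observation: in C stdio, feof() only becomes non-zero after a read attempt has actually run into the end of the file. A read that consumes exactly the last bytes does not set it. The snippet below is only an analogy in Python (which has no feof()), using an in-memory stream. Whether this is what happens with the attached file is an assumption, not something established here.

```python
import io

payload = b"x" * 8
f = io.BytesIO(payload)

# This read consumes every remaining byte, yet nothing about it
# signals "end of file": it simply returns the 8 bytes asked for.
chunk = f.read(8)

# Only a further read attempt reveals the end, by returning b"".
# C's feof() behaves similarly: it reports EOF only after such an attempt.
probe = f.read(1)

print(len(chunk), probe == b"")
```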

Attached original file into a zip (Github does not allow bz2 file).

diff.osc.xml.bz2.zip

frodrigo avatar Jun 03 '24 12:06 frodrigo

When I try with your file I get a BZ_UNEXPECTED_EOF error. So it looks like the file is incomplete. But the bzip2 program doesn't complain. Even running with -vvv doesn't show any problems. The result is the same for the original file as for the recompressed one. No idea what the problem here is.
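For reference, the -7 printed by osmium_count and the BZ_UNEXPECTED_EOF name are the same thing: libbz2's bzlib.h defines its return codes as small integers. The lookup below is a convenience sketch in Python. The names and values are the ones from bzlib.h, while the helper describe_bz_code is hypothetical and not part of libbz2 or libosmium.

```python
# Return codes as defined in libbz2's bzlib.h.
BZ_RETURN_CODES = {
    0: "BZ_OK",
    1: "BZ_RUN_OK",
    2: "BZ_FLUSH_OK",
    3: "BZ_FINISH_OK",
    4: "BZ_STREAM_END",
    -1: "BZ_SEQUENCE_ERROR",
    -2: "BZ_PARAM_ERROR",
    -3: "BZ_MEM_ERROR",
    -4: "BZ_DATA_ERROR",
    -5: "BZ_DATA_ERROR_MAGIC",
    -6: "BZ_IO_ERROR",
    -7: "BZ_UNEXPECTED_EOF",
    -8: "BZ_OUTBUFF_FULL",
    -9: "BZ_CONFIG_ERROR",
}


def describe_bz_code(code):
    """Hypothetical helper: turn a numeric libbz2 code into its name."""
    return BZ_RETURN_CODES.get(code, f"unknown ({code})")


print(describe_bz_code(-7))
```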

joto avatar Jun 03 '24 15:06 joto

We had the same problem with osc.bz2 files created by osmium derive-changes in very rare cases (less than 1 in 1000):

  • osmium fails to read these files with a segfault.
  • Files can be decompressed with the cmdline tool bzip2.
  • Compressed with bzip2 again, the file is larger than before.
  • This new file can be processed by osmium without a segfault.
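One way to compare a failing file with its recompressed copy is to look at how many concatenated bzip2 streams each contains and whether the last one is complete. The sketch below uses Python's standard bz2 module, not libosmium. The function name count_bz2_streams is hypothetical, and it is only an assumption that the stream layout is what differs between the two files.

```python
import bz2


def count_bz2_streams(data):
    """Return (complete_streams, decompressed_size, trailing_incomplete)."""
    streams = 0
    size = 0
    rest = data
    while rest:
        decomp = bz2.BZ2Decompressor()
        size += len(decomp.decompress(rest))
        if not decomp.eof:
            # Input ran out before the end-of-stream marker was seen.
            return streams, size, True
        streams += 1
        # Bytes after the end of this stream belong to the next one.
        rest = decomp.unused_data
    return streams, size, False


# Usage with in-memory data; for a real file, pass open(path, "rb").read().
two_streams = bz2.compress(b"a" * 100) + bz2.compress(b"b" * 100)
print(count_bz2_streams(two_streams))
```

A file written by a single bzip2 run would be expected to report one stream. A higher count shows that the file is a concatenation, which is the situation a reader has to handle stream by stream.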

We try to provide reproducible data sets ...

frankbielig avatar Nov 04 '24 11:11 frankbielig

The problem seems to be solved by the latest version of libosmium:

  • osmium-tool 1.16 with libosmium 2.19 -> error
  • osmium-tool 1.16 with libosmium 2.20 -> OK

frankbielig avatar Nov 11 '24 13:11 frankbielig

The problem seems to be solved by the latest version of libosmium:

That's strange. The original reporter reported this problem for the master branch in June, which is after 2.20.0 was released. And I can't see any changes between 2.19.0 and 2.20.0 that would even remotely explain this.

So I have to assume that it is rather a more elusive problem that only shows up sometimes and might come back again!?

joto avatar Nov 11 '24 14:11 joto

At the moment we cannot find a dataset to reproduce the problem. In our usage, the problem may be related to this newly reported bug in the osmium tool, which produces incomplete osc.bz2 files.

frankbielig avatar Nov 16 '24 18:11 frankbielig

I don't think that bug has anything to do with this one.

joto avatar Nov 17 '24 10:11 joto

I believe I have found and fixed the bug. bzip2 can store data in blocks, each one encoded like a full bzip2 file, which is useful for parallelization. We have to take this into account when reading, and we did: when there was data left in the file after reading a bzip2 block, we read the next one. But there is a corner case where there seems to be more data and then there isn't, and that case wasn't handled properly. The docs don't mention this case, so I didn't expect it to happen, but apparently it can. I don't know why. At least the example file that @frodrigo provided works now.
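The reading loop described above can be sketched as follows. This is not the libosmium fix, which is C++ on top of the libbz2 C API. It is a minimal Python illustration using the standard bz2 module, and the function name read_multistream and the chunk_size parameter are hypothetical. The corner case is assumed here to be a stream that ends exactly where the last chunk read from the file ends: there are no leftover bytes and end-of-file has not been seen yet, so the reader starts a new stream and only then finds out there is nothing left.

```python
import bz2
import io


def read_multistream(fileobj, chunk_size=4096):
    """Decompress a file made of one or more concatenated bzip2 streams."""
    out = []
    decomp = bz2.BZ2Decompressor()
    fed = False    # has the current stream received any bytes yet?
    pending = b""  # bytes read from the file but not yet decompressed
    while True:
        if not pending:
            pending = fileobj.read(chunk_size)
        if not pending:
            # End of input. This is only an error if a stream was cut
            # off midway. If the new stream has seen no bytes at all,
            # the previous one simply ended on the chunk boundary.
            if fed:
                raise EOFError("bzip2 stream ended unexpectedly")
            break
        out.append(decomp.decompress(pending))
        fed = True
        if decomp.eof:
            # Stream finished: keep leftover bytes for the next stream.
            pending = decomp.unused_data
            decomp = bz2.BZ2Decompressor()
            fed = False
        else:
            pending = b""
    return b"".join(out)


first = bz2.compress(b"a" * 1000)
second = bz2.compress(b"b" * 1000)
# A chunk size equal to the first stream's length forces the boundary case.
data = read_multistream(io.BytesIO(first + second), chunk_size=len(first))
print(len(data))
```

The design choice is the fed flag: running out of input is treated as a clean end only when the freshly started stream has not consumed a single byte, and as a truncated file otherwise.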

joto avatar Dec 17 '24 19:12 joto