error while decoding serialized histogram produced by rust version

Open tdyas opened this issue 4 years ago • 12 comments

I am generating histograms in Rust and am deserializing in Python using the HDR Histogram libraries for Rust and Python. The Rust code produces a byte array with the encoded histogram which ends up as a bytes instance in Python. (The project is a Python program that integrates with a Rust library via the cpython crate.)

It appears that the Python library is only able to decode the encoded histogram if Rust encodes using hdrhistogram::serialization::V2DeflateSerializer and further encodes it using base64 (via Python's base64.b64encode).

Without the base64 encoding, decoding with histogram = HdrHistogram.decode(encoded_histogram, b64_wrap=False) results in this error:

Traceback (most recent call last):
  ...
  File "XXX/hdrhistogram-0.8.0-cp38-cp38-macosx_10_15_x86_64.whl-install/hdrh/histogram.py", line 580, in decode
    hdr_payload = HdrHistogramEncoder.decode(encoded_histogram, b64_wrap)
  File "XXX/hdrhistogram-0.8.0-cp38-cp38-macosx_10_15_x86_64.whl-install/hdrh/codec.py", line 356, in decode
    hdr_payload = HdrPayload(8, compressed_payload=cpayload)
  File "XXX/hdrhistogram-0.8.0-cp38-cp38-macosx_10_15_x86_64.whl-install/hdrh/codec.py", line 146, in __init__
    self._decompress(compressed_payload)
  File "XXX/hdrhistogram-0.8.0-cp38-cp38-macosx_10_15_x86_64.whl-install/hdrh/codec.py", line 197, in _decompress
    self._data = zlib.decompress(compressed_payload)
zlib.error: Error -3 while decompressing data: incorrect header check

Using uncompressed encoding (via hdrhistogram::serialization::V2Serializer in Rust) and base64 encoding in Python results in this error:

Traceback (most recent call last):
  ...
  File "XXX/hdrhistogram-0.8.0-cp38-cp38-macosx_10_15_x86_64.whl-install/hdrh/histogram.py", line 580, in decode
    hdr_payload = HdrHistogramEncoder.decode(encoded_histogram, b64_wrap)
  File "XXX/hdrhistogram-0.8.0-cp38-cp38-macosx_10_15_x86_64.whl-install/hdrh/codec.py", line 346, in decode
    raise HdrCookieException()
hdrh.codec.HdrCookieException

And using uncompressed encoding without base64 results in:

Traceback (most recent call last):
  ...
  File "XXX/hdrhistogram-0.8.0-cp38-cp38-macosx_10_15_x86_64.whl-install/hdrh/histogram.py", line 580, in decode
    hdr_payload = HdrHistogramEncoder.decode(encoded_histogram, b64_wrap)
  File "XXX/hdrhistogram-0.8.0-cp38-cp38-macosx_10_15_x86_64.whl-install/hdrh/codec.py", line 356, in decode
    hdr_payload = HdrPayload(8, compressed_payload=cpayload)
  File "XXX/hdrhistogram-0.8.0-cp38-cp38-macosx_10_15_x86_64.whl-install/hdrh/codec.py", line 146, in __init__
    self._decompress(compressed_payload)
  File "XXX/hdrhistogram-0.8.0-cp38-cp38-macosx_10_15_x86_64.whl-install/hdrh/codec.py", line 197, in _decompress
    self._data = zlib.decompress(compressed_payload)
zlib.error: Error -3 while decompressing data: incorrect header check

tdyas avatar Jan 08 '21 04:01 tdyas

traceback 1: This looks like an issue with the compressed data (what is base 64 encoded).

traceback 2 and 3: uncompressed histogram is not a valid/supported format as far as I know

If you can provide an example of rust generated histoblob (base64 compressed) that fails decoding in python, I can have a closer look. Have you tried decoding the same histoblob using other decoders (java, C, go...)? Have you tried the reverse (decode in rust a histoblob generated by python library)?

ahothan avatar Jan 08 '21 08:01 ahothan

> traceback 1: This looks like an issue with the compressed data (what is base 64 encoded).

The data was the raw set of bytes for the histogram with no base64 encoding. The Rust encoder for compressed histograms does not appear to do base64 encoding. See https://github.com/HdrHistogram/HdrHistogram_rust/blob/89ea97afdfa543a6b7a0ebc8c7d03eddf66affb3/src/serialization/v2_deflate_serializer.rs#L75-L133

> traceback 2 and 3: uncompressed histogram is not a valid/supported format as far as I know

The Rust code is able to produce uncompressed histograms though. See https://github.com/HdrHistogram/HdrHistogram_rust/blob/89ea97afdfa543a6b7a0ebc8c7d03eddf66affb3/src/serialization/v2_serializer.rs#L67-L115

> If you can provide an example of rust generated histoblob (base64 compressed) that fails decoding in python, I can have a closer look.

The Rust side of the project is here: https://github.com/tdyas/pants/blob/9f4e51cb0bc0293e56c7fa6376f7530d008ceaf5/src/rust/engine/workunit_store/src/lib.rs#L730-L756

On the Python side, I have to call base64.b64encode on the raw bytes before passing them to the decoder. Then the Python decoder works.

> Have you tried decoding the same histoblob using other decoders (java, C, go...)?

I have not.

Maybe this is a bug in the Rust encoder where it fails to base64 encode?

> Have you tried the reverse (decode in rust a histoblob generated by python library)?

I have not. The Python code is the part of the project that uploads histograms out of the Pants build tool into a server for histograms collected in the Rust engine.

tdyas avatar Jan 08 '21 09:01 tdyas

~Histogram serialization does not involve base64; it just produces bytes. See EncodableHistogram#encodeIntoCompressedByteBuffer's implementations in the Java implementation. It may be base64'd later for transport in plain-text environments like a text histogram log, but that's separate -- it would be inefficient to always base64.~ edit: I misread; I thought there was confusion over whether the raw serialized form itself should always be base64'd.

There are 4 kinds of encoding: V0, V1, V2, V2+Deflate. The Rust implementation currently supports the latter two.

marshallpierce avatar Jan 08 '21 13:01 marshallpierce

Maybe we can discuss this on https://gitter.im/HdrHistogram/HdrHistogram? The Python library supports V2, which to my knowledge means compressed + base64 and, optionally, compressed without base64.

ahothan avatar Jan 08 '21 16:01 ahothan

@tdyas it looks like the only path that would work is if you generate on Rust side using hdrhistogram::serialization::V2DeflateSerializer (and without base64) and use the python decode with b64_wrap=False

(was not clear above which format you were using when you say b64_wrap=False did not work)

To move forward, can you send an example of Rust generated compressed histogram (base64 version works) and I can have a look on my side why the python decode fails.

ahothan avatar Jan 08 '21 16:01 ahothan

Here is the failure with a compressed blob with a single observation (and the success once it has been base64 encoded). The value was produced by V2DeflateSerializer in the Rust library.

Python 3.8.6 (default, Nov  2 2020, 08:14:47)
[Clang 12.0.0 (clang-1200.0.32.21)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> encoded = b"\x1c\x84\x93\x14\x00\x00\x00\x1fx\x9c\x93i\x99,\xcc\xc0\xc0\xc0\xcc\x00\x010\x9a\x11J3\xd9\x7f\x800\xfe32\x01\x00E\x0c\x03\x81"
>>> from hdrh.histogram import HdrHistogram
>>> h = HdrHistogram.decode(encoded, b64_wrap=False)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "XXX/foo/lib/python3.8/site-packages/hdrh/histogram.py", line 580, in decode
    hdr_payload = HdrHistogramEncoder.decode(encoded_histogram, b64_wrap)
  File "XXX/foo/lib/python3.8/site-packages/hdrh/codec.py", line 356, in decode
    hdr_payload = HdrPayload(8, compressed_payload=cpayload)
  File "XXX/foo/lib/python3.8/site-packages/hdrh/codec.py", line 146, in __init__
    self._decompress(compressed_payload)
  File "XXX/foo/lib/python3.8/site-packages/hdrh/codec.py", line 197, in _decompress
    self._data = zlib.decompress(compressed_payload)
zlib.error: Error -3 while decompressing data: incorrect header check
>>> import base64
>>> h = HdrHistogram.decode(base64.b64encode(encoded))
>>> h.get_total_count()
1
>>>

Here is the failure with an uncompressed blob produced by V2Serializer in the Rust library:

>>> encoded_uncompressed = b'\x1c\x84\x93\x13\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00\x02?\xf0\x00\x00\x00\x00\x00\x00\xff\x01\x02'
>>> h = HdrHistogram.decode(encoded_uncompressed, b64_wrap=False)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "XXX/foo/lib/python3.8/site-packages/hdrh/histogram.py", line 580, in decode
    hdr_payload = HdrHistogramEncoder.decode(encoded_histogram, b64_wrap)
  File "XXX/foo/lib/python3.8/site-packages/hdrh/codec.py", line 356, in decode
    hdr_payload = HdrPayload(8, compressed_payload=cpayload)
  File "XXX/foo/lib/python3.8/site-packages/hdrh/codec.py", line 146, in __init__
    self._decompress(compressed_payload)
  File "XXX/foo/lib/python3.8/site-packages/hdrh/codec.py", line 197, in _decompress
    self._data = zlib.decompress(compressed_payload)
zlib.error: Error -3 while decompressing data: incorrect header check
>>> h = HdrHistogram.decode(base64.b64encode(encoded_uncompressed))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "XXX/foo/lib/python3.8/site-packages/hdrh/histogram.py", line 580, in decode
    hdr_payload = HdrHistogramEncoder.decode(encoded_histogram, b64_wrap)
  File "XXX/foo/lib/python3.8/site-packages/hdrh/codec.py", line 346, in decode
    raise HdrCookieException()
hdrh.codec.HdrCookieException

tdyas avatar Jan 10 '21 09:01 tdyas

Yes I got the backtraces but I really need to get a hold on the buffer you pass to decode() so I can try to reproduce on my computer and decode it manually.

h = HdrHistogram.decode(encoded, b64_wrap=False)

The "encoded" buffer, can you copy it here in base64 format?

You can either print directly the result of hdrhistogram::serialization::V2DeflateSerializer with base64 or wrap in base64 the output of hdrhistogram::serialization::V2DeflateSerializer

ahothan avatar Jan 12 '21 17:01 ahothan

> Yes I got the backtraces but I really need to get a hold on the buffer you pass to decode() so I can try to reproduce on my computer and decode it manually.

The buffers are in there as Python bytes literals:

encoded = b"\x1c\x84\x93\x14\x00\x00\x00\x1fx\x9c\x93i\x99,\xcc\xc0\xc0\xc0\xcc\x00\x010\x9a\x11J3\xd9\x7f\x800\xfe32\x01\x00E\x0c\x03\x81"

and:

encoded_uncompressed = b'\x1c\x84\x93\x13\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00\x02?\xf0\x00\x00\x00\x00\x00\x00\xff\x01\x02'

tdyas avatar Jan 12 '21 18:01 tdyas

And here they are converted to base64:

Python 3.8.6 (default, Nov  2 2020, 08:14:47)
[Clang 12.0.0 (clang-1200.0.32.21)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import base64
>>> encoded = b"\x1c\x84\x93\x14\x00\x00\x00\x1fx\x9c\x93i\x99,\xcc\xc0\xc0\xc0\xcc\x00\x010\x9a\x11J3\xd9\x7f\x800\xfe32\x01\x00E\x0c\x03\x81"
>>> base64.b64encode(encoded)
b'HISTFAAAAB94nJNpmSzMwMDAzAABMJoRSjPZf4Aw/jMyAQBFDAOB'

and:

>>> encoded_uncompressed = b'\x1c\x84\x93\x13\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00\x02?\xf0\x00\x00\x00\x00\x00\x00\xff\x01\x02'
>>> base64.b64encode(encoded_uncompressed)
b'HISTEwAAAAMAAAAAAAAAAwAAAAAAAAABAAAAAAAAAAI/8AAAAAAAAP8BAg=='
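
Both base64 strings start with HIST because the first three bytes of each blob are 0x1c 0x84 0x93; the fourth byte (0x14 versus 0x13) is what tells the two blobs apart. Here is a minimal sketch of reading that cookie, assuming only the two cookie values visible in the blobs above. The classify_blob helper and its labels are hypothetical and not part of hdrh.

```python
import base64
import struct

# Cookie values copied from the two blobs in this thread (not an exhaustive list).
V2_COOKIE = 0x1C849313             # blob produced by V2Serializer
V2_COMPRESSED_COOKIE = 0x1C849314  # blob produced by V2DeflateSerializer


def classify_blob(blob_b64):
    """Return a label for a base64-wrapped histogram blob based on its cookie."""
    raw = base64.b64decode(blob_b64)
    # The cookie is the first 4 bytes, big-endian.
    (cookie,) = struct.unpack_from(">I", raw, 0)
    if cookie == V2_COMPRESSED_COOKIE:
        return "v2-deflate"
    if cookie == V2_COOKIE:
        return "v2"
    return "unknown"


print(classify_blob("HISTFAAAAB94nJNpmSzMwMDAzAABMJoRSjPZf4Aw/jMyAQBFDAOB"))
print(classify_blob("HISTEwAAAAMAAAAAAAAAAwAAAAAAAAABAAAAAAAAAAI/8AAAAAAAAP8BAg=="))
```

A check like this can be run before calling the decoder to see which serializer produced a given blob.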

tdyas avatar Jan 12 '21 18:01 tdyas

ok here's what I found on the decode of a rust V2 compressed histogram. The base64 encoded string (rust_compressed_b64) works fine when decoding on python:

from hdrh.histogram import HdrHistogram

def test_rust():
    # Base64-wrapped output of V2DeflateSerializer: decodes fine.
    rust_compressed_b64 = "HISTFAAAAB94nJNpmSzMwMDAzAABMJoRSjPZf4Aw/jMyAQBFDAOB"
    histogram = HdrHistogram.decode(rust_compressed_b64)

    # Same bytes without base64: this decode raises zlib.error.
    rust_compressed = b"\x1c\x84\x93\x14\x00\x00\x00\x1fx\x9c\x93i\x99,\xcc\xc0\xc0\xc0\xcc\x00\x010\x9a\x11J3\xd9\x7f\x800\xfe32\x01\x00E\x0c\x03\x81"
    histogram = HdrHistogram.decode(rust_compressed, b64_wrap=False)

However, the non-base64 compressed buffer (rust_compressed) fails. I added some traces to dump the buffer being decompressed in each case, and they do not match:

########BUFFER len=31
b'789c9369992cccc0c0c0cc0001309a114a33d97f8030fe33320100450c0381'
########BUFFER len=39
b'1c8493140000001f789c9369992cccc0c0c0cc0001309a114a33d97f8030fe33320100450c0381'

As you can see, the Rust compressed buffer has 8 extra bytes at the start, which explains why the zlib decompression fails. These first 8 bytes are unexpected: b'1c8493140000001f'

ahothan avatar Jan 18 '21 03:01 ahothan

That's the v2 compressed cookie and the length. 0x1f is 31, which is the length of the buffer.
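
As a sketch of that layout using only the standard library: the first four bytes are read as a big-endian cookie, the next four as the payload length, and only the remainder is handed to zlib. The helper name unwrap_v2_deflate is hypothetical, and this is not how hdrh itself is implemented; it only illustrates the header described here.

```python
import struct
import zlib

# Header layout: 4-byte big-endian cookie followed by 4-byte big-endian length.
HEADER = struct.Struct(">II")
V2_COMPRESSED_COOKIE = 0x1C849314  # value seen in the blob from this thread


def unwrap_v2_deflate(blob):
    """Strip the 8-byte header and return the zlib-decompressed payload."""
    cookie, length = HEADER.unpack_from(blob, 0)
    if cookie != V2_COMPRESSED_COOKIE:
        raise ValueError(f"unexpected cookie {cookie:#x}")
    body = blob[HEADER.size:HEADER.size + length]
    if len(body) != length:
        raise ValueError("payload shorter than declared length")
    return zlib.decompress(body)


encoded = b"\x1c\x84\x93\x14\x00\x00\x00\x1fx\x9c\x93i\x99,\xcc\xc0\xc0\xc0\xcc\x00\x010\x9a\x11J3\xd9\x7f\x800\xfe32\x01\x00E\x0c\x03\x81"

# Handing the whole buffer to zlib fails, because it starts with the cookie
# rather than a zlib header.
try:
    zlib.decompress(encoded)
    whole_buffer_ok = True
except zlib.error:
    whole_buffer_ok = False

# Skipping the header first lets zlib see the stream that begins with 0x78 0x9c.
inner = unwrap_v2_deflate(encoded)
print(whole_buffer_ok, len(inner))
```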

marshallpierce avatar Jan 18 '21 03:01 marshallpierce

> That's the v2 compressed cookie and the length. 0x1f is 31, which is the length of the buffer.

The code path in the decode function for base64-encoding seems to remove the header off the buffer, but the non-base64 code path does not. https://github.com/HdrHistogram/HdrHistogram_py/blob/6462abfaf4b1557769e366ea620ad94f51fbc605/hdrh/codec.py#L353-L355
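
Given that observation, one workaround consistent with what worked earlier in this thread is to base64-wrap the raw Rust output and decode it through the default base64 path. Here is a minimal sketch with a sanity check on the header first. The name to_b64_wrapped is a hypothetical helper, and the final HdrHistogram.decode call is left as a comment because it needs the hdrh package.

```python
import base64
import struct


def to_b64_wrapped(raw):
    """Validate the 8-byte header, then return the base64-wrapped blob."""
    # Cookie and payload length are both 4-byte big-endian integers.
    cookie, length = struct.unpack_from(">II", raw, 0)
    if cookie != 0x1C849314:  # V2 compressed cookie seen in this thread
        raise ValueError("not a V2 compressed histogram blob")
    if length != len(raw) - 8:
        raise ValueError("declared payload length does not match buffer")
    return base64.b64encode(raw)


raw = b"\x1c\x84\x93\x14\x00\x00\x00\x1fx\x9c\x93i\x99,\xcc\xc0\xc0\xc0\xcc\x00\x010\x9a\x11J3\xd9\x7f\x800\xfe32\x01\x00E\x0c\x03\x81"
wrapped = to_b64_wrapped(raw)
print(wrapped)
# With hdrh installed, the wrapped value can then go through the base64 path:
# histogram = HdrHistogram.decode(wrapped)
```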

tdyas avatar Jan 18 '21 04:01 tdyas