
Adds support for decoding floating-point typed arrays from RFC 8746

Open tgockel opened this issue 3 years ago • 13 comments

This adds support for decoding arrays of floating-point numbers in the IEEE 754 binary16, binary32, and binary64 formats, in both big- and little-endian forms.


If this looks good, we can add unsigned and signed integers using the same general ideas...and also encoders for these special markers.

tgockel avatar May 25 '21 23:05 tgockel

Hi @tgockel, thanks for doing this. I had done some experiments a while back decoding typed arrays into python array.array types. I think that might be faster. It also lets you do a round-trip:

https://github.com/Sekenre/cbor2/commit/a117ad37375a1811c1e0188a3ab6693284b94a70

This is related to #32 and is maybe a simple way to handle it without needing numpy as a dependency.

Let me know what you think, I'm open to suggestions.
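For reference, the array.array approach can be sketched roughly like this (a minimal illustration, not the code from the linked commit; `FLOAT_TAGS` and `decode_float_array` are hypothetical names, and only the binary32/binary64 tags from RFC 8746 are shown):

```python
import array
import sys

# RFC 8746 float typed-array tags -> (array typecode, payload endianness).
# 81/82 are big-endian binary32/binary64; 85/86 are their little-endian twins.
FLOAT_TAGS = {
    81: ('f', 'big'), 82: ('d', 'big'),
    85: ('f', 'little'), 86: ('d', 'little'),
}

def decode_float_array(tag, payload):
    """Turn the byte string of an RFC 8746 float typed array into array.array."""
    typecode, endianness = FLOAT_TAGS[tag]
    result = array.array(typecode, payload)
    if endianness != sys.byteorder:
        result.byteswap()  # normalise foreign-endian data to native order
    return result
```

Because array.array stores native machine values, the byteswap only happens when the wire endianness differs from the host's.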

Sekenre avatar May 28 '21 10:05 Sekenre

Coverage Status

Coverage decreased (-0.3%) to 96.892% when pulling 70e6f6c707b67c7a0a40fc84f27f5f85d9006f56 on tgockel:rfc8746 into 9f30439a9cd9cbb4103bd0c4882a48d9c85eb84d on agronholm:master.

coveralls avatar May 28 '21 10:05 coveralls

I had never seen array before, but it definitely seems like the right approach instead of the weird struct trickery I did. Unfortunately, array.array doesn't support half-precision floats, but I updated the single- and double-precision floating-point algorithms to use it.

The biggest issue I see is immutability -- array.array does not have a convenient method like numpy's array.setflags(write=False) for this. I left comments with TODO(tgockel/111) for this, but I don't know an elegant way to address this one.
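One possible workaround for the missing half-precision typecode (a sketch, not part of this PR; `decode_float16_array` is a hypothetical name) is to widen binary16 values through struct, which has supported the 'e' half-float format since Python 3.6:

```python
import array
import struct

def decode_float16_array(payload, byteorder='<'):
    """Unpack an RFC 8746 binary16 payload into array.array.

    array.array has no half-float typecode, so each value is widened
    to a native double via struct's 'e' (binary16) format.
    """
    count = len(payload) // 2
    values = struct.unpack(f'{byteorder}{count}e', payload)
    return array.array('d', values)
```

The cost is that the decoded array occupies four times the wire size, but every binary16 value is exactly representable as a binary64.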

tgockel avatar May 31 '21 06:05 tgockel

The biggest issue I see is immutability -- array.array does not have a convenient method like numpy's array.setflags(write=False) for this. I left comments with TODO(tgockel/111) for this, but I don't know an elegant way to address this one.

If you want it to be immutable, you can wrap the bytes in a memoryview and then cast it, like this:

>>> my_array = memoryview(b'\x1f\x85\xebQ\xb8\x1e\t@').cast('d')
>>> assert my_array[0] == 3.14
>>> my_array[0] = 2.16
Traceback (most recent call last):
  File "<pyshell#57>", line 1, in <module>
    my_array[0] = 2.16
TypeError: cannot modify read-only memory

Sekenre avatar Jun 02 '21 16:06 Sekenre

That unfortunately doesn't work because the ultimate point of making this read-only is so that it can be used as keys in a dictionary, but memoryview hashing has a shortcoming:

ValueError: memoryview: hashing is restricted to formats 'B', 'b' or 'c'
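To illustrate, the restriction shows up as soon as you try to hash such a view:

```python
# A memoryview cast to 'd' is read-only over bytes, but it still cannot
# be hashed, so it cannot serve as a dictionary key.
view = memoryview(b'\x1f\x85\xebQ\xb8\x1e\t@').cast('d')
try:
    hash(view)
except ValueError as exc:
    print(exc)  # hashing is restricted to formats 'B', 'b' or 'c'
```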

tgockel avatar Jun 02 '21 16:06 tgockel

I tried writing a little class to represent a float16 array instead of converting to a list of floats and posted it here: https://codereview.stackexchange.com/q/261573/243247. This lets you write an encoder that can just copy the underlying buffer into the output. This could be added to cbor2.types.

Sekenre avatar Jun 05 '21 17:06 Sekenre

There's an interesting question on hashing -- should the endianness of the generated source affect hashing? Say an x86 machine and an AArch64 machine both generate [1.5, 2.5] and encode it as a half-precision typed array; call the results arr_le and arr_be. Should hash(arr_le) == hash(arr_be)? What about hash((1.5, 2.5))? I think a user would expect all three hashes to be equal.

This gets even hairier when we get into integer vs. float comparisons. In Python, hash(2) == hash(2.0). Per the documentation of hash:

Numeric values that compare equal have the same hash value (even if they are of different types, as is the case for 1 and 1.0).

This extends to tuples, as hash((2, 3, 4)) == hash((2.0, 3.0, 4.0)).

I'm not sure there is a good answer here. My solution of calling tuple(input) has the disadvantage of poor performance, but it only happens when a typed array is used as a key to a map, which I don't think happens all that frequently in the world.
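The tuple(input) behaviour described above can be sketched like this (a minimal illustration using struct's 'e' binary16 format, not the PR's actual decode path):

```python
import struct

# The same [1.5, 2.5] half-precision array on both wire endiannesses.
le_payload = struct.pack('<2e', 1.5, 2.5)  # little-endian binary16
be_payload = struct.pack('>2e', 1.5, 2.5)  # big-endian binary16

# Decoding to a tuple erases the wire endianness entirely...
arr_le = struct.unpack('<2e', le_payload)
arr_be = struct.unpack('>2e', be_payload)
assert hash(arr_le) == hash(arr_be) == hash((1.5, 2.5))

# ...and Python's numeric hashing makes int/float mixes collide too.
assert hash((2, 3, 4)) == hash((2.0, 3.0, 4.0))
```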

tgockel avatar Jun 06 '21 20:06 tgockel

should the endianness of the generated source affect hashing?

IMO: No, it should not. Foreign-endian data should always be converted to native endianness prior to hashing, and each platform should write arrays in its native format, since the byte order can always be unambiguously tagged.
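Writing in native order then reduces to picking the matching RFC 8746 tag, roughly like this (a sketch; `encode_float64_array` is an illustrative name, and 82/86 are the big- and little-endian binary64 tags):

```python
import array
import sys

def encode_float64_array(values):
    """Serialise floats in native byte order and report the matching
    RFC 8746 tag (86 = float64 little-endian, 82 = float64 big-endian)."""
    arr = array.array('d', values)
    tag = 86 if sys.byteorder == 'little' else 82
    return tag, arr.tobytes()
```

No byteswap ever happens on the encode side; the reader does the swap only if its own endianness differs from the tagged one.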

This extends to tuples, as hash((2, 3, 4)) == hash((2.0, 3.0, 4.0))

Does that hashing behaviour hold true for numpy 1d arrays? Would it just be easier to require numpy for handling these?

Sekenre avatar Jun 09 '21 12:06 Sekenre

numpy arrays avoid the problem by not being hashable.

tgockel avatar Jun 14 '21 00:06 tgockel

@tgockel @Sekenre do you have plans to merge this pull request? Typed arrays is exactly the feature I miss

escherstair avatar Nov 17 '22 15:11 escherstair

Bump. Any movement on getting various floating point formats encoded with CBOR?

brendan-simon-indt avatar Jan 20 '23 05:01 brendan-simon-indt

Bump. Any movement on getting various floating point formats encoded with CBOR?

The problem with immutability/hashability has not been solved yet. If you want this faster, participate in the process of finding solutions.

agronholm avatar Jul 14 '23 11:07 agronholm

Bump. Any movement on getting various floating point formats encoded with CBOR?

The problem with immutability/hashability has not been solved yet. If you want this faster, participate in the process of finding solutions.

I found a solution that works for me: casting to np.floatX, then back to float, then using canonical=True when encoding.

value_to_encode = float(np.float16(value))
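A numpy-free version of the same trick (a sketch; `round_to_float16` is a hypothetical name) uses struct's 'e' format to round the value through binary16 before handing it to a canonical encoder:

```python
import struct

def round_to_float16(value):
    """Round-trip a float through IEEE 754 binary16, mimicking
    float(np.float16(value)) without the numpy dependency."""
    return struct.unpack('<e', struct.pack('<e', value))[0]

# With the precision already reduced, a canonical encoder can emit the
# value as a half-float with no further loss.
value_to_encode = round_to_float16(3.14159)
```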

brendan-simon-indt avatar Jul 14 '23 12:07 brendan-simon-indt