
Encoding ("boxing") is unclear.

Open RokerHRO opened this issue 5 years ago • 5 comments

After reading the README.md I thought that this library "encodes" all the supported data into the 51 "payload" bits of IEEE-754 "double" qNaN values.

But after looking into the header file, that does not seem to be the case, because the highest 13 bits of the 64-bit value are not always set to 1:

 * The top 16-bits denote the type of the encoded nanbox_t:
 *
 *     Pointer {  0000:PPPP:PPPP:PPPP
 *             /  0001:xxxx:xxxx:xxxx
 *     Aux.   {           ...
 *             \  0005:xxxx:xxxx:xxxx
 *     Integer {  0006:0000:IIII:IIII
 *              / 0007:****:****:****
 *     Double  {          ...
 *              \ FFFF:****:****:****

So "pointers", "aux", "integers" and, most of all, finite "double" values are "encoded" as normal (albeit quite small) IEEE double values that are not NaN values.
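To illustrate the layout quoted above: the top 16 bits alone are enough to tell the categories apart. The following is a minimal sketch in C, not the library's code. The names `sketch_kind`, `sketch_classify` and `sketch_int_payload` are made up for this example, and it assumes the 64-bit layout from the header comment.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical names for the four categories in the quoted comment. */
enum sketch_kind { SK_POINTER, SK_AUX, SK_INTEGER, SK_DOUBLE };

/* Classify a 64-bit boxed value by looking only at its top 16 bits. */
enum sketch_kind sketch_classify(uint64_t boxed) {
    uint16_t top = (uint16_t)(boxed >> 48);
    if (top == 0x0000) return SK_POINTER;  /* 0000:PPPP:PPPP:PPPP */
    if (top <= 0x0005) return SK_AUX;      /* 0001..0005          */
    if (top == 0x0006) return SK_INTEGER;  /* 0006:0000:IIII:IIII */
    return SK_DOUBLE;                      /* 0007..FFFF          */
}

/* Extract the 32-bit integer stored in the low half of an integer box. */
int32_t sketch_int_payload(uint64_t boxed) {
    assert(sketch_classify(boxed) == SK_INTEGER);
    uint32_t low = (uint32_t)boxed;  /* keep the low 32 bits */
    int32_t value;
    memcpy(&value, &low, sizeof value);  /* reinterpret without a lossy cast */
    return value;
}
```

None of the first three categories has the exponent bits set to all ones, so none of them is a NaN when read as a double.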

Am I wrong here, or did I misunderstand what this library does or is meant to do?

RokerHRO, Feb 05 '20 22:02

You are right, the README is not clear about this. Feel free to improve the text in the README!

The other values are not encoded as NaNs. Nevertheless, the unused bits in a NaN are used, but the doubles themselves are shifted, as described on the lines just before the ones you quoted:

 * By adding 7 * 2^48 as a 64-bit integer addition, we shift the first 16 bits
 * in the doubles from the range 0000..FFF8 to the range 0007..FFFF.  Doubles
 * are decoded by reversing this operation, i.e. subtracting the same number.

This means that the representation of a double is shifted from the range 0 to 0xFFF8000000000000 (where the highest value is a NaN with an empty payload) to the range 0x0007000000000000 to 0xFFFF000000000000.
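The shift described here can be sketched in a few lines of C. This is an illustration only, not the library's implementation: the names `DOUBLE_ENCODE_OFFSET`, `sketch_box_double` and `sketch_unbox_double` are made up, and the sketch assumes the input is either a non-NaN double or a NaN whose bit pattern does not exceed 0xFFF8000000000000, since larger patterns would overflow the addition.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* The offset from the header comment: 7 * 2^48. */
#define DOUBLE_ENCODE_OFFSET UINT64_C(0x0007000000000000)

/* Box a double: reinterpret its bits as an integer and add the offset. */
uint64_t sketch_box_double(double d) {
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);  /* well-defined bit reinterpretation */
    return bits + DOUBLE_ENCODE_OFFSET;
}

/* Unbox a double: subtract the same offset and reinterpret the bits. */
double sketch_unbox_double(uint64_t boxed) {
    /* Anything below the offset is a pointer, aux value or integer. */
    assert(boxed >= DOUBLE_ENCODE_OFFSET);
    uint64_t bits = boxed - DOUBLE_ENCODE_OFFSET;
    double d;
    memcpy(&d, &bits, sizeof d);
    return d;
}
```

For example, 3.0 has the bit pattern 0x4008000000000000, so under this sketch it boxes to 0x400F000000000000. Read back as a double without unboxing, that is an ordinary finite number, not a NaN.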

The reason for this is that a pointer can be stored unchanged. The implementation thus favors pointers. It would be more correct to say that we encode doubles and other values as pointers, using unused bits in a pointer.

The above is true for 64-bit platforms. For 32-bit platforms, the doubles are not shifted, so here it is true NaN-boxing.

Do you have a use case where it is preferable to favor doubles and shift pointers to NaN-space instead? If you (or anyone) want to implement another encoding scheme, we can add a macro to control if doubles or pointers should be favored.

zuiderkwast, Feb 06 '20 00:02

I understand "NaN boxing" as a technique for use cases like this:

A file format or transmission link normally contains "double" values, e.g. measurements from a remote sensor or the like. But sometimes it becomes necessary to transmit "exceptional" / "out of band" data over this channel, and these must not interfere with the receivers that process the normal sensor values. So it might be a good idea to "hide" these exceptional data in the NaN space.

Examples of such "out of band" data might be: timestamps, battery status of the sending device, sender ID, or occasionally sent values from a different sensor (e.g. temperature, wind speed etc. for a sensor that regularly only transmits air pressure or the like).

These data are encoded in the lower 32 or 48 bits of the NaN "payload", and the remaining 3 upper payload bits encode the semantic meaning (e.g. "timestamp", "temperature", "battery status" etc.). The data type is thus already clear to the receiver of the "out of band" data and doesn't need to be encoded additionally, although a "default meaning" could be defined that just maps to "generic integer", "generic float", etc.
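A scheme like this could look roughly as follows in C. This is only a sketch under assumptions of mine: the tag values, the `oob_` names and the choice to reserve tag 0 for ordinary NaNs (so that a plain NaN with an empty payload is not mistaken for out-of-band data) are all invented for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define QNAN_BITS UINT64_C(0x7FF8000000000000) /* exponent all ones + quiet bit */
#define DATA_MASK UINT64_C(0x0000FFFFFFFFFFFF) /* lower 48 payload bits */
#define TAG_SHIFT 48                           /* upper 3 payload bits start here */

/* Hypothetical tags; 0 is reserved to mean "not out-of-band data". */
enum oob_tag { OOB_TIMESTAMP = 1, OOB_TEMPERATURE = 2, OOB_BATTERY = 3 };

/* Hide a tag and up to 48 bits of data in the payload of a quiet NaN. */
double oob_box(unsigned tag, uint64_t data) {
    assert(tag >= 1 && tag <= 7);
    assert((data & ~DATA_MASK) == 0);
    uint64_t bits = QNAN_BITS | ((uint64_t)tag << TAG_SHIFT) | data;
    double d;
    memcpy(&d, &bits, sizeof d);
    return d;
}

/* Return the tag, or 0 for ordinary doubles and NaNs without a tag. */
unsigned oob_tag_of(double d) {
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);
    if ((bits & QNAN_BITS) != QNAN_BITS) return 0; /* not a quiet NaN */
    return (unsigned)((bits >> TAG_SHIFT) & 7);
}

/* Return the lower 48 payload bits; meaningful only if the tag is nonzero. */
uint64_t oob_data_of(double d) {
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);
    return bits & DATA_MASK;
}
```

A receiver that only understands sensor values would see such a value as a NaN and could skip it. Note that the payload is only reliable when the value is stored and copied as-is; arithmetic on NaNs is not guaranteed to preserve payload bits.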

RokerHRO, Feb 06 '20 10:02

That's a good use case! Thanks! (My original motivation was that I wanted a way to encode all these data types in a word for implementing dynamically typed languages. I was inspired by how browsers use these techniques to implement JavaScript.)

If the main data type is double and the other values have to be hidden in NaN payload, of course you need another scheme.

As I said before, I'm open to changing the encoding scheme and/or to have it configurable.

zuiderkwast, Feb 06 '20 10:02

Hi, I find this idea great, but from the name "nanbox" I expected std::isnan(nanbox_from_double(3.).as_double) to be true, i.e. the boxed value to actually be a NaN. It's not. When you dump it, it's 0100000000001111000000000000000000000000000000000000000000000000

That's probably also what @RokerHRO wants to raise.

hageboeck, Apr 28 '20 14:04

Please submit a PR.

zuiderkwast, May 05 '20 00:05