[Java] Inconsistent Byte Order in Serialization Breaks Cross-Platform Compatibility
Search before asking
- [x] I had searched in the issues and found no similar issues.
Version
latest commit
Component(s)
Java
Minimal reproduce step
The core of the logic is as follows:
- Platform-dependent constants are defined at class-load time. The code checks the machine's native endianness and sets the shift constants (HI_BYTE_SHIFT, LO_BYTE_SHIFT) accordingly. On a Little-Endian machine, these constants are set up one way, and on a Big-Endian machine, they are set up the opposite way.

      // StringUTF16.java
      static {
        if (ByteOrder.nativeOrder() == ByteOrder.BIG_ENDIAN) {
          HI_BYTE_SHIFT = 8;
          LO_BYTE_SHIFT = 0;
        } else {
          HI_BYTE_SHIFT = 0;
          LO_BYTE_SHIFT = 8;
        }
      }

- These constants are used to serialize multi-byte data. For example, when writing a char (2 bytes):

      // in StringSerializer::offHeapWriteCharsUTF16
      tmpArray[i] = (byte) (c >> HI_BYTE_SHIFT);
      tmpArray[i + 1] = (byte) (c >> LO_BYTE_SHIFT);
Logical Consequence:
- On a Little-Endian machine, this logic assembles bytes in Little-Endian order ([low_byte, high_byte]).
- On a Big-Endian machine, the exact same code assembles bytes in Big-Endian order ([high_byte, low_byte]).
The serialized output is therefore inherently tied to the architecture of the machine that created it.
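Since no Big-Endian machine was available, the effect can be approximated on any host by passing the byte order in as a parameter instead of reading ByteOrder.nativeOrder(). The sketch below is not the project's code: the class NativeOrderCharDemo and the method encodeChar are hypothetical names that only mirror the shift selection quoted above.

```java
import java.nio.ByteOrder;
import java.util.Arrays;

// Hypothetical demo class; it only mirrors the shift logic quoted in this report.
final class NativeOrderCharDemo {
    // Encodes one char using the same shift selection as the static initializer,
    // but with the byte order supplied by the caller instead of the host.
    static byte[] encodeChar(char c, ByteOrder order) {
        int hiByteShift = (order == ByteOrder.BIG_ENDIAN) ? 8 : 0;
        int loByteShift = (order == ByteOrder.BIG_ENDIAN) ? 0 : 8;
        byte[] out = new byte[2];
        out[0] = (byte) (c >> hiByteShift);
        out[1] = (byte) (c >> loByteShift);
        return out;
    }

    public static void main(String[] args) {
        char c = 'A'; // U+0041
        // Simulated Little-Endian host: low byte first.
        System.out.println(Arrays.toString(encodeChar(c, ByteOrder.LITTLE_ENDIAN))); // [65, 0]
        // Simulated Big-Endian host: high byte first.
        System.out.println(Arrays.toString(encodeChar(c, ByteOrder.BIG_ENDIAN)));    // [0, 65]
    }
}
```

The same character yields two different byte sequences, which is the mismatch a reader on the other architecture would hit.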
What did you expect to see?
I expect the serialized byte stream for any given object to be identical, regardless of the host machine's endianness. A serialization framework must enforce a single, canonical byte order to ensure data portability.
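One way to get that behavior, sketched here under the assumption that Little-Endian is chosen as the canonical order, is to set the order explicitly on a java.nio.ByteBuffer rather than deriving it from the host. CanonicalCharWriter and its two methods are illustrative names, not part of the existing code base.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper showing a fixed wire order that is independent of the host.
final class CanonicalCharWriter {
    // Always writes each char as [low_byte, high_byte].
    static byte[] writeCharsLittleEndian(char[] chars) {
        ByteBuffer buf = ByteBuffer.allocate(chars.length * 2).order(ByteOrder.LITTLE_ENDIAN);
        for (char c : chars) {
            buf.putChar(c);
        }
        return buf.array();
    }

    // Reads the chars back using the same fixed order.
    static char[] readCharsLittleEndian(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN);
        char[] out = new char[bytes.length / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = buf.getChar();
        }
        return out;
    }
}
```

Because the order is stated in the code, the output bytes are the same on every platform; the trade-off is that a Big-Endian host can no longer copy the char data verbatim.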
What did you see instead?
Instead, the generated byte stream's endianness is coupled to the host machine's native architecture. Data serialized on a Little-Endian machine is in Little-Endian format, while the same data serialized on a Big-Endian machine is in Big-Endian format. This prevents cross-platform data exchange.
Anything Else?
Please note that this report is based on a logical analysis of the code; it has not been empirically tested on a physical Big-Endian machine. If I have misread anything, please point it out in the comments.
Are you willing to submit a PR?
- [x] I'm willing to submit a PR!
For strings, how about always using little-endian order to reduce code complexity? For other arrays, we could embed a bit in the length to indicate the endianness of the buffer.
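The array idea could look roughly like the following. This is only a sketch of one possible layout, not an agreed format: it assumes a 4-byte header stored in little-endian order whose lowest bit flags the payload order (1 for little-endian, 0 for big-endian) and whose remaining bits hold the element count. EndianTaggedArray, write, and read are hypothetical names.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical layout: header = (elementCount << 1) | flag, flag 1 = little-endian payload.
final class EndianTaggedArray {
    // Writes the header in a fixed order, then the payload in the writer's chosen order.
    static byte[] write(int[] values, ByteOrder payloadOrder) {
        ByteBuffer buf = ByteBuffer.allocate(4 + values.length * 4);
        int flag = (payloadOrder == ByteOrder.LITTLE_ENDIAN) ? 1 : 0;
        buf.order(ByteOrder.LITTLE_ENDIAN).putInt((values.length << 1) | flag);
        buf.order(payloadOrder);
        for (int v : values) {
            buf.putInt(v);
        }
        return buf.array();
    }

    // Reads the header, then decodes the payload in whatever order the flag declares.
    static int[] read(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN);
        int header = buf.getInt();
        ByteOrder payloadOrder = ((header & 1) == 1) ? ByteOrder.LITTLE_ENDIAN : ByteOrder.BIG_ENDIAN;
        int[] out = new int[header >>> 1];
        buf.order(payloadOrder);
        for (int i = 0; i < out.length; i++) {
            out[i] = buf.getInt();
        }
        return out;
    }
}
```

With this layout a writer can keep its native order for bulk copies, and only a reader whose order differs pays for byte swapping. The cost is one bit of the length field, so the maximum element count is halved.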