[Java] Inconsistent Byte Order in Serialization Breaks Cross-Platform Compatibility

Open LouisLou2 opened this issue 4 months ago • 1 comments

Search before asking

  • [x] I had searched in the issues and found no similar issues.

Version

latest commit

Component(s)

Java

Minimal reproduce step

The core of the logic is as follows:

  1. Platform-dependent constants are defined at class-load time. The code checks the machine's native endianness and sets the shift constants (HI_BYTE_SHIFT, LO_BYTE_SHIFT) accordingly. On a Big-Endian machine, HI_BYTE_SHIFT is 8 and LO_BYTE_SHIFT is 0; on a Little-Endian machine, the two values are swapped.

    // StringUTF16.java
    static {
      if (ByteOrder.nativeOrder() == ByteOrder.BIG_ENDIAN) {
        HI_BYTE_SHIFT = 8; LO_BYTE_SHIFT = 0;
      } else {
        HI_BYTE_SHIFT = 0; LO_BYTE_SHIFT = 8;
      }
    }
    
  2. These constants are used to serialize multi-byte data. For example, when writing a char (2 bytes):

    // in StringSerializer::offHeapWriteCharsUTF16
    tmpArray[i]     = (byte) (c >> HI_BYTE_SHIFT);
    tmpArray[i + 1] = (byte) (c >> LO_BYTE_SHIFT);
    

Logical Consequence:

  • On a Little-Endian machine, this logic assembles bytes in Little-Endian order ([low_byte, high_byte]).
  • On a Big-Endian machine, the exact same code assembles bytes in Big-Endian order ([high_byte, low_byte]).

The serialized output is therefore inherently tied to the architecture of the machine that created it.
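
Since I don't have a Big-Endian machine, the consequence can still be illustrated on one host by passing the two shift configurations in explicitly. This is only a sketch: `ShiftOrderDemo` and `writeChar` are hypothetical names that mirror the two-line write above, not code from the repository.

```java
import java.util.Arrays;

class ShiftOrderDemo {
  // Mirrors the shift-based write from the snippet above. The shifts are
  // parameters here (an assumption for illustration) so that both
  // class-load-time configurations can be exercised on a single machine.
  static byte[] writeChar(char c, int hiByteShift, int loByteShift) {
    byte[] out = new byte[2];
    out[0] = (byte) (c >> hiByteShift);
    out[1] = (byte) (c >> loByteShift);
    return out;
  }

  public static void main(String[] args) {
    char c = 0x1234;
    // Constants as they would be set on a Big-Endian host: HI = 8, LO = 0.
    System.out.println(Arrays.toString(writeChar(c, 8, 0))); // [18, 52]
    // Constants as they would be set on a Little-Endian host: HI = 0, LO = 8.
    System.out.println(Arrays.toString(writeChar(c, 0, 8))); // [52, 18]
  }
}
```

The same char yields two different byte sequences depending only on which pair of constants is in effect.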

What did you expect to see?

I expect the serialized byte stream for any given object to be identical, regardless of the host machine's endianness. A serialization framework must enforce a single, canonical byte order to ensure data portability.
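
A minimal sketch of what a canonical order could look like, assuming Little-Endian is picked as the wire order (the choice is arbitrary here, and `CanonicalUtf16Writer` and its methods are hypothetical names, not existing APIs in the project). The shifts are fixed constants, so the output no longer depends on `ByteOrder.nativeOrder()`. The second method does the same thing through `ByteBuffer` with an explicit order, as a cross-check.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class CanonicalUtf16Writer {
  // Always emits each UTF-16 code unit as [low_byte, high_byte],
  // independent of the host's native byte order.
  static byte[] encodeUtf16LE(String s) {
    byte[] out = new byte[s.length() * 2];
    for (int i = 0; i < s.length(); i++) {
      char c = s.charAt(i);
      out[2 * i] = (byte) c;            // low byte first
      out[2 * i + 1] = (byte) (c >> 8); // high byte second
    }
    return out;
  }

  // Same result via ByteBuffer with an explicitly requested byte order.
  static byte[] encodeWithByteBuffer(String s) {
    ByteBuffer buf = ByteBuffer.allocate(s.length() * 2).order(ByteOrder.LITTLE_ENDIAN);
    for (int i = 0; i < s.length(); i++) {
      buf.putChar(s.charAt(i));
    }
    return buf.array();
  }
}
```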

What did you see instead?

Instead, the generated byte stream's endianness is coupled to the host machine's native architecture. Data serialized on a Little-Endian machine is in Little-Endian format, while the same data serialized on a Big-Endian machine is in Big-Endian format. This prevents cross-platform data exchange.

Anything Else?

Please note that this report is based on a logical analysis of the code; it has not been empirically tested on a physical Big-Endian machine. If I have gotten anything wrong, please point it out in the comments.

Are you willing to submit a PR?

  • [x] I'm willing to submit a PR!

LouisLou2, Aug 05 '25 09:08

For strings, how about always using little-endian order to reduce code complexity? For other arrays, we could embed a bit in the length to indicate the endianness of the buffer.

chaokunyang, Aug 06 '25 06:08
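
One possible reading of the "bit embedded in the length" idea for arrays, as a rough sketch only. The header layout is an assumption made for illustration: a 4-byte little-endian int holding `(length << 1) | flag`, where flag 1 means the payload is little-endian and 0 means big-endian. `EndianTaggedArray` is a hypothetical name, and the actual length encoding used by the project may differ.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class EndianTaggedArray {
  // Writes a header of (length << 1) | flag in a fixed (little-endian) order,
  // then the payload in whatever order the writer chose, e.g. its native order.
  // The shift limits the length to 2^30 - 1 elements in this sketch.
  static byte[] write(short[] values, ByteOrder payloadOrder) {
    int flag = payloadOrder == ByteOrder.LITTLE_ENDIAN ? 1 : 0;
    ByteBuffer buf = ByteBuffer.allocate(4 + values.length * 2);
    buf.order(ByteOrder.LITTLE_ENDIAN).putInt((values.length << 1) | flag);
    buf.order(payloadOrder);
    for (short v : values) {
      buf.putShort(v);
    }
    return buf.array();
  }

  // Reads the header, recovers the payload order from the low bit,
  // and decodes the elements accordingly on any host.
  static short[] read(byte[] bytes) {
    ByteBuffer buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN);
    int header = buf.getInt();
    int length = header >>> 1;
    buf.order((header & 1) == 1 ? ByteOrder.LITTLE_ENDIAN : ByteOrder.BIG_ENDIAN);
    short[] out = new short[length];
    for (int i = 0; i < length; i++) {
      out[i] = buf.getShort();
    }
    return out;
  }
}
```

With this shape the writer can keep copying arrays in native order, and only a reader on a machine with the opposite order pays for the byte swap.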