Awkward array should work on big-endian machines
Version of Awkward Array
HEAD
Description and code to reproduce
Opening this issue to continue the discussion from PR https://github.com/scikit-hep/awkward/pull/3629 in a more suitable place.
Awkward Array is entirely broken on big-endian systems, and there is no CI that tests this. As stated in the PR, just creating an array fails:
>>> import awkward
>>> array = awkward.Array([1, 2, 3])
gives
TypeError: size of array (3) is less than size of form (216172782113783808)
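The size in this error is not random: 216172782113783808 equals 3 × 2^56, which is what the 64-bit integer 3 becomes when its eight bytes are read in the opposite order. A minimal NumPy sketch that reproduces the number on any machine (it only illustrates the byte swap and makes no claim about where inside Awkward the misread happens):

```python
import numpy as np

# The length 3 stored as a little-endian int64 ...
little = np.array([3], dtype="<i8")

# ... reinterpreted as big-endian without swapping the bytes.
misread = little.view(">i8")

print(int(misread[0]))  # 216172782113783808
print(3 << 56)          # 216172782113783808
```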
Other operations, such as ak.from_buffers, ak.to_buffers and round-trips between them, also raise errors or give wrong results on big-endian machines.
Related to https://github.com/scikit-hep/awkward/issues/3356, because that is the actual problem behind array construction. Solving https://github.com/scikit-hep/awkward/issues/3356 fixes array creation but doesn't solve the rest of the problems.
I'm pasting below how I got most of the CI working on big-endian, using a QEMU s390x VM. My logic was the following:
- Awkward Array internally operates with the native byte order, so the buffers should probably have the native byte order when a layout is instantiated with them as its data. Otherwise we get errors about unsupported dtypes coming from ak.types.numpytype.
- The buffer in from_buffers should be viewed according to the byteorder argument.
- I switched the default byteorder in to_buffers/from_buffers to the native one. This is not required, but a lot of tests manually create the buffers as NumPy arrays, which get the native byte order, and from_buffers then tries to interpret them with the "wrong" one on big-endian systems. For example, some tests say buffers = {"key": np.array(...)} and the NumPy array is big-endian on a big-endian system; the test then calls ak.from_buffers, where byteorder="<" by default, which is the opposite of the created array. It's the tests that actually need to adapt, but it was easier to change the default argument than to hunt down tests just to make CI happy.
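To illustrate the mismatch in the last bullet with plain NumPy: the bytes below are laid out the way a big-endian machine would store np.array([1, 2, 3]). The helper view_as_native is hypothetical (it is not part of Awkward Array) and only sketches the idea of viewing a buffer according to a byteorder argument and then converting it to the native order.

```python
import numpy as np

def view_as_native(raw: bytes, kind: str, byteorder: str) -> np.ndarray:
    """Hypothetical helper: interpret `raw` with `byteorder`, return a native-order array."""
    declared = np.dtype(kind).newbyteorder(byteorder)
    array = np.frombuffer(raw, dtype=declared)
    # astype converts the values into the machine's native byte order.
    return array.astype(declared.newbyteorder("="))

# Bytes as a big-endian machine would lay out np.array([1, 2, 3]).
raw = np.array([1, 2, 3], dtype=">i8").tobytes()

wrong = view_as_native(raw, "i8", "<")  # the default byteorder="<" assumption
right = view_as_native(raw, "i8", ">")  # the byte order the buffer really has

print(wrong.tolist())  # [72057594037927936, 144115188075855872, 216172782113783808]
print(right.tolist())  # [1, 2, 3]
```

Reading with the wrong byte order produces the same kind of huge values that show up in the failure messages, while reading with the matching one recovers the data.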
diff --git a/src/awkward/forms/form.py b/src/awkward/forms/form.py
index 3e5d6613..0b226b88 100644
--- a/src/awkward/forms/form.py
+++ b/src/awkward/forms/form.py
@@ -373,11 +373,11 @@ def regularize_buffer_key(buffer_key: str | Callable) -> Callable[[Form, str], s
index_to_dtype: Final[dict[str, DType]] = {
- "i8": np.dtype("<i1"),
- "u8": np.dtype("<u1"),
- "i32": np.dtype("<i4"),
- "u32": np.dtype("<u4"),
- "i64": np.dtype("<i8"),
+ "i8": np.dtype("i1"),
+ "u8": np.dtype("u1"),
+ "i32": np.dtype("i4"),
+ "u32": np.dtype("u4"),
+ "i64": np.dtype("i8"),
}
diff --git a/src/awkward/operations/ak_from_buffers.py b/src/awkward/operations/ak_from_buffers.py
index 030c4954..2c93fc49 100644
--- a/src/awkward/operations/ak_from_buffers.py
+++ b/src/awkward/operations/ak_from_buffers.py
@@ -34,7 +34,7 @@ def from_buffers(
buffer_key="{form_key}-{attribute}",
*,
backend="cpu",
- byteorder="<",
+ byteorder=ak._util.native_byteorder,
allow_noncanonical_form=False,
highlevel=True,
behavior=None,
@@ -233,7 +233,10 @@ def _from_buffer(
elif isinstance(buffer, PlaceholderArray) or nplike.is_own_array(buffer):
# Require 1D buffers
copy = None if isinstance(nplike, Jax) else False # Jax can not avoid this
- array = nplike.reshape(buffer.view(dtype), shape=(-1,), copy=copy)
+ array = ak._util.native_to_byteorder(buffer, byteorder).view(
+ dtype.newbyteorder("=")
+ )
+ array = nplike.reshape(array, shape=(-1,), copy=copy)
# we can't compare with count or slice when we're working with tracers
if not (isinstance(nplike, Jax) and nplike.is_currently_tracing()):
@@ -246,7 +249,9 @@ def _from_buffer(
return array
else:
array = nplike.frombuffer(buffer, dtype=dtype, count=count)
- return ak._util.native_to_byteorder(array, byteorder)
+ return ak._util.native_to_byteorder(array, byteorder).view(
+ dtype.newbyteorder("=")
+ )
def _reconstitute(
diff --git a/src/awkward/operations/ak_to_buffers.py b/src/awkward/operations/ak_to_buffers.py
index 38853728..de9401f1 100644
--- a/src/awkward/operations/ak_to_buffers.py
+++ b/src/awkward/operations/ak_to_buffers.py
@@ -21,7 +21,7 @@ def to_buffers(
*,
id_start=0,
backend=None,
- byteorder="<",
+ byteorder=ak._util.native_byteorder,
):
"""
Args:
I'm still getting the following failures, but most of the tests pass.
FAILED tests/test_0404_array_validity_check.py::test_subranges_equal - assert False is True
FAILED tests/test_1345_avro_reader.py::test_bytes - ValueError: buffer source array is read-only
FAILED tests/test_1345_avro_reader.py::test_string - ValueError: buffer source array is read-only
FAILED tests/test_1345_avro_reader.py::test_fixed - ValueError: buffer source array is read-only
FAILED tests/test_1345_avro_reader.py::test_arrays_int - ValueError: buffer source array is read-only
FAILED tests/test_1345_avro_reader.py::test_array_string - ValueError: buffer source array is read-only
FAILED tests/test_1345_avro_reader.py::test_Unions_string_null - ValueError: buffer source array is read-only
FAILED tests/test_1345_avro_reader.py::test_Unions_null_X_Y - ValueError: buffer source array is read-only
FAILED tests/test_2067_to_buffers_byteorder.py::test_byteorder_default - AssertionError: assert b'\x00\x00\x0...0\x00\x00\x00' == b'\x00\x00\x0...0\x00\x00\x04'
FAILED tests/test_2198_almost_equal.py::test_dtype - TypeError: unsupported dtype: dtype('<M8[D]'). Must be one of
FAILED tests/test_2305_nep_18_lazy_conversion.py::test_binary - TypeError: unsupported dtype: dtype('<u4'). Must be one of
FAILED tests/test_2424_almost_equal_union_record.py::test_records_almost_equal - TypeError: unsupported dtype: dtype('<M8[s]'). Must be one of
FAILED tests/test_2424_almost_equal_union_record.py::test_unions_almost_equal - TypeError: unsupported dtype: dtype('<M8[s]'). Must be one of
FAILED tests/test_2604_read_awkward1_pickles.py::test_numpyarray - assert [720575940379...2782113783808] == [1, 2, 3]
FAILED tests/test_2604_read_awkward1_pickles.py::test_partitioned_numpyarray - assert [720575940379...5564227567616] == [1, 2, 3, 4, 5, 6]
FAILED tests/test_2604_read_awkward1_pickles.py::test_listoffsetarray - TypeError: size of array (5) is less than size of form (360287970189639680)
FAILED tests/test_2604_read_awkward1_pickles.py::test_regulararray - AssertionError: assert [[[0, 7205759...27099910144]]] == [[[0, 1, 2, 3... 27, 28, 29]]]
FAILED tests/test_2604_read_awkward1_pickles.py::test_strings - TypeError: size of array (11) is less than size of form (792633534417207296)
FAILED tests/test_2604_read_awkward1_pickles.py::test_indexedoptionarray - TypeError: size of array (3) is less than size of form (144115188075855873)
FAILED tests/test_2604_read_awkward1_pickles.py::test_bytemaskedarray - assert [720575940379...7970189639680] == [1, 2, None, None, 5]
FAILED tests/test_2604_read_awkward1_pickles.py::test_unmaskedarray - assert [720575940379...2782113783808] == [1, 2, 3]
FAILED tests/test_2604_read_awkward1_pickles.py::test_recordarray - AssertionError: assert [{'x': 720575...6574978e-180}] == [{'x': 1, 'y'... 2, 'y': 2.2}]
FAILED tests/test_2604_read_awkward1_pickles.py::test_recordarray_tuple - assert [(72057594037...6574978e-180)] == [(1, 1.1), (2, 2.2)]
FAILED tests/test_2604_read_awkward1_pickles.py::test_unionarray - TypeError: size of array (2) is less than size of form (72057594037927937)
FAILED tests/test_2665_out_of_band_pickle.py::test_protocol_5 - AssertionError: assert False
FAILED tests/test_2682_custom_pickler.py::test_non_packing_pickler - TypeError: size of array (12) is less than size of form (648518346341351424)
FAILED tests/test_2857_full_like_scalar.py::test - TypeError: unsupported dtype: dtype('<M8[s]'). Must be one of
FAILED tests/test_2857_full_like_scalar.py::test_typetracer - TypeError: unsupported dtype: dtype('<M8[s]'). Must be one of
FAILED tests/test_3059_boolean_kernels.py::test_bool_subranges_equal - assert False is True
FAILED tests/test_3209_awkwardforth_read_negative_number_of_items.py::test_read_negative_and_positive_number_of_items - assert [0, 0, 0, 0, 0] == [1, 2, 3, 4, 5]
FAILED tests/test_3209_awkwardforth_read_negative_number_of_items.py::test_read_positive_and_negative_number_of_items - assert [0, 0, 0, 0, 0] == [1, 2, 3, 4, 5]
================================================ 31 failed, 2765 passed, 157 skipped in 215.25s (0:03:35) =================================================
@ikrommyd - thanks for opening an issue!
I guess the awkward1 pickle-reading failures above are expected, because the pickle files were created on a little-endian machine? Maybe my assumption is wrong here, though. Is it the same for the Avro files? If so, there are fewer actual test failures than it appears.
Regarding this comment on the PR: https://github.com/scikit-hep/awkward/pull/3629#issuecomment-3211847015,
I don't fully understand it, to be honest, and I'd like to ask for a deeper explanation. Does it apply when writing to something or when reading? When creating arrays, you can't know what kind of buffers people pass in. I mean, there is no metadata to tell you whether an array is big- or little-endian. All you can do is byteswap. I think this is what @agoose77 meant when he said that ak.from_buffers should have a byteorder argument. You can tell people that all the buffers should be little-endian, but
- that will currently error on layout creation, because the whole Awkward code base expects the input dtypes to be system-native.
- even if you change that requirement and say that all input buffers should be little-endian, that stops being true as soon as intermediate arrays are created: NumPy casts results to the system-native byte order immediately.
In [1]: import numpy as np
In [2]: x = np.array([1,2,3], dtype=">i8")
In [3]: y = np.array([2,3,4], dtype=">i8")
In [4]: x.dtype, y.dtype
Out[4]: (dtype('>i8'), dtype('>i8'))
In [5]: x + y
Out[5]: array([3, 5, 7])
In [6]: (x + y).dtype
Out[6]: dtype('int64')
In [7]: (x + y).dtype.byteorder
Out[7]: '='
When writing out, you can of course say that awkward should write out little-endian.
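A small sketch of that write-side rule, using plain NumPy rather than Awkward's own code: whatever byte order an array has in memory, the serialized bytes can be forced to little-endian. The function name is hypothetical.

```python
import numpy as np

def to_little_endian_bytes(array: np.ndarray) -> bytes:
    # Hypothetical helper, not part of Awkward Array.
    # astype converts the values; tobytes then dumps them in "<" order.
    return array.astype(array.dtype.newbyteorder("<")).tobytes()

native = np.array([1, 2, 3], dtype="i8")  # whatever this machine uses
big = np.array([1, 2, 3], dtype=">i8")    # explicitly big-endian

# Both serialize to the same little-endian bytes.
same = to_little_endian_bytes(native) == to_little_endian_bytes(big)
first = to_little_endian_bytes(big)[:8]
print(same)   # True
print(first)  # b'\x01\x00\x00\x00\x00\x00\x00\x00'
```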
The idea is that we treat the default arguments to from_buffers/to_buffers as designing the I/O for little-endianness. from_buffers can take two kinds of buffer: typed, with endianness info, or untyped. Right now we convert the endianness, if I recall. We could instead throw an error when the endianness doesn't match the argument.
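A rough sketch of what such a check could look like, assuming typed buffers are NumPy arrays and the byteorder argument is either "<" or ">". Both function names are hypothetical and this is not Awkward's current behavior.

```python
import sys
import numpy as np

def resolve_byteorder(dtype: np.dtype) -> str:
    """Return '<', '>' or '|' (not applicable) for a NumPy dtype."""
    order = dtype.byteorder
    if order == "=":  # native: translate to an explicit symbol
        return "<" if sys.byteorder == "little" else ">"
    return order

def check_byteorder(buffer, byteorder: str) -> None:
    """Hypothetical check: raise if a typed buffer contradicts `byteorder`."""
    # Untyped buffers (bytes, memoryview, ...) carry no endianness info.
    if not isinstance(buffer, np.ndarray):
        return
    actual = resolve_byteorder(buffer.dtype)
    # Single-byte dtypes report '|' and can never mismatch.
    if actual != "|" and actual != byteorder:
        raise ValueError(
            f"buffer has byte order {actual!r} but byteorder={byteorder!r} was requested"
        )

check_byteorder(np.array([1, 2, 3], dtype="<i8"), "<")  # matches, no error
check_byteorder(b"\x01\x00\x00\x00", "<")               # untyped, nothing to check
```

With this approach a typed buffer is never silently reinterpreted; only untyped buffers rely on the byteorder argument.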