Wrong array write when writing a chunked array from numpy string data with "F" memory order
Zarr version
v3.1.3
Numcodecs version
v0.16.3
Python Version
3.12.7, 3.13.9
Operating System
Windows
Installation
pip into virtual environment
Description
I have a pandas DataFrame storing string data with the string[pyarrow] datatype.
Converting this data to a numpy array with either "df.values" or "df.to_numpy()" produces a numpy "object" array (dtype == "O").
When storing this array in a chunked zarr array, there seems to be a strange "off-by-one" error either when writing or when reading: the array read back from zarr differs from the original.
When the DataFrame stores plain (Python) strings, this does not happen, even though the resulting numpy array looks exactly the same.
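For reference, a minimal sketch of the conversion (assumes pandas with pyarrow installed):
import pandas as pd
df = pd.DataFrame({"a": ["x", "y"]}).astype("string[pyarrow]")
print(df.to_numpy().dtype)  # object: the pyarrow strings come back as Python objects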
Steps to reproduce
# /// script
# requires-python = ">=3.12"
# dependencies = [
# "zarr@git+https://github.com/zarr-developers/zarr-python.git@main",
# "pandas",
# "pyarrow",
# ]
# ///
#
# This script automatically imports the development branch of zarr to check for issues
import zarr
import numpy as np
import pandas as pd
# Number of rows and columns
num_rows = 11
num_cols = 11
# Generate incrementing strings
incrementing_strings = [f"S{i:05}" for i in range(1, num_rows * num_cols + 1)]
npdata = np.array(incrementing_strings, dtype="str").reshape(num_rows, num_cols)
df = pd.DataFrame(npdata)
dfpa = df.astype("string[pyarrow]")
data = df.values
datapa = dfpa.values
# According to numpy, the converted data is identical:
assert data.dtype == datapa.dtype # dtype == 'O'
assert data.shape == datapa.shape
assert data.size == datapa.size
assert np.all(data == datapa)
# Now store both in a zarr array (reusing the same array here, but it doesn't matter if I use a fresh one)
store = zarr.storage.MemoryStore() # Or LocalStore, doesn't matter
array = zarr.create_array(
    store,
    name="s",
    shape=data.shape,
    chunks=(10, 10),
    fill_value="<NA>",
    dtype=str,
)
# This works as expected
array[:,:] = data
zdata = array[:,:]
assert np.all(zdata == data)
# This doesn't
array[:,:] = datapa
zdata = array[:,:]
print(f"written={datapa[0,:4]}") # ['S00001' 'S00002' 'S00003' 'S00004']
print(f"read={zdata[0,:4]}") # ['S00001' 'S00012' 'S00023' 'S00034']
assert np.all(zdata == datapa) # Fails
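Note the pattern in the read-back row: the values step by num_cols (S00001, S00012, S00023, ...), which is exactly what flattening the data in column-major (Fortran) memory order would produce. A small illustration of that effect (my guess at the mechanism, not a confirmed diagnosis):
import numpy as np
a = np.arange(1, 10).reshape(3, 3)  # row-major values 1..9
f = np.asarray(a, order="F")        # same values, Fortran memory layout
print(f.ravel(order="K"))           # [1 4 7 2 5 8 3 6 9] -- steps by num_cols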
Additional output
No response
The problem seems to be on the write side, because the generated chunk files differ when using LocalStore (the "0" chunk has 256 bytes in the pyarrow version versus 200 bytes with the non-pyarrow strings).
For what it's worth: the memory layout of the two numpy arrays is different:
*** data strides and flags: (88, 8)
  C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
*** datapa strides and flags: (8, 88)
  C_CONTIGUOUS : False
  F_CONTIGUOUS : True
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
If I use datapa = np.asarray(dfpa.values, order="C") the problem disappears, so this definitely looks like a handling problem with the "F" memory layout.
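Until this is fixed, a possible workaround (a sketch reusing the names from the reproducer above) is to force a C-contiguous copy before writing:
import numpy as np
# np.ascontiguousarray copies only if the input is not already C-ordered
array[:, :] = np.ascontiguousarray(datapa)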
I modified the example to remove pandas and pyarrow since they are not relevant any more:
# /// script
# requires-python = ">=3.12"
# dependencies = [
# "zarr@git+https://github.com/zarr-developers/zarr-python.git@main",
# ]
# ///
#
# This script automatically imports the development branch of zarr to check for issues
import zarr
import numpy as np
ndata = np.arange(9).reshape(3,3)
data = np.asarray(ndata, order="F") # does work with "C"
store = zarr.storage.MemoryStore() # Or LocalStore, doesn't matter
chunk_size = 2 # Fails with 2 or 3, works with 1 or with 4 and greater
array = zarr.create_array(
    store,
    name="s",
    shape=data.shape,
    chunks=(chunk_size, chunk_size),
    dtype="str",  # Does work with "int" or "float", does not work with "str"
)
array[:,:] = data
print(f"written={data}")
print(f"read={array[:,:]}")
assert np.all(array[:,:].astype(int) == data.astype(int)) # Fails
Thanks for this example @dtonagel! I'm guessing the culprit is some assumption in the vlen-utf8 codec we are using here.
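For illustration, a minimal sketch of the suspected failure mode (an assumption about the codec's behavior, not the actual vlen-utf8 implementation): if the encoder walks the chunk's raw buffer in memory order while the decoder rebuilds the chunk in C order, an F-ordered chunk round-trips transposed.
import numpy as np
chunk = np.asarray(np.arange(4).reshape(2, 2), order="F")
encoded = list(chunk.ravel(order="K"))            # buffer order: [0, 2, 1, 3]
decoded = np.array(encoded).reshape(chunk.shape)  # rebuilt assuming C order
assert not np.array_equal(decoded, chunk)         # [[0, 2], [1, 3]] != [[0, 1], [2, 3]]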