
Wrong array write when writing chunked array from numpy string data with "F" order

dtonagel opened this issue 2 months ago • 5 comments

Zarr version

v3.1.3

Numcodecs version

v0.16.3

Python Version

3.12.7, 3.13.9

Operating System

Windows

Installation

pip into virtual environment

Description

I have a pandas DataFrame storing string data with string[pyarrow] datatype.

When converting this data to a numpy array with either "df.values" or "df.to_numpy()", the result is a numpy "object" array (dtype == "O").

When storing this array in a chunked zarr array, there seems to be a strange "off-by-one" error either on write or on read: the array read back from zarr differs from the original array.

When the DataFrame stores normal (Python) strings, this does not happen, even though the resulting numpy array looks exactly the same.

Steps to reproduce

# /// script
# requires-python = ">=3.12"
# dependencies = [
#   "zarr@git+https://github.com/zarr-developers/zarr-python.git@main",
#   "pandas",
#   "pyarrow",
# ]
# ///
#
# This script automatically imports the development branch of zarr to check for issues

import zarr
# your reproducer code
import numpy as np
import pandas as pd

# Number of rows and columns
num_rows = 11
num_cols = 11

# Generate incrementing strings
incrementing_strings = [f"S{i:05}" for i in range(1, num_rows * num_cols + 1)]

npdata = np.array(incrementing_strings, dtype="str").reshape(num_rows, num_cols)

df = pd.DataFrame(npdata)
dfpa = df.astype("string[pyarrow]")

data = df.values
datapa = dfpa.values

# According to numpy, the converted data is identical:
assert data.dtype == datapa.dtype  # dtype == 'O'
assert data.shape == datapa.shape
assert data.size == datapa.size
assert np.all(data == datapa)

# Now store both in a zarr array (using the same array here, but it doesn't matter if I use a fresh one)
store = zarr.storage.MemoryStore()  # Or LocalStore, doesn't matter

array = zarr.create_array(
                store,
                name="s",
                shape=data.shape,
                chunks=(10,10),
                fill_value="<NA>",
                dtype=str,
            )

# This works as expected
array[:,:] = data
zdata = array[:,:]
assert np.all(zdata == data)

# This doesn't
array[:,:] = datapa
zdata = array[:,:]
print(f"written={datapa[0,:4]}")  # ['S00001' 'S00002' 'S00003' 'S00004']
print(f"read={zdata[0,:4]}")        # ['S00001' 'S00012' 'S00023' 'S00034']
assert np.all(zdata == datapa)  # Fails

Additional output

No response

dtonagel avatar Oct 28 '25 14:10 dtonagel
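[Editor's note: the read-back pattern above (S00001, S00012, S00023, S00034) is exactly what you get by interpreting an F-order buffer as if it were C-ordered. A minimal numpy-only sketch, not part of the original report:]

```python
import numpy as np

# 11x11 array of 1..121, filled row-major, then copied into an F-order buffer.
values = np.arange(1, 11 * 11 + 1).reshape(11, 11)
f_order = np.asarray(values, order="F")  # same logical values, column-major memory

# Walking the raw buffer in memory order yields the first *column* of `values`,
# matching the corrupted read-back: 1, 12, 23, 34, ...
memory_order = f_order.ravel(order="K")
print(memory_order[:4])  # first column of `values`: 1, 12, 23, 34
```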

The problem seems to be on the write side because the generated chunk files are different when using LocalStore ("0" chunk has 256 bytes in the pa version versus 200 with the non-pa-strings)

dtonagel avatar Oct 28 '25 15:10 dtonagel

For what it's worth: the memory layout of the two numpy arrays is different:

*** data strides and flags: (88, 8)
    C_CONTIGUOUS : True
    F_CONTIGUOUS : False
    OWNDATA : False
    WRITEABLE : True
    ALIGNED : True
    WRITEBACKIFCOPY : False

*** datapa strides and flags: (8, 88)
    C_CONTIGUOUS : False
    F_CONTIGUOUS : True
    OWNDATA : False
    WRITEABLE : True
    ALIGNED : True
    WRITEBACKIFCOPY : False

dtonagel avatar Oct 28 '25 15:10 dtonagel
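[Editor's note: the strides above can be reproduced in plain numpy. A sketch, assuming a 64-bit build where each object pointer occupies 8 bytes:]

```python
import numpy as np

c_arr = np.empty((11, 11), dtype=object)  # C-contiguous by default
f_arr = np.asfortranarray(c_arr)          # same shape, F-contiguous copy

# For an 11-column object array, the row stride is 11 * 8 = 88 bytes in C
# order; F order simply swaps the two strides.
print(c_arr.strides, c_arr.flags["C_CONTIGUOUS"])  # (88, 8) True
print(f_arr.strides, f_arr.flags["F_CONTIGUOUS"])  # (8, 88) True
```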

If I use datapa = np.asarray(dfpa.values, order="C") the problem disappears, so it definitely looks like a problem handling the "F" memory layout.

dtonagel avatar Oct 28 '25 16:10 dtonagel
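[Editor's note: the workaround above can also be written with np.ascontiguousarray, which is equivalent to np.asarray(..., order="C") for this purpose. A sketch with stand-in data, not the reporter's DataFrame:]

```python
import numpy as np

# An F-ordered object array standing in for dfpa.values:
f_data = np.asfortranarray(np.array([["a", "b"], ["c", "d"]], dtype=object))

# Force a C-contiguous copy before handing the data to zarr:
c_data = np.ascontiguousarray(f_data)

assert c_data.flags["C_CONTIGUOUS"]        # safe memory layout for the write
assert np.array_equal(c_data, f_data)      # values are unchanged
```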

I modified the example to remove pandas and pyarrow since they are no longer relevant:

# /// script
# requires-python = ">=3.12"
# dependencies = [
#   "zarr@git+https://github.com/zarr-developers/zarr-python.git@main",
# ]
# ///
#
# This script automatically imports the development branch of zarr to check for issues
import zarr
import numpy as np

ndata = np.arange(9).reshape(3,3)
data = np.asarray(ndata, order="F")  # does work with "C"

store = zarr.storage.MemoryStore()  # Or LocalStore, doesn't matter

chunk_size = 2 # Doesn't work with 2 or 3 but works for 1 or 4 and greater
array = zarr.create_array(
                store,
                name="s",
                shape=data.shape,
                chunks=(chunk_size,chunk_size),
                dtype="str",  # Does work with "int" or "float", does not work with "str"
            )

array[:,:] = data
print(f"written={data}")
print(f"read={array[:,:]}")

assert np.all(array[:,:].astype(int) == data.astype(int))  # Fails

dtonagel avatar Oct 28 '25 16:10 dtonagel

Thanks for this example @dtonagel! I'm guessing the culprit is some assumption in the vlen-utf8 codec we are using here.

d-v-b avatar Oct 28 '25 16:10 d-v-b
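[Editor's note: if that guess is right, the failure mode can be mimicked without zarr at all. A hypothetical illustration: a codec that walks the raw buffer in memory order, rather than logical C order, would serialize these two logically equal arrays differently.]

```python
import numpy as np

# Two logically equal variable-length string arrays, different memory layouts:
c = np.array([["a", "bb"], ["ccc", "dddd"]], dtype=object)
f = np.asarray(c, order="F")
assert np.array_equal(c, f)

# A codec iterating in logical (C) order sees the same sequence for both,
# but one iterating in raw memory order sees a transposed sequence for `f`:
assert list(c.ravel(order="K")) == ["a", "bb", "ccc", "dddd"]
assert list(f.ravel(order="K")) == ["a", "ccc", "bb", "dddd"]
```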