
Reading and writing issues on Mac with external disk

LouisK92 opened this issue 2 months ago · 0 comments

When writing or copying an sdata.zarr to an external disk (ExFAT-formatted) on macOS, hidden files starting with ._* (so-called AppleDouble sidecar files) are generated, and these cause an error when reading the zarr back.

Write an example zarr:

import numpy as np
import pandas as pd
import spatialdata as sd

data_dir = "/Volumes/path/on/external/drive/"

df = pd.DataFrame({"gene": ["A"]*50 + ["B"]*50, "x": np.arange(100), "y": np.arange(100)})

sdata = sd.SpatialData(points={"points": sd.models.PointsModel.parse(df)}) 

sdata.write(data_dir+"sdata_tmp.zarr", overwrite=True)

When writing directly to the external disk, an error is raised, but the write itself actually succeeds. Error log:

....
The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In[76], line 11
      7 df = pd.DataFrame({"gene": ["A"]*50 + ["B"]*50, "x": np.arange(100), "y": np.arange(100)})
      9 sdata = sd.SpatialData(points={"points": sd.models.PointsModel.parse(df)}) #, feature_key="gene", chunksize=5000)})
---> 11 sdata.write(data_dir+"sdata_tmp.zarr", overwrite=True)
...
   1446     return self.map[key]
   1447 except self.exceptions as e:
-> 1448     raise KeyError(key) from e

KeyError: '/_/zattrs'
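Despite the KeyError, the data does end up on disk. A quick way to confirm this (a minimal sketch using plain zarr rather than a spatialdata API) is to open the store read-only and list its groups:

import zarr

# Open the store that was just written, read-only, and list the top-level elements.
root = zarr.open_group(data_dir + "sdata_tmp.zarr", mode="r")
print(list(root.group_keys()))  # expected to contain "points"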

Alternatively, instead of writing directly to the external disk, the data can be written locally and then copied there; this leads to the same issue when reading.
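As a side note, copying with Python's shutil instead of Finder/cp may avoid the sidecar files altogether, since shutil copies only file data (it does not go through macOS's metadata-preserving copy routine) and any ._* entries already present in the source can be skipped. A hedged sketch, with hypothetical source/destination paths:

import shutil

src = "sdata_tmp.zarr"                                  # local copy of the store
dst = "/Volumes/path/on/external/drive/sdata_tmp.zarr"  # destination on the ExFAT drive

# Copy the store, skipping any AppleDouble sidecar files present in the source.
shutil.copytree(src, dst, ignore=shutil.ignore_patterns("._*"), dirs_exist_ok=True)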

Reading the data

sdata = sd.read_zarr(data_dir+"sdata_tmp.zarr")

fails with the following error log:

ArrowInvalid                              Traceback (most recent call last)
File ~/miniconda3/envs/g3/lib/python3.11/site-packages/dask/backends.py:140, in CreationDispatch.register_inplace.<locals>.decorator.<locals>.wrapper(*args, **kwargs)
    139 try:
--> 140     return func(*args, **kwargs)
    141 except Exception as e:

File ~/miniconda3/envs/g3/lib/python3.11/site-packages/dask/dataframe/io/parquet/core.py:531, in read_parquet(path, columns, filters, categories, index, storage_options, engine, use_nullable_dtypes, dtype_backend, calculate_divisions, ignore_metadata_file, metadata_task_size, split_row_groups, blocksize, aggregate_files, parquet_file_extension, filesystem, **kwargs)
    529     blocksize = None
--> 531 read_metadata_result = engine.read_metadata(
    532     fs,
    533     paths,
    534     categories=categories,
    535     index=index,
    536     use_nullable_dtypes=use_nullable_dtypes,
    537     dtype_backend=dtype_backend,
    538     gather_statistics=calculate_divisions,
    539     filters=filters,
    540     split_row_groups=split_row_groups,
    541     blocksize=blocksize,
    542     aggregate_files=aggregate_files,
    543     ignore_metadata_file=ignore_metadata_file,
    544     metadata_task_size=metadata_task_size,
    545     parquet_file_extension=parquet_file_extension,
    546     dataset=dataset_options,
    547     read=read_options,
    548     **other_options,
    549 )
    551 # In the future, we may want to give the engine the
    552 # option to return a dedicated element for `common_kwargs`.
    553 # However, to avoid breaking the API, we just embed this
    554 # data in the first element of `parts` for now.
    555 # The logic below is intended to handle backward and forward
    556 # compatibility with a user-defined engine.

File ~/miniconda3/envs/g3/lib/python3.11/site-packages/dask/dataframe/io/parquet/arrow.py:546, in ArrowDatasetEngine.read_metadata(cls, fs, paths, categories, index, use_nullable_dtypes, dtype_backend, gather_statistics, filters, split_row_groups, blocksize, aggregate_files, ignore_metadata_file, metadata_task_size, parquet_file_extension, **kwargs)
    545 # Stage 1: Collect general dataset information
--> 546 dataset_info = cls._collect_dataset_info(
    547     paths,
    548     fs,
    549     categories,
    550     index,
    551     gather_statistics,
    552     filters,
    553     split_row_groups,
    554     blocksize,
    555     aggregate_files,
    556     ignore_metadata_file,
    557     metadata_task_size,
    558     parquet_file_extension,
    559     kwargs,
    560 )
    562 # Stage 2: Generate output `meta`

File ~/miniconda3/envs/g3/lib/python3.11/site-packages/dask/dataframe/io/parquet/arrow.py:1061, in ArrowDatasetEngine._collect_dataset_info(cls, paths, fs, categories, index, gather_statistics, filters, split_row_groups, blocksize, aggregate_files, ignore_metadata_file, metadata_task_size, parquet_file_extension, kwargs)
   1060 if ds is None:
-> 1061     ds = pa_ds.dataset(
   1062         paths,
   1063         filesystem=_wrapped_fs(fs),
   1064         **_processed_dataset_kwargs,
   1065     )
   1067 # Get file_frag sample and extract physical_schema

File ~/miniconda3/envs/g3/lib/python3.11/site-packages/pyarrow/dataset.py:797, in dataset(source, schema, format, filesystem, partitioning, partition_base_dir, exclude_invalid_files, ignore_prefixes)
    796 if all(_is_path_like(elem) or isinstance(elem, FileInfo) for elem in source):
...
    150 else:
--> 151     raise exc from e

ArrowInvalid: An error occurred while calling the read_parquet method registered to the pandas backend.
Original Message: Error creating dataset. Could not read schema from '/Volumes/Sandisk2TB/G3_temp/data/sdata_tmp.zarr/points/points/points.parquet/._part.0.parquet'. Is this a 'parquet' file?: Could not open Parquet input source '/Volumes/Sandisk2TB/G3_temp/data/sdata_tmp.zarr/points/points/points.parquet/._part.0.parquet': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.
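The file that pyarrow fails on (._part.0.parquet) is one of the AppleDouble sidecars. A minimal check, using the example path from above, lists all of them inside the store:

from pathlib import Path

store = Path(data_dir + "sdata_tmp.zarr")

# List every AppleDouble sidecar file that macOS created inside the store.
sidecars = sorted(p for p in store.rglob("._*") if p.is_file())
print(f"{len(sidecars)} sidecar files found")
for p in sidecars:
    print(p)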

Solution

The zarr can be loaded after removing the ._* files manually:

find "/Volumes/path/on/external/drive/sdata_tmp.zarr" -name '._*' -type f -delete

LouisK92 · Nov 4, 2025, 09:11