Reading and writing issues on Mac with external disk
When writing or copying an sdata.zarr to an external disk (ExFAT format) on a Mac, hidden files starting with ._* (so-called AppleDouble sidecar files) are generated, and these lead to an error when the zarr is read back.
Write an example zarr:
import numpy as np
import pandas as pd
import spatialdata as sd
# path on the external ExFAT drive
data_dir = "/Volumes/path/on/external/drive/"
df = pd.DataFrame({"gene": ["A"] * 50 + ["B"] * 50, "x": np.arange(100), "y": np.arange(100)})
sdata = sd.SpatialData(points={"points": sd.models.PointsModel.parse(df)})
sdata.write(data_dir + "sdata_tmp.zarr", overwrite=True)
When writing directly to the external disk an error is raised, but the write itself actually succeeds. Error log:
....
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
Cell In[76], line 11
7 df = pd.DataFrame({"gene": ["A"]*50 + ["B"]*50, "x": np.arange(100), "y": np.arange(100)})
9 sdata = sd.SpatialData(points={"points": sd.models.PointsModel.parse(df)}) #, feature_key="gene", chunksize=5000)})
---> 11 sdata.write(data_dir+"sdata_tmp.zarr", overwrite=True)
...
1446 return self.map[key]
1447 except self.exceptions as e:
-> 1448 raise KeyError(key) from e
KeyError: '/_/zattrs'
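Despite the exception, the store does get created on the drive. A quick sanity check from the same session (a minimal sketch, reusing data_dir and the store name from the snippet above):
import os
store = data_dir + "sdata_tmp.zarr"
# the store directory exists on the external drive, i.e. the write went through
print(os.path.isdir(store))
print(sorted(os.listdir(store)))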
Instead of writing directly to the external disk, the data can also be written locally and then copied there, which leads to the same issue when reading.
Reading the data
sdata = sd.read_zarr(data_dir+"sdata_tmp.zarr")
leads to the following error log:
ArrowInvalid Traceback (most recent call last)
File ~/miniconda3/envs/g3/lib/python3.11/site-packages/dask/backends.py:140, in CreationDispatch.register_inplace.<locals>.decorator.<locals>.wrapper(*args, **kwargs)
139 try:
--> 140 return func(*args, **kwargs)
141 except Exception as e:
File ~/miniconda3/envs/g3/lib/python3.11/site-packages/dask/dataframe/io/parquet/core.py:531, in read_parquet(path, columns, filters, categories, index, storage_options, engine, use_nullable_dtypes, dtype_backend, calculate_divisions, ignore_metadata_file, metadata_task_size, split_row_groups, blocksize, aggregate_files, parquet_file_extension, filesystem, **kwargs)
529 blocksize = None
--> 531 read_metadata_result = engine.read_metadata(
532 fs,
533 paths,
534 categories=categories,
535 index=index,
536 use_nullable_dtypes=use_nullable_dtypes,
537 dtype_backend=dtype_backend,
538 gather_statistics=calculate_divisions,
539 filters=filters,
540 split_row_groups=split_row_groups,
541 blocksize=blocksize,
542 aggregate_files=aggregate_files,
543 ignore_metadata_file=ignore_metadata_file,
544 metadata_task_size=metadata_task_size,
545 parquet_file_extension=parquet_file_extension,
546 dataset=dataset_options,
547 read=read_options,
548 **other_options,
549 )
551 # In the future, we may want to give the engine the
552 # option to return a dedicated element for `common_kwargs`.
553 # However, to avoid breaking the API, we just embed this
554 # data in the first element of `parts` for now.
555 # The logic below is intended to handle backward and forward
556 # compatibility with a user-defined engine.
File ~/miniconda3/envs/g3/lib/python3.11/site-packages/dask/dataframe/io/parquet/arrow.py:546, in ArrowDatasetEngine.read_metadata(cls, fs, paths, categories, index, use_nullable_dtypes, dtype_backend, gather_statistics, filters, split_row_groups, blocksize, aggregate_files, ignore_metadata_file, metadata_task_size, parquet_file_extension, **kwargs)
545 # Stage 1: Collect general dataset information
--> 546 dataset_info = cls._collect_dataset_info(
547 paths,
548 fs,
549 categories,
550 index,
551 gather_statistics,
552 filters,
553 split_row_groups,
554 blocksize,
555 aggregate_files,
556 ignore_metadata_file,
557 metadata_task_size,
558 parquet_file_extension,
559 kwargs,
560 )
562 # Stage 2: Generate output `meta`
File ~/miniconda3/envs/g3/lib/python3.11/site-packages/dask/dataframe/io/parquet/arrow.py:1061, in ArrowDatasetEngine._collect_dataset_info(cls, paths, fs, categories, index, gather_statistics, filters, split_row_groups, blocksize, aggregate_files, ignore_metadata_file, metadata_task_size, parquet_file_extension, kwargs)
1060 if ds is None:
-> 1061 ds = pa_ds.dataset(
1062 paths,
1063 filesystem=_wrapped_fs(fs),
1064 **_processed_dataset_kwargs,
1065 )
1067 # Get file_frag sample and extract physical_schema
File ~/miniconda3/envs/g3/lib/python3.11/site-packages/pyarrow/dataset.py:797, in dataset(source, schema, format, filesystem, partitioning, partition_base_dir, exclude_invalid_files, ignore_prefixes)
796 if all(_is_path_like(elem) or isinstance(elem, FileInfo) for elem in source):
...
150 else:
--> 151 raise exc from e
ArrowInvalid: An error occurred while calling the read_parquet method registered to the pandas backend.
Original Message: Error creating dataset. Could not read schema from '/Volumes/Sandisk2TB/G3_temp/data/sdata_tmp.zarr/points/points/points.parquet/._part.0.parquet'. Is this a 'parquet' file?: Could not open Parquet input source '/Volumes/Sandisk2TB/G3_temp/data/sdata_tmp.zarr/points/points/points.parquet/._part.0.parquet': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.
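The file named in the error, ._part.0.parquet, is an AppleDouble sidecar that macOS created next to the real part.0.parquet on the ExFAT volume. The sidecars inside the store can be listed to confirm they are present (a minimal sketch, using the placeholder path from the write example):
from pathlib import Path
store = "/Volumes/path/on/external/drive/sdata_tmp.zarr"
# list every AppleDouble sidecar file (._*) anywhere inside the zarr store
sidecars = [p for p in Path(store).rglob("._*") if p.is_file()]
for p in sidecars:
    print(p)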
Solution
The zarr can be loaded after removing the ._* files manually:
find "/Volumes/path/on/external/drive/sdata_tmp.zarr" -name '._*' -type f -delete