
Tolerance when reading corrupted data

Open LucaMarconato opened this issue 1 year ago • 1 comments

If something goes wrong while writing a component of a Zarr store, especially a .zattrs (JSON) file, the corrupted data can prevent the entire store from being read. We could implement a tolerance mechanism that reads as much of the non-corrupted data as possible and reports to the user what was detected as corrupted. That way the user knows what is broken and can fix it manually.

Originally suggested by @aeisenbarth

LucaMarconato avatar Feb 14 '24 15:02 LucaMarconato

Thanks! To expand on the issue:

The leading question is: how do I handle corrupted data that users bring to me? The bigger a dataset collection, the greater its value, but also the greater the risk that a tiny bit of it somewhere becomes corrupted.

By corruption I mostly mean:

  • inconsistent JSON file:
    • labels/.zattrs referring to a label name not existing in labels
    • table/table/.zarr referring to a region that is not found
    • table/table/obs/.zarr referring to a column that is not found or has been renamed
    • consolidated zmetadata inconsistent with actually existing elements
  • unreadable JSON file (aborted during write, syntax error, element not recoverable)
  • unreadable binary array data (element not recoverable)

If the corrupted file is not a top-level file and not required (such as the table's region/instances column), the other elements should remain readable.
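Detecting the first class of corruption (unreadable JSON sidecar files) can be done without aborting on the first failure. A minimal sketch, using only the standard library; `read_attrs_tolerant` and its return shape are hypothetical, not part of the spatialdata API:

```python
import json
from pathlib import Path


def read_attrs_tolerant(paths):
    """Attempt to parse each .zattrs-like JSON file; collect failures
    instead of aborting on the first corrupted one."""
    parsed, errors = {}, {}
    for path in paths:
        try:
            parsed[path] = json.loads(Path(path).read_text())
        except (OSError, json.JSONDecodeError) as exc:
            # Record the error and move on; the caller can report
            # all corrupted files at once.
            errors[path] = exc
    return parsed, errors
```

The same pattern (per-element try/except plus an error collection) would extend to the consistency checks, e.g. verifying that every label name referenced in `labels/.zattrs` actually exists on disk.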

Motivation

spatialdata is the only API able to read SpatialData stores, so when a store is partially corrupted, we can currently only use external tools to manipulate (or delete) files until a valid state is reached. One should expect the official API to handle the non-corrupted parts more safely than any external tool.

Feature

The read function should have an optional, "forgiving" read mode in which the severity level of read errors can be reduced (to a warning, or maybe a pydantic-like collection of validation errors), so that in this mode a SpatialData object is always returned that contains at least the valid elements (in the worst case, none). Then I can cleanly remove corrupted elements or overwrite them with valid data.
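The proposed mode could look like the following sketch. Everything here is hypothetical (the `on_error` parameter, the `readers` mapping, and the return shape are assumptions for illustration, not spatialdata API): each element is loaded independently, and failures are either raised (strict mode) or downgraded to warnings while the valid elements are still returned.

```python
import warnings


def read_elements(readers, on_error="raise"):
    """Hypothetical forgiving read. `readers` maps element names to
    zero-argument callables that each load one element of the store.
    With on_error="warn", failures become warnings and the valid
    elements are still returned alongside the collected errors."""
    elements, failures = {}, {}
    for name, load in readers.items():
        try:
            elements[name] = load()
        except Exception as exc:
            if on_error == "raise":
                raise
            failures[name] = exc
            warnings.warn(f"Skipping corrupted element {name!r}: {exc}")
    return elements, failures
```

Returning the failures alongside the partial result (rather than only warning) is what makes the "cleanly remove or overwrite" workflow possible: the caller knows exactly which elements to repair.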

Examples

pandas.read_csv

  • on_bad_lines: {‘error’, ‘warn’, ‘skip’} or Callable, default ‘error’
  • encoding_errors: str, optional, default ‘strict’, with values from the standard library codecs module, which even has repair options
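The codecs error handlers mentioned above show the same graduated strictness in the standard library itself, with `strict` raising and the other handlers salvaging what they can:

```python
# A byte string that is valid latin-1 but invalid UTF-8.
data = b"caf\xe9"

# errors="strict" (the default) would raise UnicodeDecodeError:
#   data.decode("utf-8")

print(data.decode("utf-8", errors="replace"))          # caf\ufffd (U+FFFD)
print(data.decode("utf-8", errors="ignore"))           # caf
print(data.decode("utf-8", errors="backslashreplace")) # caf\xe9
```

A forgiving spatialdata read mode would be the store-level analogue: `strict` for today's behaviour, plus a handler that skips or flags the undecodable parts.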

aeisenbarth avatar Feb 14 '24 17:02 aeisenbarth