FEAT: Object-augmented schemas -- Object Type

Open dimitri-yatsenko opened this issue 4 months ago • 1 comments

Feature Request

Problem

Modern scientific pipelines must manage large, complex data objects (e.g., images, time-series, n-dimensional arrays) that are impractical to store directly in a relational database. The current approach of storing file paths as strings is brittle and error-prone; DataJoint has no awareness of the external file, cannot manage its lifecycle, and cannot guarantee its integrity. This disconnect breaks the seamless nature of the pipeline and places a significant manual burden on the user to maintain data consistency between the database and the external storage.

Requirements

Introduce the object attribute type, which natively supports a hybrid storage model where metadata resides in the database and the data object resides in an external store. This implementation must adhere to the DataJoint 2.0 Specification.

Core requirements:

  1. object Attribute Type:
  • [ ] Introduce a new core attribute type named object.
  • [ ] When an attribute is declared as type object, the database table will store a reference key (e.g., path, UUID) and associated metadata, not the data object itself.
  2. dj.Object Interface:
  • [ ] Interaction with object stores is implemented through the dj.Object base class.
  • [ ] Users can subclass dj.Object to define custom handlers for their external data objects.
  • [ ] Project configuration files select and configure the object store.
  • [ ] Any class inheriting from dj.Object MUST implement the following standard interface:
    • put(self, store, key: str) -> dict: Writes the object's data to the specified storage backend under the given key and returns a dictionary of metadata to be stored in the database.
    • get(cls, store, key: str) -> "dj.Object": A class method that reads data from the store by key and reconstructs the Python object.
    • get_meta(self) -> dict: Returns a dictionary of metadata about the object instance.
    • verify(self, store, key: str) -> bool: Verifies the existence and integrity (e.g., via checksum) of the object in the external store.
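
To make the proposed interface concrete, here is a minimal sketch of what a handler could look like. None of this exists in DataJoint today: `Object` stands in for the proposed `dj.Object`, `BytesObject` is a hypothetical subclass, and the store is assumed to be a plain dict rather than a real storage backend.

```python
import hashlib
from abc import ABC, abstractmethod


class Object(ABC):
    """Stand-in for the proposed dj.Object base class (an assumption, not an existing API)."""

    @abstractmethod
    def put(self, store, key: str) -> dict:
        """Write the data to the store under key and return metadata for the database."""

    @classmethod
    @abstractmethod
    def get(cls, store, key: str) -> "Object":
        """Read the data stored under key and rebuild the Python object."""

    @abstractmethod
    def get_meta(self) -> dict:
        """Return metadata describing this instance."""

    @abstractmethod
    def verify(self, store, key: str) -> bool:
        """Check that the stored object exists and matches this instance."""


class BytesObject(Object):
    """Hypothetical handler for raw byte payloads."""

    def __init__(self, data: bytes):
        self.data = data

    def put(self, store, key: str) -> dict:
        store[key] = self.data  # a dict plays the role of the object store here
        return {"key": key, **self.get_meta()}

    @classmethod
    def get(cls, store, key: str) -> "BytesObject":
        return cls(store[key])

    def get_meta(self) -> dict:
        return {
            "format": "bytes",
            "size": len(self.data),
            "checksum": hashlib.sha256(self.data).hexdigest(),
        }

    def verify(self, store, key: str) -> bool:
        if key not in store:
            return False
        stored_checksum = hashlib.sha256(store[key]).hexdigest()
        return stored_checksum == self.get_meta()["checksum"]


store = {}
obj = BytesObject(b"example payload")
meta = obj.put(store, "session1/frame0")
restored = BytesObject.get(store, "session1/frame0")
```

In this sketch `put` returns the dictionary that would be written to the relational table, and `verify` recomputes the checksum from the store rather than trusting the recorded one.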

Metadata Management:

  • [ ] For every attribute of type object, the system must automatically store essential metadata in the relational table alongside the object reference.
  • [ ] This metadata MUST include fields for object key/path, file format, size, and a checksum (e.g., MD5, SHA256) to ensure data integrity.
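
As a rough illustration of the metadata record described above, the following sketch builds the four required fields for a file on disk. The `ObjectMeta` dataclass, its field names, and the `describe_file` helper are all hypothetical; the issue only specifies which pieces of information must be stored, not how they are named or computed. SHA-256 is used here as one of the checksum options mentioned.

```python
import hashlib
import os
import tempfile
from dataclasses import asdict, dataclass


@dataclass
class ObjectMeta:
    # Field names are illustrative; the issue only lists what must be recorded.
    key: str
    format: str
    size: int
    checksum: str


def describe_file(path: str, key: str, chunk_size: int = 1 << 20) -> ObjectMeta:
    """Collect key, format, size, and SHA-256 checksum for a local file."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Hash in chunks so large objects are never loaded fully into memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    extension = os.path.splitext(path)[1].lstrip(".")
    return ObjectMeta(
        key=key,
        format=extension,
        size=os.path.getsize(path),
        checksum=digest.hexdigest(),
    )


# Demonstrate on a temporary 1 KiB file.
with tempfile.NamedTemporaryFile(suffix=".bin", delete=False) as tmp:
    tmp.write(b"\x00" * 1024)
    tmp_path = tmp.name

meta = describe_file(tmp_path, key="scan/0001.bin")
os.remove(tmp_path)
row = asdict(meta)  # the dictionary that would sit alongside the object reference
```

Inferring the format from the file extension is a simplification; a real implementation would more likely take the format from the handler class.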

dimitri-yatsenko · Aug 18 '25 11:08

This issue is stale because it has been open for 45 days with no activity.

github-actions[bot] · Oct 03 '25 02:10