
[Feature]: Document and Add Control Flag for Automatic Object Identity-Based Deduplication

Open · bendichter opened this issue 4 months ago · 3 comments

Summary

HDMF currently performs automatic object identity-based deduplication when the same Python object is referenced multiple times in a hierarchical data structure. This behavior creates soft links instead of duplicating data, but it is undocumented and cannot be controlled by users.

Current Behavior

When the same Python Data or Container object is used in multiple locations within a hierarchical structure:

  1. The first occurrence is stored as a full dataset/group
  2. Subsequent references become soft links to the original location
  3. This happens automatically based on Python object identity (id(obj))

This becomes a problem if the user later wants to edit one of the objects but not the other: because both paths resolve to the same stored dataset, an edit made through one path is visible through both.

Example:

import numpy as np
from hdmf import Container, Data

# Same Data object attached in two places (SomeContainer is a placeholder
# for any Container subclass that accepts a `data` argument)
shared_data = Data(name="shared", data=np.array([1, 2, 3, 4, 5]))

container1 = SomeContainer(name="container1", data=shared_data)
container2 = SomeContainer(name="container2", data=shared_data)  # will become a soft link

# Results in HDF5 file:
# /container1/data -> actual dataset
# /container2/data -> soft link to /container1/data
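
A quick way to confirm which form was written is to inspect the file with h5py (a minimal sketch, assuming the example above was written to example.h5):

import h5py

# getlink=True returns the link object itself instead of dereferencing it
with h5py.File("example.h5", "r") as f:
    link = f["/container2"].get("data", getlink=True)
    print(type(link))  # h5py.SoftLink if deduplicated, h5py.HardLink otherwise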

Issues with Current Implementation

1. Undocumented Behavior

  • This behavior is not documented in user guides
  • Users may be surprised by soft links appearing in their files
  • The distinction between object identity vs. content equality is not clear

2. No User Control

  • Users cannot disable this behavior if they want separate copies (a manual workaround is sketched after this list)
  • No way to force duplication even when using the same object
  • Behavior is always implicit based on object identity
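
Until such a flag exists, the only way to force separate copies is to break object identity yourself, e.g. by constructing a second Data object with the same contents (a sketch, assuming the identity-based behavior described above; SomeContainer is the same placeholder as in the example earlier):

import numpy as np
from hdmf import Data

# Distinct Python objects are not deduplicated, even when their contents
# are equal, so each container gets its own full dataset on write
data1 = Data(name="shared", data=np.array([1, 2, 3, 4, 5]))
data2 = Data(name="shared", data=np.array([1, 2, 3, 4, 5]))  # separate object

container1 = SomeContainer(name="container1", data=data1)
container2 = SomeContainer(name="container2", data=data2)  # full dataset, not a link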

3. Schema Documentation Gap

  • HDMF schema documentation doesn't mention this linking behavior

It is important for downstream tool developers to know that they need to handle soft links anywhere an identical dataset might be stored.

What solution would you like?

  1. I would like the documentation updated to describe this behavior
  2. I would like control over whether this happens:
# In HDF5IO.write() and related methods
io.write(container, deduplicate_objects=True)  # Current default behavior
io.write(container, deduplicate_objects=False) # Force separate copies

# Or in BuildManager
manager.build(container, deduplicate_objects=False)
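
For context on where this would hook in: the deduplication comes from the BuildManager, which caches builders keyed by id(container), so a second build of the same object returns the cached builder and the backend writes it as a link. A hypothetical sketch of how the proposed flag could bypass that cache (names and structure are approximate, not actual HDMF code):

class BuildManagerSketch:
    # Illustration only; the real BuildManager dispatches through a TypeMap
    def __init__(self):
        self._builders = {}  # cache keyed by id(container)

    def build(self, container, deduplicate_objects=True):
        if deduplicate_objects and id(container) in self._builders:
            # Returning the cached builder is what turns a repeated
            # occurrence into a soft link at write time
            return self._builders[id(container)]
        builder = self._make_builder(container)  # stand-in for real dispatch
        if deduplicate_objects:
            self._builders[id(container)] = builder
        return builder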

Do you have any interest in helping implement the feature?

Yes.

bendichter · Aug 14 '25 15:08

This makes sense to add as an option. Certainly we should document the current behavior. Could you elaborate on the use case so I understand better when this would be used? You said that a user writes the same container to two different locations in the file and expects that they (or someone else) will edit only one of those instances later. When would this happen?

rly · Aug 14 '25 17:08

Creating a flag on HDMFIO.write seems reasonable, similar to how we currently have the flag link_data on HDF5IO.write.
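
For reference, that existing flag is passed the same way the proposed one would be (a sketch; out.h5 is an arbitrary path and manager is assumed to be a suitable BuildManager):

from hdmf.backends.hdf5 import HDF5IO

# link_data controls whether datasets backed by another file are linked or
# copied on write; deduplicate_objects could sit alongside it
with HDF5IO("out.h5", mode="w", manager=manager) as io:
    io.write(container, link_data=False)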

rly · Aug 14 '25 17:08

@rly I can imagine a use case but I don't have a real one at the moment. I was mostly surprised that we use links this way when, as far as I know, the schema says nothing about this being allowed, and I don't think it's documented either. MatNWB does not behave this way, so it's a divergence between the software packages. Not a huge issue, but it should at least be documented.

bendichter · Aug 15 '25 03:08