datasets icon indicating copy to clipboard operation
datasets copied to clipboard

add_column and add_item erroneously(?) require new_fingerprint parameter

Open echthesia opened this issue 2 months ago • 2 comments

Describe the bug

Contradicting their documentation (which doesn't mention the parameter at all), both Dataset.add_column and Dataset.add_item require a new_fingerprint string. This parameter is passed directly to the dataset constructor, which has the fingerprint parameter listed as optional; is there any reason it shouldn't be optional in these methods as well?

Steps to reproduce the bug

Reproduction steps:

  1. Look at the function signature for add_column: https://github.com/huggingface/datasets/blob/17f40a318a1f8c7d33c2a4dd17934f81d14a7f57/src/datasets/arrow_dataset.py#L6078
  2. Repeat for add_item: https://github.com/huggingface/datasets/blob/17f40a318a1f8c7d33c2a4dd17934f81d14a7f57/src/datasets/arrow_dataset.py#L6336

Expected behavior

add_column and add_item should either set the fingerprint parameter to optional or include it in their docstrings

Environment info

Not environment-dependent

echthesia avatar Nov 13 '25 02:11 echthesia

Take this with a grain of salt, this is just my personal understanding: While you technically can overwrite the new_fingerprint with a string, e.g.

t = d.add_column("new_column", col_value, new_fingerprint="dummy_fp")
assert t._fingerprint == "dummy_fp"  # this is true and will pass

this is not desired since the fingerprint should be calculated based on the operations (and their arguments) to be unique. This is handled by the fingerprint_transform function which needs a "new_fingerprint" keyword argument and creates a unique hash if its value is not set, see here. So it is probably safer to not document this keyword, since one doesn't want the user to actually use it and it's only a feature in very limited cases for people really knowing what they are doing. The thing that might be bugging people who read the code is that new_fingerprint seems to be required for add_item and add_column but it is actually set by the decorator (in which's definition it is optional), so maybe changing the signature of add_item and add_column to new_fingerprint: Optional[str] = None would make sense, since this is also how it's handled in the other cases (created by claude):

CloseChoice avatar Nov 24 '25 20:11 CloseChoice

Any updates on this? Current behavior suggests that user is supposed to pass in the new_fingerprint argument, while examples in documentation doesn't include that argument at all.

artpods56 avatar Dec 07 '25 14:12 artpods56