add_column and add_item erroneously(?) require new_fingerprint parameter
Describe the bug
Contradicting their documentation (which doesn't mention the parameter at all), both Dataset.add_column and Dataset.add_item require a new_fingerprint string. This parameter is passed directly to the dataset constructor, which has the fingerprint parameter listed as optional; is there any reason it shouldn't be optional in these methods as well?
Steps to reproduce the bug
Reproduction steps:
- Look at the function signature for add_column: https://github.com/huggingface/datasets/blob/17f40a318a1f8c7d33c2a4dd17934f81d14a7f57/src/datasets/arrow_dataset.py#L6078
- Repeat for add_item: https://github.com/huggingface/datasets/blob/17f40a318a1f8c7d33c2a4dd17934f81d14a7f57/src/datasets/arrow_dataset.py#L6336
Expected behavior
add_column and add_item should either set the fingerprint parameter to optional or include it in their docstrings
Environment info
Not environment-dependent
Take this with a grain of salt, this is just my personal understanding: While you technically can overwrite the new_fingerprint with a string, e.g.
t = d.add_column("new_column", col_value, new_fingerprint="dummy_fp")
assert t._fingerprint == "dummy_fp" # this is true and will pass
this is not desired since the fingerprint should be calculated based on the operations (and their arguments) to be unique. This is handled by the fingerprint_transform function which needs a "new_fingerprint" keyword argument and creates a unique hash if its value is not set, see here. So it is probably safer to not document this keyword, since one doesn't want the user to actually use it and it's only a feature in very limited cases for people really knowing what they are doing. The thing that might be bugging people who read the code is that new_fingerprint seems to be required for add_item and add_column but it is actually set by the decorator (in which's definition it is optional), so maybe changing the signature of add_item and add_column to new_fingerprint: Optional[str] = None would make sense, since this is also how it's handled in the other cases (created by claude):
- flatten
- cast_column
- remove_columns
- rename_column
- rename_columns
- select_columns
- batch
- filter
- flatten_indices
- select
- _select_contiguous
- sort
- shuffle
-
train_test_split
So as you mentioned, I believe the methods erronously require the
new_fingerprintparameter and making them optional is a little consistency win.
Any updates on this? Current behavior suggests that user is supposed to pass in the new_fingerprint argument, while examples in documentation doesn't include that argument at all.