Add custom fingerprint support to `from_generator`

Open simonreise opened this issue 8 months ago • 1 comments

This PR adds a `dataset_id_suffix` parameter to the `Dataset.from_generator` function.

`Dataset.from_generator` passes all of its arguments to `BuilderConfig.create_config_id`, including the generator function itself. `BuilderConfig.create_config_id` then tries to hash all of those arguments, which can take a long time or even raise a `MemoryError` if the dataset processed in the generator function is large enough.

This PR allows the user to pass a custom fingerprint (`dataset_id_suffix`) that is used as the suffix in the dataset name instead of the one generated by hashing the arguments.
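The idea can be illustrated with a minimal, standard-library-only sketch. It is not the code from this PR or from `datasets`: `hash_args` and `resolve_suffix` are hypothetical helpers, and `pickle` plus `hashlib` stand in for the library's own hashing. The sketch only shows that hashing has to serialize every argument, while a caller-supplied suffix skips that work entirely.

```python
import hashlib
import pickle
from typing import Any, Optional


def hash_args(args: dict[str, Any]) -> str:
    """Hypothetical stand-in for argument hashing: serialize everything, then digest it."""
    # The serialized payload grows with the size of the arguments,
    # which is where the time and memory cost comes from.
    payload = pickle.dumps(sorted(args.items()))
    return hashlib.sha256(payload).hexdigest()[:16]


def resolve_suffix(args: dict[str, Any], dataset_id_suffix: Optional[str] = None) -> str:
    """Return the caller-supplied suffix if given; otherwise fall back to hashing."""
    if dataset_id_suffix is not None:
        return dataset_id_suffix  # no serialization and no hashing happen here
    return hash_args(args)


# A moderately large argument, standing in for data captured by a generator.
big_args = {"rows": list(range(100_000))}

hashed = resolve_suffix(big_args)  # serializes and hashes all rows
custom = resolve_suffix(big_args, dataset_id_suffix="my-dataset-v1")  # skips hashing
```

With a custom suffix, the caller becomes responsible for changing it whenever the underlying data changes, since nothing is derived from the arguments any more.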

This PR is a possible solution to #7513.

simonreise avatar Apr 23 '25 19:04 simonreise

This is great!

What do you think of passing `config_id=` directly to the builder instead of just the suffix? This would be a power-user argument though, or for internal use. And in `from_generator` the new argument can be `fingerprint=`, as in `Dataset.__init__()`.

The `config_id` can be defined using something like `config_id = "default-fingerprint=" + fingerprint`.

I feel like this could make the Dataset API more coherent if we avoid introducing a new argument when we can just use `fingerprint=`.
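A minimal sketch of the suggested `config_id` format, assuming a hypothetical helper named `build_config_id` (the name and the `config_name` parameter are illustrative and not part of the `datasets` API):

```python
def build_config_id(fingerprint: str, config_name: str = "default") -> str:
    """Build a config id of the form '<config_name>-fingerprint=<fingerprint>'."""
    # Hypothetical helper: it only mirrors the string format suggested above.
    return f"{config_name}-fingerprint={fingerprint}"


config_id = build_config_id("abc123")
```

Because the id is derived only from the short fingerprint string, building it does not require hashing the generator or its arguments.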

lhoestq avatar Apr 24 '25 10:04 lhoestq

@lhoestq could you please re-review the changes I made?

simonreise avatar Jul 10 '25 09:07 simonreise

@lhoestq ping. I also added a simple test for the `fingerprint` parameter.

simonreise avatar Aug 11 '25 10:08 simonreise

@lhoestq could you please review the PR? I implemented the requested changes and added a test for the new `fingerprint` argument.

simonreise avatar Oct 23 '25 12:10 simonreise

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.