Add custom fingerprint support to `from_generator`
This PR adds a `dataset_id_suffix` parameter to the `Dataset.from_generator` function.
`Dataset.from_generator` passes all of its arguments, including the generator function itself, to `BuilderConfig.create_config_id`, which tries to hash every one of them. This can take a long time, or even raise a `MemoryError`, if the dataset processed in the generator function is large enough.
This PR lets the user pass a custom fingerprint (`dataset_id_suffix`) to be used as the suffix of the dataset name instead of the one generated by hashing the arguments.
This PR is a possible solution to #7513.
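To illustrate the problem, here is a rough, self-contained sketch of the two code paths. It is not the library's implementation: `create_config_id_sketch` and `custom_suffix` are hypothetical names, and `pickle` plus `hashlib` stand in for whatever hashing the builder actually performs.

```python
import hashlib
import pickle


def create_config_id_sketch(config_kwargs, custom_suffix=None):
    """Hypothetical stand-in for the config-id logic; not the library's code."""
    if custom_suffix is not None:
        # Caller-supplied suffix: the arguments are never serialized or hashed.
        return "default-" + custom_suffix
    # Default path: serialize every argument, then hash the resulting bytes.
    # The cost grows with the size of whatever the arguments reference.
    payload = pickle.dumps(sorted(config_kwargs.items()))
    return "default-" + hashlib.sha256(payload).hexdigest()[:16]


# Stands in for a large object captured by the generator or its kwargs.
big_gen_kwargs = {"rows": list(range(100_000))}

hashed_id = create_config_id_sketch({"gen_kwargs": big_gen_kwargs})
custom_id = create_config_id_sketch(
    {"gen_kwargs": big_gen_kwargs}, custom_suffix="my_dataset_v1"
)
```

In the default path the whole payload has to be materialized in memory before it is hashed, which is where the slowdown and the possible `MemoryError` would come from. With a caller-supplied suffix that step is skipped entirely.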
This is great!
What do you think of passing `config_id=` directly to the builder instead of just the suffix? This would be a power-user argument though, or for internal use. And in `from_generator` the new argument can be `fingerprint=`, as in `Dataset.__init__()`.
The `config_id` can be defined using something like `config_id = "default-fingerprint=" + fingerprint`.
I feel like this could make the Dataset API more coherent if we avoid introducing a new argument when we can just use `fingerprint=`.
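A minimal sketch of that suggestion, assuming the `config_id` is simply a config name followed by the fingerprint. The helper name `config_id_from_fingerprint` and its `config_name` parameter are hypothetical and only illustrate the string layout proposed above.

```python
def config_id_from_fingerprint(fingerprint, config_name="default"):
    """Hypothetical helper: build a config id from a user-supplied fingerprint."""
    # Mirrors the suggested form: "default-fingerprint=" + fingerprint
    return f"{config_name}-fingerprint={fingerprint}"


config_id = config_id_from_fingerprint("abc123")
```

Because the id is derived only from the fingerprint string, the generator function and its arguments never need to be hashed.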
@lhoestq could you please re-review the changes I made?
@lhoestq ping
I also added a simple test for the fingerprint parameter
@lhoestq could you please review the PR? I implemented the requested changes and added a test for the added fingerprint arg