[Fix] Full update of data_dir in DatasetInfo with update_data_dir()
Description
The previous `DatasetInfo.update_data_dir()` only updates the `data_dir` in each split, but not in its `DatasetIdentity`, making the update incomplete. The `data_dir` property also has no setter, so this adds a one-liner to update the identity when `update_data_dir()` is called.
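The shape of the fix can be sketched with a simplified, self-contained model (the class layout and attribute names below are illustrative assumptions, not the actual `dataset_info.py` internals):

```python
import dataclasses


@dataclasses.dataclass(frozen=True)
class DatasetIdentity:
    # Simplified stand-in for TFDS's DatasetIdentity.
    name: str
    data_dir: str


class DatasetInfo:
    # Simplified stand-in: splits are modeled as a name -> data_dir mapping.
    def __init__(self, identity, split_data_dirs):
        self._identity = identity
        self._split_data_dirs = split_data_dirs

    @property
    def data_dir(self):
        # Read-only: there is no setter, hence the need for update_data_dir().
        return self._identity.data_dir

    def update_data_dir(self, data_dir):
        # Previously only the splits were updated...
        for split in self._split_data_dirs:
            self._split_data_dirs[split] = data_dir
        # ...the proposed one-liner also refreshes the identity, so
        # info.data_dir reflects the new location as well.
        self._identity = dataclasses.replace(self._identity, data_dir=data_dir)


info = DatasetInfo(
    DatasetIdentity("mnist", "/old/dir"),
    {"train": "/old/dir", "test": "/old/dir"},
)
info.update_data_dir("/new/dir")
```

After the call, both the per-split directories and `info.data_dir` point at `/new/dir`, which is the completeness the PR description is after.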
Thanks @youliangtan for your interest in TFDS!
Do you have an example when the update is incomplete causing issues?
We use this function only once, in https://github.com/tensorflow/datasets/blob/47f7a2e653d1f8c5ae9e41749dfbbf20312c422c/tensorflow_datasets/core/dataset_builder.py#L702-L703, to fix `data_dir` in splits, but the identity should already have the right `data_dir` after the dataset builder was initialized: https://github.com/tensorflow/datasets/blob/47f7a2e653d1f8c5ae9e41749dfbbf20312c422c/tensorflow_datasets/core/dataset_builder.py#L281-L284
Thanks for the reply. I'm using this method to update to a new `data_dir` after `DatasetInfo` is constructed.
The code example is here. This script is mainly to merge or reshard the RLDS data; maybe there's a better way of doing this too. :man_shrugging:
I think `DatasetInfo.update_data_dir` should be a private method and only used internally.
For your code, I suggest considering these options:
- Use other tools (for example https://github.com/google/seqio) to combine multiple datasets.
- Dynamically define a dataset builder that gets examples from multiple datasets in `_generate_examples()`.
- Construct your `dataset_info` using https://github.com/tensorflow/datasets/blob/47f7a2e653d1f8c5ae9e41749dfbbf20312c422c/tensorflow_datasets/core/dataset_info.py#L101
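The second option above can be sketched without depending on TFDS at all. Here the sources are plain iterables of `(key, example)` pairs standing in for the real datasets, and `CombinedBuilder` is a hypothetical name; real code would subclass `tfds.core.GeneratorBasedBuilder` and read from the source datasets inside `_generate_examples()`:

```python
import itertools


class CombinedBuilder:
    # Illustrative sketch only: models a builder whose _generate_examples()
    # chains examples from several source datasets.
    def __init__(self, sources):
        # Each source is an iterable of (key, example) pairs.
        self._sources = sources

    def _generate_examples(self):
        # Re-key sequentially so keys stay unique across merged sources.
        for i, (_, example) in enumerate(
            itertools.chain.from_iterable(self._sources)
        ):
            yield str(i), example


src_a = [("a0", {"x": 1}), ("a1", {"x": 2})]
src_b = [("b0", {"x": 3})]
merged = list(CombinedBuilder([src_a, src_b])._generate_examples())
```

The re-keying step matters because TFDS requires example keys to be unique within a split, and two shards merged naively may collide.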
Thanks for the suggestion. I think I will keep the current code as light as it is for now, without depending on an external library.
Also, could `update_data_dir()` be a public method? Otherwise it would be better to name it `_update_data_dir` :thinking:
I agree, that would be a nice change.
Have you looked into builder_from_directories? (see https://github.com/tensorflow/datasets/blob/303814edf2437e607574d6c128d85ac6cfa6c30a/tensorflow_datasets/core/read_only_builder.py#L152) This merges multiple datasets.