Reimplemented partial split download support (revival of #6832)
https://github.com/huggingface/datasets/pull/7648#issuecomment-3084050130
Close https://github.com/huggingface/datasets/issues/4101, and more
This PR is still a work in progress.
Mario’s Patch (in PR #6832):
def _make_split_generators_kwargs(self, prepare_split_kwargs):
    # Pass `pipeline` into `_split_generators()` from `prepare_split_kwargs` if
    # it's in the call signature of `_split_generators()`.
    # This allows for global preprocessing in beam.
    split_generators_kwargs = {}
    if "pipeline" in inspect.signature(self._split_generators).parameters:
        split_generators_kwargs["pipeline"] = prepare_split_kwargs["pipeline"]
    split_generators_kwargs.update(super()._make_split_generators_kwargs(prepare_split_kwargs))
    return split_generators_kwargs
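For anyone reading along, the signature check in the patch above can be tried in isolation. This is only a minimal sketch of the pattern, not code from either PR: `forward_optional_kwargs` and the two toy functions are hypothetical names made up for illustration.

```python
import inspect


def forward_optional_kwargs(func, available):
    """Return only the entries of `available` that `func` names as parameters.

    Hypothetical helper: it mirrors the `"pipeline" in
    inspect.signature(...).parameters` check, generalized to any key.
    """
    params = inspect.signature(func).parameters
    return {name: value for name, value in available.items() if name in params}


# Toy stand-ins for two `_split_generators` variants (illustrative only).
def split_generators_plain(dl_manager):
    return ["train", "test"]


def split_generators_with_pipeline(dl_manager, pipeline):
    return ["train", "test"]


available = {"pipeline": "fake-pipeline-object"}
plain_kwargs = forward_optional_kwargs(split_generators_plain, available)
pipeline_kwargs = forward_optional_kwargs(split_generators_with_pipeline, available)
```

Here `plain_kwargs` ends up empty and `pipeline_kwargs` carries the `pipeline` entry, so a function that does not declare the parameter is never called with it.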
In the latest main (both in my fork and in the upstream repo):
def _make_split_generators_kwargs(self, prepare_split_kwargs):
    """Get kwargs for `self._split_generators()` from `prepare_split_kwargs`."""
    splits = prepare_split_kwargs.pop("splits", None)
    if self._supports_partial_generation():
        return {"splits": splits}
    return {}
If I am not wrong, this passes splits into _split_generators() only for builders that support partial generation. So I have ignored the Beam logic for now!
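To make that gating concrete, here is a self-contained toy sketch. It does not import datasets: `ToyBuilder` and `PartialToyBuilder` are hypothetical classes, and the body of `_supports_partial_generation()` is my assumption (detecting a `splits` parameter in the signature), not necessarily what the library does.

```python
import inspect


class ToyBuilder:
    """Hypothetical builder whose `_split_generators` ignores `splits`."""

    def _split_generators(self, dl_manager):
        return ["train", "test"]

    def _supports_partial_generation(self):
        # Assumption: support is detected from a `splits` parameter.
        return "splits" in inspect.signature(self._split_generators).parameters

    def _make_split_generators_kwargs(self, prepare_split_kwargs):
        # Same shape as the method on main: `splits` is always popped, but
        # only forwarded when the builder supports partial generation.
        splits = prepare_split_kwargs.pop("splits", None)
        if self._supports_partial_generation():
            return {"splits": splits}
        return {}


class PartialToyBuilder(ToyBuilder):
    """Hypothetical builder that accepts `splits` and filters on it."""

    def _split_generators(self, dl_manager, splits=None):
        all_splits = ["train", "test"]
        if splits is None:
            return all_splits
        return [name for name in all_splits if name in splits]


plain_kwargs = ToyBuilder()._make_split_generators_kwargs({"splits": ["train"]})
partial = PartialToyBuilder()
partial_kwargs = partial._make_split_generators_kwargs({"splits": ["train"]})
selected = partial._split_generators(None, **partial_kwargs)
```

In this sketch the plain builder receives no extra kwargs and still yields every split, while the partial builder is handed `splits` and returns only the requested one.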
Awesome! By the way, we can modify GeneratorBasedBuilder and ArrowBasedBuilder if needed, now that custom loading scripts are no longer supported :)
I'll review this in a bit
@lhoestq @ArjunJagdale is this still a work in progress, or is only a review missing? Is there anything I can help with here? This would indeed be a cool feature.
I did a preliminary pass and it looks good, but we should check the CI. @ArjunJagdale, could you run make style so we can run the CI?
Done! Some parts may still be incomplete: I had to focus on important exams and semester activities, so I couldn't finish the work fully. I will still try my best.