
Reimplemented partial split download support (revival of #6832)

Open ArjunJagdale opened this issue 5 months ago • 5 comments

(revival of #6832)

https://github.com/huggingface/datasets/pull/7648#issuecomment-3084050130

Close https://github.com/huggingface/datasets/issues/4101, and more


This PR is still a work in progress.

ArjunJagdale avatar Jul 28 '25 19:07 ArjunJagdale

Mario’s Patch (in PR #6832):

def _make_split_generators_kwargs(self, prepare_split_kwargs):
    # Pass `pipeline` into `_split_generators()` from `prepare_split_kwargs` if
    # it's in the call signature of `_split_generators()`.
    # This allows for global preprocessing in beam.
    split_generators_kwargs = {}
    if "pipeline" in inspect.signature(self._split_generators).parameters:
        split_generators_kwargs["pipeline"] = prepare_split_kwargs["pipeline"]
    split_generators_kwargs.update(super()._make_split_generators_kwargs(prepare_split_kwargs))
    return split_generators_kwargs

In the latest main (both in my fork and in the upstream repo):

def _make_split_generators_kwargs(self, prepare_split_kwargs):
    """Get kwargs for `self._split_generators()` from `prepare_split_kwargs`."""
    splits = prepare_split_kwargs.pop("splits", None)
    if self._supports_partial_generation():
        return {"splits": splits}
    return {}

If I understand correctly, this passes splits into _split_generators() only for builders that support partial generation, so I have left the Beam logic out for now.
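To make the behaviour above concrete, here is a minimal, self-contained sketch. It does not use the real `datasets` classes: `ToyBuilder` and `PartialToyBuilder` are hypothetical stand-ins, and the body of `_supports_partial_generation()` is an assumption (a builder opts in by accepting a `splits` parameter in `_split_generators()`), since its actual implementation is not shown in this thread.

```python
import inspect


class ToyBuilder:
    """Hypothetical stand-in for a dataset builder, not the real `datasets` class."""

    def _split_generators(self, dl_manager):
        # Legacy signature: no `splits` parameter, so every split is generated.
        return ["train", "test"]

    def _supports_partial_generation(self):
        # Assumption: support is detected from the `_split_generators` signature.
        return "splits" in inspect.signature(self._split_generators).parameters

    def _make_split_generators_kwargs(self, prepare_split_kwargs):
        # Same shape as the method quoted above from main.
        splits = prepare_split_kwargs.pop("splits", None)
        if self._supports_partial_generation():
            return {"splits": splits}
        return {}


class PartialToyBuilder(ToyBuilder):
    def _split_generators(self, dl_manager, splits=None):
        # Opt-in signature: only the requested splits are returned.
        available = ["train", "test"]
        if splits is None:
            return available
        return [name for name in available if name in splits]


legacy_kwargs = ToyBuilder()._make_split_generators_kwargs({"splits": ["train"]})
partial_kwargs = PartialToyBuilder()._make_split_generators_kwargs({"splits": ["train"]})
selected = PartialToyBuilder()._split_generators(None, **partial_kwargs)
```

In this sketch the legacy builder receives no extra kwargs and keeps generating every split, while the opt-in builder receives the requested `splits` and can skip the rest.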

ArjunJagdale avatar Jul 28 '25 19:07 ArjunJagdale

Awesome! By the way, we can now modify GeneratorBasedBuilder and ArrowBasedBuilder if needed, since custom loading scripts are no longer supported :)

I'll review this in a bit

lhoestq avatar Sep 04 '25 10:09 lhoestq

@lhoestq @ArjunJagdale is this still a work in progress, or is only a review missing? Is there anything I can help with here? This would indeed be a cool feature.

CloseChoice avatar Oct 28 '25 15:10 CloseChoice

I did a preliminary pass and it looks good, but we should check the CI. Could you run make style @ArjunJagdale so we can run the CI?

lhoestq avatar Oct 28 '25 16:10 lhoestq

Done! Some parts may still be incomplete, because I had to focus on exams and semester activities and couldn't finish the work fully. I will still try my best to complete it.

ArjunJagdale avatar Oct 29 '25 10:10 ArjunJagdale