WIP [data] A streaming compatible implementation of repartition-by-column
WIP
tl;dr: For a large partitioned dataset whose groups are contiguous, we can avoid groupby.map_groups (and the sort within it) by using repartition-by-column. See #42288
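The idea in the tl;dr can be illustrated with a minimal pure-Python sketch. This is not the implementation in this PR and does not use any Ray Data API: the function name `repartition_by_column`, the `Row`/`Block` aliases, and the representation of blocks as plain lists of dicts are all hypothetical, chosen only for illustration. It assumes rows sharing a key are already adjacent in the stream (a group may span a block boundary), which is what makes a sort or shuffle unnecessary.

```python
from typing import Any, Dict, Iterable, Iterator, List

# Hypothetical stand-ins for illustration; real Ray Data blocks are not lists of dicts.
Row = Dict[str, Any]
Block = List[Row]


def repartition_by_column(blocks: Iterable[Block], key: str) -> Iterator[Block]:
    """Yield one output block per run of equal `key` values.

    Assumes rows with the same key are adjacent in the input stream,
    possibly spanning input block boundaries. Only the rows of the
    group currently being accumulated are held in memory.
    """
    pending: Block = []  # rows of the group seen so far
    for block in blocks:
        for row in block:
            # A change in key value closes the current group.
            if pending and row[key] != pending[-1][key]:
                yield pending
                pending = []
            pending.append(row)
    if pending:  # flush the last group
        yield pending


# Group "b" straddles the boundary between the two input blocks.
blocks = [
    [{"g": "a", "x": 1}, {"g": "a", "x": 2}, {"g": "b", "x": 3}],
    [{"g": "b", "x": 4}, {"g": "c", "x": 5}],
]
groups = list(repartition_by_column(blocks, "g"))

# Each output block holds exactly one group, so a per-group function
# can be applied block by block without a groupby.
sums = [(grp[0]["g"], sum(r["x"] for r in grp)) for grp in groups]
```

Because the generator consumes input blocks one at a time and emits a group as soon as its key changes, it never needs the full dataset in memory. If the contiguity assumption does not hold, the same key would simply appear in more than one output block.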
TODO:
- [ ] finalize the public API
- [ ] handle concurrency
- [ ] add unit tests
Why are these changes needed?
Related issue number
Closes #42288
Checks
- [ ] I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- [ ] I've run scripts/format.sh to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - [ ] Unit tests
  - [ ] Release tests
  - [ ] This PR is not tested :(
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.
- If you'd like to keep this open, just leave any comment, and the stale label will be removed.
This feature is great! May I ask when this PR can be merged? @wingkitlee0
Thanks for the interest. It probably needs a little more work to finalize things... There is no timeline, as I am not actively working on this.