WIP [data] A streaming compatible implementation of repartition-by-column

Open wingkitlee0 opened this issue 1 year ago • 3 comments

WIP

tl;dr: For a large partitioned dataset with continuous groups, we can avoid groupby.map_groups (and the sort within it) by using repartition-by-column. See #42288.
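To illustrate the idea only, here is a minimal sketch in plain Python. It is not the code in this PR and does not use the Ray Data API. It assumes that rows sharing a key already sit next to each other in the input stream, so group boundaries can be found in a single pass without sorting. The function name `repartition_by_column` and the row layout (a list of dicts per batch) are hypothetical.

```python
from itertools import chain, groupby
from typing import Any, Dict, Iterable, Iterator, List

Row = Dict[str, Any]


def repartition_by_column(
    batches: Iterable[List[Row]], column: str
) -> Iterator[List[Row]]:
    """Yield one block per run of consecutive rows sharing `column`.

    Hypothetical helper for illustration. Assumes each group is contiguous
    in the input; a key that reappears later starts a new block.
    """
    # Flatten the incoming batches lazily so only one group is held in memory.
    rows = chain.from_iterable(batches)
    # itertools.groupby splits on consecutive equal keys, so no sort is needed.
    for _key, group in groupby(rows, key=lambda row: row[column]):
        yield list(group)


# Input batches do not align with group boundaries: group 1 spans two batches.
batches = [
    [{"g": 1, "v": 1}, {"g": 1, "v": 2}],
    [{"g": 1, "v": 3}, {"g": 2, "v": 4}],
    [{"g": 2, "v": 5}],
]
blocks = list(repartition_by_column(batches, "g"))
print([len(block) for block in blocks])  # [3, 2]
```

Because the generator consumes its input lazily, downstream work on one group can start before the rest of the dataset has been read, which is the property a sort-based groupby does not have.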

TODO:

  • [ ] finalize the public API
  • [ ] handle concurrency
  • [ ] add unit tests

Why are these changes needed?

Related issue number

Closes #42288

Checks

  • [ ] I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • [ ] I've run scripts/format.sh to lint the changes in this PR.
  • [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
    • [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.
  • [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • [ ] Unit tests
    • [ ] Release tests
    • [ ] This PR is not tested :(

wingkitlee0, Jan 18 '24 03:01

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

  • If you'd like to keep this open, just leave any comment, and the stale label will be removed.

stale[bot], Mar 17 '24 08:03

This feature is great! May I ask when this PR can be merged? @wingkitlee0

MissiontoMars, Apr 09 '24 12:04

> This feature is great! May I ask when this PR can be merged? @wingkitlee0

Thanks for the interest. It probably needs a little more work to finalize things... There is no timeline, as I am not actively working on this.

wingkitlee0, Apr 10 '24 21:04