WIP: shuffle operation optimization uses smaller dtype for building i…
Why are these changes needed?
The problem is limited to the sort and random_shuffle functions, both of which build a numpy array of row indices to reorder the rows in a block. Since numpy builds these index arrays with a 64-bit integer dtype by default, the memory overhead during index creation becomes significant when each row is small (i.e. the block has few columns).
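As a rough sketch of the idea (my illustration, not the exact code in this PR; the helper name is hypothetical), the permutation's dtype can be chosen from the row count instead of defaulting to int64:

```python
import numpy as np

def make_permutation(num_rows: int, seed: int | None = None) -> np.ndarray:
    """Hypothetical helper: build a random permutation of row indices using
    the smallest unsigned integer dtype that can hold num_rows - 1."""
    # e.g. uint16 for 50k rows, uint32 for 5M rows, instead of int64 --
    # cutting the index array's memory by 4x or 2x.
    dtype = np.min_scalar_type(max(num_rows - 1, 0))
    indices = np.arange(num_rows, dtype=dtype)
    np.random.default_rng(seed).shuffle(indices)
    return indices
```

To make the overhead concrete: for a block with a single int64 column, an int64 index array roughly doubles peak memory during the reorder, while a uint32 index cuts that overhead in half.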
Related issue number
Closes #42146
Checks
- [x] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [x] I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - [x] Unit tests
  - [ ] Evaluate and benchmark the two approaches
This is a WIP: I have not yet implemented the second approach (in-memory shuffling), nor benchmarked the two approaches against the original baseline. I'm waiting for feedback on #42146. A sketch of what the in-memory approach could look like follows.
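For illustration only, and assuming the block is NumPy-backed, an in-memory shuffle could swap rows in place so that no index array is allocated at all; this is my reading of the idea, not code from this PR:

```python
import numpy as np

def shuffle_rows_in_place(block: np.ndarray, seed: int | None = None) -> None:
    """Illustrative sketch: shuffle the rows of a NumPy-backed block in place.

    Generator.shuffle performs an in-place Fisher-Yates shuffle along
    axis 0, so the extra memory is O(1) rather than the O(num_rows)
    index array used by the take-based path."""
    np.random.default_rng(seed).shuffle(block, axis=0)
```

This would only apply to blocks that support in-place mutation; immutable (e.g. Arrow-backed) blocks would presumably still need the index-based path.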