ray icon indicating copy to clipboard operation
ray copied to clipboard

[data] optimize parquet datasource split tasks algorithm

Open Jay-ju opened this issue 1 year ago • 5 comments

Why are these changes needed?

Using np.split to split a list in NumPy is not as performant as not using split_list, and the performance difference can be significant.

int_array = range(1000003)
start = time.perf_counter()

for split in np.array_split(int_array, 100):
    len(split)

end1 = time.perf_counter()
print(f"np array_split 100w elements to 100 , cost {(end1 - start)*1000} ms")

for split in _split_list(int_array, 100):
    len(split)
end2 = time.perf_counter()
print(f"_split_list 100w elements to 100, {(end2 - end1)*1000} ms")

result is: image

Related issue number

Checks

  • [ ] I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • [x] I've run scripts/format.sh to lint the changes in this PR.
  • [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
    • [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.
  • [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • [x] Unit tests
    • [ ] Release tests
    • [ ] This PR is not tested :(

Jay-ju avatar Oct 09 '24 01:10 Jay-ju

thanks for your contribution. Could you elaborate on why np.array_split would produce uneven and non-deterministic results? I tried the following script, the splits are even.

import numpy as np

arr = list(range(10000))
for split in np.array_split(arr, 100):
    print(len(split))

raulchen avatar Oct 10 '24 23:10 raulchen

thanks for your contribution. Could you elaborate on why np.array_split would produce uneven and non-deterministic results? I tried the following script, the splits are even.

import numpy as np

arr = list(range(10000))
for split in np.array_split(arr, 100):
    print(len(split))

sorry, description is confused, I have provided a benchmark here.

Jay-ju avatar Oct 11 '24 02:10 Jay-ju

do you know why np.array_split is slow? Also, I'm curious about your use cases. This operation is done only once upon job start-up. 10s of ms latency increase doesn't sound like the big deal. Why would it matter in your case? Asking because _split_list is no longer used else where. We should consider removing it to reduce maintenance overhead, unless there is a strong reason.

raulchen avatar Oct 16 '24 00:10 raulchen

do you know why np.array_split is slow? Also, I'm curious about your use cases. This operation is done only once upon job start-up. 10s of ms latency increase doesn't sound like the big deal. Why would it matter in your case? Asking because _split_list is no longer used else where. We should consider removing it to reduce maintenance overhead, unless there is a strong reason.

do you know why np.array_split is slow? Also, I'm curious about your use cases. This operation is done only once upon job start-up. 10s of ms latency increase doesn't sound like the big deal. Why would it matter in your case? Asking because _split_list is no longer used else where. We should consider removing it to reduce maintenance overhead, unless there is a strong reason.

It was only discovered during single-machine testing.

Jay-ju avatar Oct 16 '24 05:10 Jay-ju

then maybe let's not introduce this change and just remove _split_list

raulchen avatar Oct 18 '24 01:10 raulchen

then maybe let's not introduce this change and just remove _split_list

done

Jay-ju avatar Dec 02 '24 05:12 Jay-ju