[data] optimize parquet datasource split tasks algorithm
## Why are these changes needed?
Using `np.array_split` to split a Python list is not as performant as the existing `_split_list` helper, and the performance difference can be significant:
```python
import time

import numpy as np

# _split_list is the pure-Python helper in Ray Data being compared against.
int_array = range(1000003)

start = time.perf_counter()
for split in np.array_split(int_array, 100):
    len(split)
end1 = time.perf_counter()
print(f"np.array_split 1M elements into 100 splits, cost {(end1 - start) * 1000} ms")

for split in _split_list(int_array, 100):
    len(split)
end2 = time.perf_counter()
print(f"_split_list 1M elements into 100 splits, cost {(end2 - end1) * 1000} ms")
```
The result is:
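For reference, the even-split behavior being discussed can be sketched in pure Python as follows. This is an illustrative reimplementation under my own naming, not Ray's actual `_split_list` code; the point is that it slices the list directly instead of first copying it into an ndarray:

```python
def split_list(arr, num_splits):
    """Split ``arr`` into ``num_splits`` contiguous chunks whose lengths
    differ by at most one (illustrative sketch, not Ray's exact code)."""
    size, remainder = divmod(len(arr), num_splits)
    splits = []
    start = 0
    for i in range(num_splits):
        # The first ``remainder`` chunks get one extra element.
        end = start + size + (1 if i < remainder else 0)
        splits.append(arr[start:end])
        start = end
    return splits
```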
## Related issue number
## Checks
- [ ] I've signed off every commit (by using the `-s` flag, i.e., `git commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
- [x] Unit tests
- [ ] Release tests
- [ ] This PR is not tested :(
Thanks for your contribution. Could you elaborate on why `np.array_split` would produce uneven and non-deterministic results? I tried the following script, and the splits are even:
```python
import numpy as np

arr = list(range(10000))
for split in np.array_split(arr, 100):
    print(len(split))
```
Sorry, the description was confusing. I have provided a benchmark here.
Do you know why `np.array_split` is slow? Also, I'm curious about your use case: this operation is done only once at job start-up, so a latency increase of tens of milliseconds doesn't sound like a big deal. Why would it matter in your case? Asking because `_split_list` is no longer used elsewhere; we should consider removing it to reduce maintenance overhead, unless there is a strong reason to keep it.
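For context, my understanding (an assumption on my part, not verified against NumPy internals) is that `np.array_split` is slow on large Python lists because it first materializes the input as an ndarray, which copies every element, whereas a pure-Python splitter only takes slices. A minimal illustration:

```python
import numpy as np

arr = list(range(10))
splits = np.array_split(arr, 3)

# The input list was converted to a NumPy array before splitting,
# so each split is an ndarray rather than a list slice.
print(type(splits[0]))  # <class 'numpy.ndarray'>
```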
It was only discovered during single-machine testing.
Then maybe let's not introduce this change and just remove `_split_list`.
Done.