[data] Fix np.array memory allocation crash when source includes short an…
Issue: https://github.com/ray-project/ray/issues/46293
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 414. GiB for an array with shape (4900,) and data type <U22697406
type(udf_return_col)=<class 'list'> len(udf_return_col)=4900
type(udf_return_col[0])=<class 'str'> len(udf_return_col[0])=2576
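For context, the size in the error message follows from how NumPy stores strings when it infers a fixed-width unicode dtype: every element is padded to the longest string, at 4 bytes per character. The sketch below reproduces that arithmetic from the numbers in the traceback and compares it with an object array on a small input. The helper name `fixed_width_nbytes` and the sample values are illustrative assumptions, not code from this PR.

```python
import numpy as np


def fixed_width_nbytes(num_items: int, max_chars: int) -> int:
    """Bytes needed for a '<U{max_chars}' array (hypothetical helper).

    NumPy stores fixed-width unicode as UCS-4, i.e. 4 bytes per character,
    and pads every element to the width of the longest one.
    """
    return num_items * max_chars * 4


# Numbers taken from the error message: shape (4900,), dtype <U22697406.
needed = fixed_width_nbytes(4900, 22697406)
needed_gib = needed / 2**30
print(f"{needed_gib:.1f} GiB")

# Small-scale illustration: one long string inflates every slot.
values = ["Hello"] * 100 + ["x" * 1000]
as_str = np.array(values)                # inferred dtype is <U1000
as_obj = np.array(values, dtype=object)  # stores references to the str objects

print(as_str.dtype, as_str.nbytes)
print(as_obj.dtype, as_obj.nbytes)
```

With an object dtype the array holds only references, so a single very long string no longer multiplies the allocation by the number of rows.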
Why are these changes needed?
Related issue number
Checks
- [ ] I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- [ ] I've run scripts/format.sh to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
  - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
- [ ] Unit tests
- [ ] Release tests
- [ ] This PR is not tested :(
Time consuming test
>>> import time
>>> import numpy as np
>>> st=time.time(); x=np.array(['Hello'] * 100 + ['\r\ns'*10000000] , dtype=np.dtype('str')); print('use:', time.time() - st)
use: 9.69443154335022
>>> st=time.time(); x=np.array(['Hello'] * 100 + ['\r\ns'*10000000] , dtype=np.dtype('O')); print('use:', time.time() - st)
use: 1.9776394367218018
>>> st=time.time(); x=np.array(['Hello'] * 100 + ['\r\ns'*10000000] , dtype=np.dtype('O')); print('use:', time.time() - st)
use: 0.029134511947631836
>>> st=time.time(); x=np.array(['Hello'] * 100 + ['\r\ns'*10000000] , dtype=np.dtype('O')); print('use:', time.time() - st)
use: 0.005353212356567383
>>> st=time.time(); x=np.array(['Hello'] * 100 + ['\r\ns'*10000000] , dtype=np.dtype('str')); print('use:', time.time() - st)
use: 9.803117036819458
>>> st=time.time(); x=np.array(['Hello'] * 100 + ['\r\ns'*10000000] ); print('use:', time.time() - st)
use: 11.640169858932495
>>> st=time.time(); x=np.array(['Hello'] * 100 + ['\r\ns'*10000000] ); print('use:', time.time() - st)
use: 11.6232750415802
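The same comparison can be run as a standalone script. This is a minimal sketch, assuming a much shorter long string than the session above so it finishes quickly; `time_array_build` is a hypothetical helper, and the absolute timings will differ by machine.

```python
import time

import numpy as np


def time_array_build(values, dtype=None):
    """Build an array from `values` and return (array, elapsed seconds)."""
    start = time.perf_counter()
    arr = np.array(values, dtype=dtype)
    return arr, time.perf_counter() - start


# 100 short strings plus one long one (30,000 characters here; the
# session above used a 30,000,000-character string).
values = ["Hello"] * 100 + ["\r\ns" * 10_000]

results = {}
for label, dtype in [
    ("default", None),
    ("str", np.dtype("str")),
    ("object", np.dtype("O")),
]:
    arr, elapsed = time_array_build(values, dtype)
    results[label] = arr
    print(f"{label:8s} dtype={arr.dtype} nbytes={arr.nbytes} use: {elapsed:.6f}")
```

The default and str cases both produce a fixed-width unicode array sized to the longest element, which is why they behave alike in the timings above, while the object case only stores references.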
Hi @Ox0400 - could you also provide a reproducible script we can test against?
Hi, I'm going to close this PR since it's outdated and unfortunately it's not clear what the end-user issue is.
@richardliaw https://github.com/ray-project/ray/issues/46293