ray
ray copied to clipboard
Revert "Revert "[Core] Support Arrow zerocopy serialization in object…
… store (#35110)" (#36000)"
This reverts commit 822904b312423010682fa92f4085c05e08819337.
Why are these changes needed?
Last PR was reverted since test_advanced_9
failed. However, I tested test_advanced_9
on local device and it passed.
This assert failure looks unrelated to this PR.
https://github.com/ray-project/ray/blob/0b190ee1160eeca9796bc091e07eaebf4c85b511/python/ray/tests/test_advanced_9.py#L426-L430
Related issue number
Checks
- [ ] I've signed off every commit(by using the -s flag, i.e.,
git commit -s
) in this PR. - [ ] I've run
scripts/format.sh
to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a
method in Tune, I've added it in
doc/source/tune/api/
under the corresponding.rst
file.
- [ ] I've added any new APIs to the API Reference. For example, if I added a
method in Tune, I've added it in
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
- [ ] Unit tests
- [ ] Release tests
- [ ] This PR is not tested :(
It's weird test_advanced_9
of (Medium A-J) failed in CI but worked well on my local device..
Gentle ping @ericl @rkooo567 for help, thanks~
Not really sure, but perhaps try disabling parts of your PR to see what is causing the issue? It may be worth adding more debug logs for where OMP_NUM_THREADS is being set as well.
Could we also tag other reviews who have context on the original PR for review? @Deegue
And let me know if there's anything I can help or I don't think I am the best person to approve it.
can you let me know what were the major changes from the PR that was merged?
Could we also tag other reviews who have context on the original PR for review? @Deegue
And let me know if there's anything I can help or I don't think I am the best person to approve it.
Ok, here is the original PR https://github.com/ray-project/ray/pull/35110 and gentle ping @ericl for review since the previous failed case was passed.
Thanks again @rickyyx , could you help to find the fail reason of Arrow 6? I'm not sure if it's because of import pyarrow
since there is no stacktrace.
can you let me know what were the major changes from the PR that was merged?
Thanks for comment, the major change is using arrow serialization for object store read and write, which is zero-copy and brings better performance.
I got this stacktrace from one of the workers in the failed test for arrow 6
*** SIGSEGV received at time=1689148296 on cpu 1 ***
PC: @ 0x7fdda5605b0c (unknown) arrow::(anonymous namespace)::NullArrayFactory::CreateChild()
@ 0x7fdde91d81d5 208 absl::lts_20220623::WriteFailureInfo()
@ 0x7fdde91d7f18 64 absl::lts_20220623::AbslFailureSignalHandler()
@ 0x7fddea56c420 (unknown) (unknown)
@ 0x7fddea2a45f3 160 (unknown)
@ 0x7fdda56064a7 208 arrow::VisitTypeInline<>()
@ ... and at least 1 more frames
[2023-07-12 07:51:36,709 E 19106 19106] logging.cc:361: *** SIGSEGV received at time=1689148296 on cpu 1 ***
[2023-07-12 07:51:36,709 E 19106 19106] logging.cc:361: PC: @ 0x7fdda5605b0c (unknown) arrow::(anonymous namespace)::NullArrayFactory::CreateChild()
[2023-07-12 07:51:36,709 E 19106 19106] logging.cc:361: @ 0x7fdde91d81d5 208 absl::lts_20220623::WriteFailureInfo()
[2023-07-12 07:51:36,710 E 19106 19106] logging.cc:361: @ 0x7fdde91d7f31 64 absl::lts_20220623::AbslFailureSignalHandler()
[2023-07-12 07:51:36,710 E 19106 19106] logging.cc:361: @ 0x7fddea56c420 (unknown) (unknown)
[2023-07-12 07:51:36,711 E 19106 19106] logging.cc:361: @ 0x7fddea2a45f3 160 (unknown)
[2023-07-12 07:51:36,711 E 19106 19106] logging.cc:361: @ 0x7fdda56064a7 208 arrow::VisitTypeInline<>()
[2023-07-12 07:51:36,711 E 19106 19106] logging.cc:361: @ ... and at least 1 more frames
Fatal Python error: Segmentation fault
Stack (most recent call first):
File "/opt/miniconda/lib/python3.8/site-packages/pyarrow/compute.py", line 625 in take
File "/ray/python/ray/data/_internal/arrow_ops/transform_pyarrow.py", line 41 in take_table
File "/ray/python/ray/data/_internal/arrow_block.py", line 320 in take
File "/ray/python/ray/data/_internal/arrow_block.py", line 220 in random_shuffle
File "/ray/python/ray/data/_internal/shuffle_and_partition.py", line 88 in reduce
File "/ray/python/ray/_private/worker.py", line 777 in main_loop
File "/ray/python/ray/_private/workers/default_worker.py", line 262 in <module>
@rickyyx I'm helping to fix the CI. The arrow related tests are passed. Other failures seems unrelated?
Hi @ericl @rkooo567 , please review again, thanks
Core tests look good now. Can you rebase with master to rerun the Java tests? It looks like
java.lang.NoClassDefFoundError: com/fasterxml/jackson/core/util/JacksonFeature
is a potential dependency issue.
hi @jovany-wang , Do you have any idea about this NoClassDefFoundError? We only add a few Arrow dependencies, I don't see how this error is related to this PR. Thanks for your help
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.
- If you'd like to keep this open, just leave any comment, and the stale label will be removed.
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.
- If you'd like to keep this open, just leave any comment, and the stale label will be removed.