ray icon indicating copy to clipboard operation
ray copied to clipboard

Revert "Revert "[Core] Support Arrow zerocopy serialization in object…

Open Deegue opened this issue 1 year ago • 3 comments

… store (#35110)" (#36000)"

This reverts commit 822904b312423010682fa92f4085c05e08819337.

Why are these changes needed?

Last PR was reverted since test_advanced_9 failed. However, I tested test_advanced_9 on local device and it passed.

image

This assert failure looks unrelated to this PR.

https://github.com/ray-project/ray/blob/0b190ee1160eeca9796bc091e07eaebf4c85b511/python/ray/tests/test_advanced_9.py#L426-L430

Related issue number

Checks

  • [ ] I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • [ ] I've run scripts/format.sh to lint the changes in this PR.
  • [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
    • [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.
  • [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • [ ] Unit tests
    • [ ] Release tests
    • [ ] This PR is not tested :(

Deegue avatar Jun 07 '23 08:06 Deegue

It's weird test_advanced_9 of (Medium A-J) failed in CI but worked well on my local device..

Deegue avatar Jun 07 '23 10:06 Deegue

Gentle ping @ericl @rkooo567 for help, thanks~

Deegue avatar Jun 12 '23 02:06 Deegue

Not really sure, but perhaps try disabling parts of your PR to see what is causing the issue? It may be worth adding more debug logs for where OMP_NUM_THREADS is being set as well.

ericl avatar Jun 12 '23 17:06 ericl

Could we also tag other reviews who have context on the original PR for review? @Deegue

And let me know if there's anything I can help or I don't think I am the best person to approve it.

rickyyx avatar Jul 11 '23 17:07 rickyyx

can you let me know what were the major changes from the PR that was merged?

rkooo567 avatar Jul 11 '23 23:07 rkooo567

Could we also tag other reviews who have context on the original PR for review? @Deegue

And let me know if there's anything I can help or I don't think I am the best person to approve it.

Ok, here is the original PR https://github.com/ray-project/ray/pull/35110 and gentle ping @ericl for review since the previous failed case was passed.

Thanks again @rickyyx , could you help to find the fail reason of Arrow 6? I'm not sure if it's because of import pyarrow since there is no stacktrace.

Deegue avatar Jul 12 '23 06:07 Deegue

can you let me know what were the major changes from the PR that was merged?

Thanks for comment, the major change is using arrow serialization for object store read and write, which is zero-copy and brings better performance.

Deegue avatar Jul 12 '23 06:07 Deegue

I got this stacktrace from one of the workers in the failed test for arrow 6

*** SIGSEGV received at time=1689148296 on cpu 1 ***
PC: @     0x7fdda5605b0c  (unknown)  arrow::(anonymous namespace)::NullArrayFactory::CreateChild()
    @     0x7fdde91d81d5        208  absl::lts_20220623::WriteFailureInfo()
    @     0x7fdde91d7f18         64  absl::lts_20220623::AbslFailureSignalHandler()
    @     0x7fddea56c420  (unknown)  (unknown)
    @     0x7fddea2a45f3        160  (unknown)
    @     0x7fdda56064a7        208  arrow::VisitTypeInline<>()
    @ ... and at least 1 more frames
[2023-07-12 07:51:36,709 E 19106 19106] logging.cc:361: *** SIGSEGV received at time=1689148296 on cpu 1 ***
[2023-07-12 07:51:36,709 E 19106 19106] logging.cc:361: PC: @     0x7fdda5605b0c  (unknown)  arrow::(anonymous namespace)::NullArrayFactory::CreateChild()
[2023-07-12 07:51:36,709 E 19106 19106] logging.cc:361:     @     0x7fdde91d81d5        208  absl::lts_20220623::WriteFailureInfo()
[2023-07-12 07:51:36,710 E 19106 19106] logging.cc:361:     @     0x7fdde91d7f31         64  absl::lts_20220623::AbslFailureSignalHandler()
[2023-07-12 07:51:36,710 E 19106 19106] logging.cc:361:     @     0x7fddea56c420  (unknown)  (unknown)
[2023-07-12 07:51:36,711 E 19106 19106] logging.cc:361:     @     0x7fddea2a45f3        160  (unknown)
[2023-07-12 07:51:36,711 E 19106 19106] logging.cc:361:     @     0x7fdda56064a7        208  arrow::VisitTypeInline<>()
[2023-07-12 07:51:36,711 E 19106 19106] logging.cc:361:     @ ... and at least 1 more frames
Fatal Python error: Segmentation fault

Stack (most recent call first):
  File "/opt/miniconda/lib/python3.8/site-packages/pyarrow/compute.py", line 625 in take
  File "/ray/python/ray/data/_internal/arrow_ops/transform_pyarrow.py", line 41 in take_table
  File "/ray/python/ray/data/_internal/arrow_block.py", line 320 in take
  File "/ray/python/ray/data/_internal/arrow_block.py", line 220 in random_shuffle
  File "/ray/python/ray/data/_internal/shuffle_and_partition.py", line 88 in reduce
  File "/ray/python/ray/_private/worker.py", line 777 in main_loop
  File "/ray/python/ray/_private/workers/default_worker.py", line 262 in <module>


rickyyx avatar Jul 12 '23 17:07 rickyyx

@rickyyx I'm helping to fix the CI. The arrow related tests are passed. Other failures seems unrelated?

kira-lin avatar Jul 26 '23 06:07 kira-lin

Hi @ericl @rkooo567 , please review again, thanks

kira-lin avatar Jul 31 '23 05:07 kira-lin

Core tests look good now. Can you rebase with master to rerun the Java tests? It looks like

java.lang.NoClassDefFoundError: com/fasterxml/jackson/core/util/JacksonFeature

is a potential dependency issue.

ericl avatar Jul 31 '23 22:07 ericl

hi @jovany-wang , Do you have any idea about this NoClassDefFoundError? We only add a few Arrow dependencies, I don't see how this error is related to this PR. Thanks for your help

kira-lin avatar Aug 11 '23 03:08 kira-lin

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

  • If you'd like to keep this open, just leave any comment, and the stale label will be removed.

stale[bot] avatar Sep 16 '23 22:09 stale[bot]

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

  • If you'd like to keep this open, just leave any comment, and the stale label will be removed.

stale[bot] avatar Dec 15 '23 07:12 stale[bot]