arrow icon indicating copy to clipboard operation
arrow copied to clipboard

GH-43130: [C++][ArrowFlight] Crash due to UCS thread mode

Open amirgon opened this issue 1 year ago • 9 comments

When mode is UCS_THREAD_MODE_SERIALIZED, UCX crash due to mpool corruption.
This happens when buffer is deallocated on a different thread. In such case two threads access UCX memory pool simultaneously.

See discussion on UCX forum: https://github.com/openucx/ucx/discussions/9987

  • GitHub Issue: #43130

amirgon avatar Jul 02 '24 15:07 amirgon

Thanks! However, this code is scheduled for deprecation/removal soon in favor of the Disassociated IPC proposal

lidavidm avatar Jul 02 '24 22:07 lidavidm

This is not a MINOR change. See https://github.com/apache/arrow/blob/main/CONTRIBUTING.md#Minor-Fixes for our MINOR definition. If we need this change, could you open a new issue for this?

kou avatar Jul 02 '24 23:07 kou

Thanks! However, this code is scheduled for deprecation/removal soon in favor of the Disassociated IPC proposal

@lidavidm - Could you please refer me to the "Disassociated IPC" proposal?
Does it mean UCX support is planned to be removed? What is the timeline for this?

If we need this change, could you open a new issue for this?

@kou - Sure, opened https://github.com/apache/arrow/issues/43130

amirgon avatar Jul 03 '24 08:07 amirgon

https://arrow.apache.org/docs/dev/format/DissociatedIPC.html

@zeroshade what was the timeline?

lidavidm avatar Jul 03 '24 08:07 lidavidm

https://arrow.apache.org/docs/dev/format/DissociatedIPC.html

Looks like the reference implementation for Disassociated IPC might suffer from the same issue:

https://github.com/apache/arrow-experiments/blob/188c4e5ff4bda08319d4520e380d736c36b9ee48/dissociated-ipc/ucx_client.cc#L55-L60

arrow::Result<std::unique_ptr<utils::Connection>> UcxClient::CreateConn() {
  ucp_worker_params_t worker_params;
  std::memset(&worker_params, 0, sizeof(worker_params));
  worker_params.field_mask =
      UCP_WORKER_PARAM_FIELD_THREAD_MODE | UCP_WORKER_PARAM_FIELD_FLAGS;
  worker_params.thread_mode = UCS_THREAD_MODE_SERIALIZED;

amirgon avatar Jul 03 '24 08:07 amirgon

It was partially based on this experiment so I'm not surprised.

I think for now I'm OK merging this, but please be aware it will go away in the near future.

lidavidm avatar Jul 03 '24 09:07 lidavidm

I think for now I'm OK merging this, but please be aware it will go away in the near future.

Sure, thanks. Consider fixing it also on arrow-experiments.

amirgon avatar Jul 03 '24 09:07 amirgon

As per the deprecation, see this email: https://lists.apache.org/thread/g89x2y6pvlq6gyf0d1jnxfl2onsrkyt8 I plan to work on a PR removing support soon.

raulcd avatar Jul 03 '24 09:07 raulcd

:warning: GitHub issue #43130 has been automatically assigned in GitHub to PR creator.

github-actions[bot] avatar Jul 03 '24 09:07 github-actions[bot]

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 5a28e180f0a078d0b1a970310d5c03ea6a351ffe.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 28 possible false positives for unstable benchmarks that are known to sometimes produce them.