arrow icon indicating copy to clipboard operation
arrow copied to clipboard

GH-26818: [C++][Python] Preserve order when writing dataset multi-threaded

Open EnricoMi opened this issue 1 year ago • 1 comments
trafficstars

Rationale for this change

The order of rows in a dataset might be important for users and should be preserved when writing to a filesystem. With multi-threaded write, the order is currently not guaranteed,

What changes are included in this PR?

Preserving the dataset order of rows requires the SourceNode to use ImplicitOrdering (this gives exec batches an index), and the ConsumingSinkNode to sequence exec batches (preserve order of batches by their index).

User-facing changes:

  • Add option preserve_order to FileSystemDatasetWriteOptions

Dev-facing changes:

  • Add option ordering to SourceNodeOptions
  • Add option implicit_ordering to ScanNodeOptions

Default behaviour is current behaviour.

Are these changes tested?

Unit tests have been added,

Are there any user-facing changes?

Users can set FileSystemDatasetWriteOptions.preserve_order = true (C++) / arrow.dataset.write_dataset(..., preserve_order=True) (Python).

  • GitHub Issue: #26818

EnricoMi avatar Oct 18 '24 11:10 EnricoMi

:warning: GitHub issue #26818 has been automatically assigned in GitHub to PR creator.

github-actions[bot] avatar Oct 18 '24 11:10 github-actions[bot]

This pull request seems to functionally overlap with this one. Some changes are almost exactly the same. Ordering of data is kept in threaded execution with use of batch index. Can you check whether it fixes your use case also?

gitmodimo avatar Oct 29 '24 20:10 gitmodimo

Since you are fixing dataset write ordering I think this check never fires. It should be moved to InsertBatch. Also probaly AccumulationQueue, SequencingQueue and SerialSequencingQueue should be exported for acero nodes developers.

gitmodimo avatar Oct 31 '24 16:10 gitmodimo

@gitmodimo I think that refactoring should be done in a separate PR keeping this PR focused on fixing the issue.

EnricoMi avatar Dec 04 '24 08:12 EnricoMi

@zanmato1984 this touches related code area as #44616. Hoping you could take a look when you find time.

EnricoMi avatar Feb 11 '25 08:02 EnricoMi

Hi @EnricoMi , I can take a look. Just first glance but do you think the PR description could be updated accordingly?

zanmato1984 avatar Feb 11 '25 11:02 zanmato1984

:warning: GitHub issue #26818 has been automatically assigned in GitHub to PR creator.

github-actions[bot] avatar Feb 11 '25 14:02 github-actions[bot]

Also, you might want to update the PR description to reflect its latest purpose.

zanmato1984 avatar Feb 19 '25 09:02 zanmato1984

Also, you might want to update the PR description to reflect its latest purpose.

I think the PR description is up-to-date, do you see any discrepancy?

EnricoMi avatar Mar 07 '25 10:03 EnricoMi

@zanmato1984 @gitmodimo I have addressed your review comments

EnricoMi avatar May 05 '25 04:05 EnricoMi

Let's wait for @rok a while before I can merge this. Thanks.

zanmato1984 avatar May 13 '25 21:05 zanmato1984

@github-actions crossbow submit -g cpp -g python

zanmato1984 avatar May 13 '25 21:05 zanmato1984

Revision: dfd958dfe360cbc6c7ed53bb690fababaee74b33

Submitted crossbow builds: ursacomputing/crossbow @ actions-58bcfa66f9

Task Status
example-cpp-minimal-build-static GitHub Actions
example-cpp-minimal-build-static-system-dependency GitHub Actions
example-cpp-tutorial GitHub Actions
example-python-minimal-build-fedora-conda GitHub Actions
example-python-minimal-build-ubuntu-venv GitHub Actions
test-alpine-linux-cpp GitHub Actions
test-build-cpp-fuzz GitHub Actions
test-conda-cpp GitHub Actions
test-conda-cpp-meson GitHub Actions
test-conda-cpp-valgrind GitHub Actions
test-conda-python-3.10 GitHub Actions
test-conda-python-3.10-hdfs-2.9.2 GitHub Actions
test-conda-python-3.10-hdfs-3.2.1 GitHub Actions
test-conda-python-3.10-pandas-latest-numpy-latest GitHub Actions
test-conda-python-3.11 GitHub Actions
test-conda-python-3.11-dask-latest GitHub Actions
test-conda-python-3.11-dask-upstream_devel GitHub Actions
test-conda-python-3.11-hypothesis GitHub Actions
test-conda-python-3.11-pandas-latest-numpy-1.26 GitHub Actions
test-conda-python-3.11-pandas-latest-numpy-latest GitHub Actions
test-conda-python-3.11-pandas-nightly-numpy-nightly GitHub Actions
test-conda-python-3.11-pandas-upstream_devel-numpy-nightly GitHub Actions
test-conda-python-3.11-spark-master GitHub Actions
test-conda-python-3.12 GitHub Actions
test-conda-python-3.12-cpython-debug GitHub Actions
test-conda-python-3.13 GitHub Actions
test-conda-python-3.9 GitHub Actions
test-conda-python-3.9-pandas-1.1.3-numpy-1.19.5 GitHub Actions
test-conda-python-emscripten GitHub Actions
test-cuda-cpp-ubuntu-22.04-cuda-11.7.1 GitHub Actions
test-cuda-python-ubuntu-22.04-cuda-11.7.1 GitHub Actions
test-debian-12-cpp-amd64 GitHub Actions
test-debian-12-cpp-i386 GitHub Actions
test-debian-12-python-3-amd64 GitHub Actions
test-debian-12-python-3-i386 GitHub Actions
test-fedora-39-cpp GitHub Actions
test-fedora-39-python-3 GitHub Actions
test-ubuntu-22.04-cpp GitHub Actions
test-ubuntu-22.04-cpp-20 GitHub Actions
test-ubuntu-22.04-cpp-bundled GitHub Actions
test-ubuntu-22.04-cpp-emscripten GitHub Actions
test-ubuntu-22.04-cpp-no-threading GitHub Actions
test-ubuntu-22.04-python-3 GitHub Actions
test-ubuntu-22.04-python-313-freethreading GitHub Actions
test-ubuntu-24.04-cpp GitHub Actions
test-ubuntu-24.04-cpp-bundled-offline GitHub Actions
test-ubuntu-24.04-cpp-gcc-13-bundled GitHub Actions
test-ubuntu-24.04-cpp-gcc-14 GitHub Actions
test-ubuntu-24.04-cpp-minimal-with-formats GitHub Actions
test-ubuntu-24.04-cpp-thread-sanitizer GitHub Actions
test-ubuntu-24.04-python-3 GitHub Actions

github-actions[bot] avatar May 13 '25 21:05 github-actions[bot]

Thanks for doing this @EnricoMi ! This was a long standing issue.

rok avatar May 14 '25 07:05 rok

Thanks everyone for the thorough review!

EnricoMi avatar May 14 '25 07:05 EnricoMi

After merging your PR, Conbench analyzed the 3 benchmarking runs that have been run so far on merge-commit 021d8abea6f1d449039402df4791f6dfd37be9b6.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 6 possible false positives for unstable benchmarks that are known to sometimes produce them.

Perhaps it'd be worth it to add a benchmark for preserve_order == true to ensure there's no future regressions. Or would the nodes this uses already be benchmarked?

rok avatar May 14 '25 15:05 rok

I'll look into this!

EnricoMi avatar May 15 '25 06:05 EnricoMi