GH-26818: [C++][Python] Preserve order when writing dataset multi-threaded
Rationale for this change
The order of rows in a dataset may be important to users and should be preserved when writing to a filesystem. With multi-threaded writes, the order is currently not guaranteed.
What changes are included in this PR?
Preserving the dataset's row order requires the SourceNode to use ImplicitOrdering (which gives exec batches an index) and the ConsumingSinkNode to sequence exec batches (i.e. emit them in the order of that index).
User-facing changes:
- Add option `preserve_order` to `FileSystemDatasetWriteOptions`
Dev-facing changes:
- Add option `ordering` to `SourceNodeOptions`
- Add option `implicit_ordering` to `ScanNodeOptions`
Default behaviour is current behaviour.
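For illustration, here is a hypothetical C++ sketch of how a plan author could opt into the new dev-facing options. It is based only on the option names listed above (`implicit_ordering` on `ScanNodeOptions`, `ordering` on `SourceNodeOptions`); the exact members and signatures that were merged may differ.

```cpp
// Hypothetical sketch based on the option names in this PR description;
// the exact members/signatures in arrow/dataset and arrow/acero may differ.
#include <arrow/acero/exec_plan.h>
#include <arrow/dataset/api.h>

#include <memory>
#include <utility>

namespace ac = arrow::acero;
namespace ds = arrow::dataset;

ac::Declaration MakeOrderedScan(std::shared_ptr<ds::Dataset> dataset,
                                std::shared_ptr<ds::ScanOptions> scan_options) {
  // Ask the scan node for the dataset's implicit ordering, so that every
  // ExecBatch carries an index which downstream nodes (e.g. the
  // ConsumingSinkNode used by the dataset writer) can use to re-sequence.
  ds::ScanNodeOptions scan_node_options{std::move(dataset), std::move(scan_options)};
  scan_node_options.implicit_ordering = true;  // assumed public member, per the PR description

  // A plain SourceNode would instead declare its ordering through the new
  // SourceNodeOptions::ordering field (e.g. an implicit ordering).
  return ac::Declaration{"scan", std::move(scan_node_options)};
}
```

With batches indexed this way, the writer's sink only has to hold back batches that arrive out of index order; no extra global sort is needed.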
Are these changes tested?
Unit tests have been added.
Are there any user-facing changes?
Users can set `FileSystemDatasetWriteOptions.preserve_order = true` (C++) or `pyarrow.dataset.write_dataset(..., preserve_order=True)` (Python).
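For instance, a minimal C++ usage sketch (filesystem, paths, and file format here are illustrative; everything except `preserve_order` is the pre-existing write-options API):

```cpp
// Illustrative sketch of the new user-facing option; paths and format are
// made up for the example, and preserve_order is the field added by this PR.
#include <arrow/dataset/api.h>
#include <arrow/filesystem/api.h>

#include <memory>
#include <utility>

namespace ds = arrow::dataset;
namespace fs = arrow::fs;

arrow::Status WriteOrdered(std::shared_ptr<ds::Scanner> scanner,
                           std::shared_ptr<fs::FileSystem> filesystem) {
  ds::FileSystemDatasetWriteOptions write_options;
  write_options.filesystem = std::move(filesystem);
  write_options.base_dir = "/tmp/ordered-out";
  write_options.basename_template = "part-{i}.arrow";
  write_options.partitioning = ds::Partitioning::Default();
  write_options.file_write_options =
      std::make_shared<ds::IpcFileFormat>()->DefaultWriteOptions();
  write_options.preserve_order = true;  // keep the scanner's row order in the written files
  return ds::FileSystemDataset::Write(write_options, std::move(scanner));
}
```

The Python equivalent is simply the `preserve_order=True` keyword mentioned above.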
- GitHub Issue: #26818
:warning: GitHub issue #26818 has been automatically assigned in GitHub to PR creator.
This pull request seems to functionally overlap with this one. Some changes are almost exactly the same: ordering of data is kept in threaded execution by using the batch index. Can you check whether it also fixes your use case?
Since you are fixing dataset write ordering I think this check never fires. It should be moved to InsertBatch.
Also, probably AccumulationQueue, SequencingQueue, and SerialSequencingQueue should be exported for Acero node developers.
@gitmodimo I think that refactoring should be done in a separate PR keeping this PR focused on fixing the issue.
@zanmato1984 this touches a code area related to #44616. Hoping you could take a look when you find time.
Hi @EnricoMi, I can take a look. I have only taken a first glance, but do you think the PR description could be updated accordingly?
Also, you might want to update the PR description to reflect its latest purpose.
I think the PR description is up-to-date, do you see any discrepancy?
@zanmato1984 @gitmodimo I have addressed your review comments
Let's wait a while for @rok before I merge this. Thanks.
@github-actions crossbow submit -g cpp -g python
Revision: dfd958dfe360cbc6c7ed53bb690fababaee74b33
Submitted crossbow builds: ursacomputing/crossbow @ actions-58bcfa66f9
Thanks for doing this @EnricoMi! This was a long-standing issue.
Thanks everyone for the thorough review!
After merging your PR, Conbench analyzed the 3 benchmarking runs that have been run so far on merge-commit 021d8abea6f1d449039402df4791f6dfd37be9b6.
There were no benchmark performance regressions. 🎉
The full Conbench report has more details. It also includes information about 6 possible false positives for unstable benchmarks that are known to sometimes produce them.
Perhaps it'd be worth adding a benchmark for `preserve_order == true` to ensure there are no future regressions. Or are the nodes this uses already covered by existing benchmarks?
I'll look into this!