GH-26818: [C++][Python] Preserve order when writing dataset multi-threaded
Rationale for this change
The order of rows in a dataset may be important to users and should be preserved when writing to a filesystem. With multi-threaded writes, the order is currently not guaranteed.
What changes are included in this PR?
Preserving the dataset's row order requires the SourceNode to use ImplicitOrdering (which gives exec batches an index) and the ConsumingSinkNode to sequence exec batches (i.e. emit them in the order of that index).
User-facing changes:
- Add option `preserve_order` to `FileSystemDatasetWriteOptions`
Dev-facing changes:
- Add option `ordering` to `SourceNodeOptions`
- Add option `implicit_ordering` to `ScanNodeOptions`
Default behaviour is current behaviour.
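For illustration, here is a hypothetical C++ sketch of how a plan author could opt into the new dev-facing options. It is based only on the option names listed above (`implicit_ordering` on `ScanNodeOptions`, `ordering` on `SourceNodeOptions`); the exact members and signatures that were merged may differ.

```cpp
// Hypothetical sketch based on the option names in this PR description;
// the exact members/signatures in arrow/dataset and arrow/acero may differ.
#include <arrow/acero/exec_plan.h>
#include <arrow/dataset/api.h>

#include <memory>
#include <utility>

namespace ac = arrow::acero;
namespace ds = arrow::dataset;

ac::Declaration MakeOrderedScan(std::shared_ptr<ds::Dataset> dataset,
                                std::shared_ptr<ds::ScanOptions> scan_options) {
  // Ask the scan node for the dataset's implicit ordering, so that every
  // ExecBatch carries an index which downstream nodes (e.g. the
  // ConsumingSinkNode used by the dataset writer) can use to re-sequence.
  ds::ScanNodeOptions scan_node_options{std::move(dataset), std::move(scan_options)};
  scan_node_options.implicit_ordering = true;  // assumed public member, per the PR description

  // A plain SourceNode would instead declare its ordering through the new
  // SourceNodeOptions::ordering field (e.g. an implicit ordering).
  return ac::Declaration{"scan", std::move(scan_node_options)};
}
```

With batches indexed this way, the writer's sink only has to hold back batches that arrive out of index order; no extra global sort is needed.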
Are these changes tested?
Unit tests have been added.
Are there any user-facing changes?
Users can set `FileSystemDatasetWriteOptions.preserve_order = true` (C++) or `pyarrow.dataset.write_dataset(..., preserve_order=True)` (Python).
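For instance, a minimal C++ usage sketch (filesystem, paths, and file format here are illustrative; everything except `preserve_order` is the pre-existing write-options API):

```cpp
// Illustrative sketch of the new user-facing option; paths and format are
// made up for the example, and preserve_order is the field added by this PR.
#include <arrow/dataset/api.h>
#include <arrow/filesystem/api.h>

#include <memory>
#include <utility>

namespace ds = arrow::dataset;
namespace fs = arrow::fs;

arrow::Status WriteOrdered(std::shared_ptr<ds::Scanner> scanner,
                           std::shared_ptr<fs::FileSystem> filesystem) {
  ds::FileSystemDatasetWriteOptions write_options;
  write_options.filesystem = std::move(filesystem);
  write_options.base_dir = "/tmp/ordered-out";
  write_options.basename_template = "part-{i}.arrow";
  write_options.partitioning = ds::Partitioning::Default();
  write_options.file_write_options =
      std::make_shared<ds::IpcFileFormat>()->DefaultWriteOptions();
  write_options.preserve_order = true;  // keep the scanner's row order in the written files
  return ds::FileSystemDataset::Write(write_options, std::move(scanner));
}
```

The Python equivalent is simply the `preserve_order=True` keyword mentioned above.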
- GitHub Issue: #26818
:warning: GitHub issue #26818 has been automatically assigned in GitHub to PR creator.
This pull request seems to functionally overlap with this one. Some changes are almost exactly the same: ordering of data is kept in threaded execution by using the batch index. Can you check whether it also fixes your use case?
Since you are fixing dataset write ordering I think this check never fires. It should be moved to InsertBatch.
Also, probably AccumulationQueue, SequencingQueue, and SerialSequencingQueue should be exported for Acero node developers.
@gitmodimo I think that refactoring should be done in a separate PR keeping this PR focused on fixing the issue.
@zanmato1984 this touches a code area related to #44616. Hoping you could take a look when you find time.
Hi @EnricoMi, I can take a look. I have only taken a first glance, but do you think the PR description could be updated accordingly?
Also, you might want to update the PR description to reflect its latest purpose.
I think the PR description is up-to-date, do you see any discrepancy?
@zanmato1984 @gitmodimo I have addressed your review comments
Let's wait a while for @rok before I merge this. Thanks.
@github-actions crossbow submit -g cpp -g python
Revision: dfd958dfe360cbc6c7ed53bb690fababaee74b33
Submitted crossbow builds: ursacomputing/crossbow @ actions-58bcfa66f9
Thanks for doing this @EnricoMi! This was a long-standing issue.
Thanks everyone for the thorough review!
After merging your PR, Conbench analyzed the 3 benchmarking runs that have been run so far on merge-commit 021d8abea6f1d449039402df4791f6dfd37be9b6.
There were no benchmark performance regressions. 🎉
The full Conbench report has more details. It also includes information about 6 possible false positives for unstable benchmarks that are known to sometimes produce them.
Perhaps it'd be worth adding a benchmark for `preserve_order == true` to ensure there are no future regressions. Or are the nodes this uses already covered by existing benchmarks?
I'll look into this!