fix: miss output ordering during projection
Which issue does this PR close?
- Related my use case
Rationale for this change
Currently, the base_config.output_ordering is specified by the external mechanism; the users usually decide on the output_ordering. See here: https://github.com/apache/datafusion/blob/main/datafusion/core/src/datasource/listing/table.rs#L273-L277
That is, users should ensure that the output ordering is correct.
Most users may not enable the collect_statistic config, so they will miss output ordering here. Even if the config is enabled, the statistic may be inaccurate, or for some file formats, they don't have statistics. If users write the output_ordering to config, I guess they think/believe their data is sorted, but the check in get_projected_output_ordering will drop the output ordering for them.
What changes are included in this PR?
Still do the check, but only output warning.
Are these changes tested?
Are there any user-facing changes?
I don't understand the failing sqllogictest:
# Check output plan again, expect no "output_ordering" clause in the physical_plan -> ParquetExec,
# due to there being more files than partitions:
query TT
EXPLAIN SELECT int_col, string_col
FROM test_table
ORDER BY string_col, int_col;
----
logical_plan
01)Sort: test_table.string_col ASC NULLS LAST, test_table.int_col ASC NULLS LAST
02)--TableScan: test_table projection=[int_col, string_col]
physical_plan
01)SortPreservingMergeExec: [string_col@1 ASC NULLS LAST, int_col@0 ASC NULLS LAST]
02)--SortExec: expr=[string_col@1 ASC NULLS LAST, int_col@0 ASC NULLS LAST], preserve_partitioning=[true]
03)----DataSourceExec: file_groups={2 groups: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/parquet/test_table/0.parquet, WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/parquet/test_table/1.parquet], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/parquet/test_table/2.parquet]]}, projection=[int_col, string_col], file_type=parquet
Based on current code and doc, my understanding is that the output_ordering, aka. file_sort_order refers to the file order, that is, if we specify the output_ordering, what we can ensure is that the data in a single file is ordered. https://datafusion.apache.org/user-guide/sql/ddl.html#cautions-when-using-the-with-order-clause
So why expect no "output_ordering" clause in the physical_plan -> ParquetExec due to there being more files than partitions? 🤔
That is, users should ensure that the output ordering is correct.
One of the users as of now is ListingTable, which I don't believe makes any such guarantees, so we would have to fix ListingTable to ensure the output_ordering is correct. But even if we did, I would worry about breaking third-party users of FileScanConfig that rely on DataFusion to perform this check.
Based on current code and doc, my understanding is that the
output_ordering, aka.file_sort_orderrefers to the file order, that is, if we specify the output_ordering, what we can ensure is that the data in a single file is ordered. https://datafusion.apache.org/user-guide/sql/ddl.html#cautions-when-using-the-with-order-clauseSo why
expect no "output_ordering" clause in the physical_plan -> ParquetExec due to there being more files than partitions? 🤔
In the past we only added an output_ordering to the plan if each file group had at most 1 file. Otherwise concatenating files isn't guaranteed to preserve order. Later we relaxed this by looking at statistics to see if the files are nonoverlapping and ordered with respect to the sort keys.
I do think the current behavior is a potential pain point, because sometimes users know their files are ordered and non-overlapping, but can't get DataFusion to avoid the sort. However, this has been the behavior in DataFusion for a long time (at least the 2 years I have been using it), so if we were to change it, I would prefer if we do it in a way that makes this behavior change very obvious.
I do think the current behavior is a potential pain point, because sometimes users know their files are ordered and non-overlapping, but can't get DataFusion to avoid the sort. However, this has been the behavior in DataFusion for a long time (at least the 2 years I have been using it), so if we were to change it, I would prefer if we do it in a way that makes this behavior change very obvious.
How about adding a new config, such as check_file_group_ordering, and make it true as default? In this way, we won't change the current behavior. But I'm concerned that if users only wanna one of the tables to disable the check, they need to implement their own table configuration or something else to avoid the config influencing the global. Also cc @alamb
Also, I noticed the output_ordering here: https://github.com/apache/datafusion/blob/main/datafusion/datasource/src/file_scan_config.rs#L807 isn't checked.
Maybe we can extract the check logic to a method and add a check for the output_ordering too.
Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.