feat(api): standardize the position of null values in sorted columns and provide an api for specifying null order
Discussed in https://github.com/ibis-project/ibis/discussions/7246
Originally posted by ianmcook September 28, 2023
Currently Ibis does not standardize the position of nulls in ordered output. If a user calls t.order_by(_.x) or t.order_by(_.x.desc()) and the column x has a mix of null and non-null values, then whether the null values appear at the top (first) or the bottom (last) of the results depends entirely on the default behavior of the Ibis backend you are using. Also, Ibis currently does not expose an option to allow users to control whether nulls sort first or last.
The behavior of the backend engines is not consistent, but almost all of them do expose an option to control whether nulls sort first or last:
| backend | default sort position of nulls | user can specify nulls first or last |
|---|---|---|
| bigquery | first for ascending, last for descending | yes |
| clickhouse | last | yes |
| dask | last | yes |
| datafusion | last for ascending, first for descending | yes |
| druid | ? | ? |
| duckdb | last (since 0.8) | yes |
| impala | last for ascending, first for descending | yes |
| mssql | first for ascending, last for descending | no |
| mysql | first for ascending, last for descending | no |
| oracle | last for ascending, first for descending | yes |
| pandas | last | yes |
| polars | first | yes |
| postgres | last for ascending, first for descending | yes |
| pyspark | first for ascending, last for descending | yes |
| snowflake | last for ascending, first for descending | yes |
| sqlite | first for ascending, last for descending | yes |
| trino | last | yes |
For the backends that do not directly expose the option to control nulls first or nulls last in SQL (mssql, mysql), there is a simple workaround to control this: do ORDER BY x IS NOT NULL, x to get nulls first and do ORDER BY x IS NULL, x to get nulls last.
Note that this does not consider NaNs in float columns, which are another story and should probably be discussed separately.
Should Ibis standardize null sort order behavior across all the backends?
Also from #7246:
I think we'll default to
NULLS LAST, and provide an API for specifying null ordering intable.order_by.
Something to keep in mind for the future is that we'll need to add an API for this for window functions as well, since they support ordering.
From what I'm reading here and for reference for whoever tackles this:
Is the agreement to default to NULLS LAST always (weather is asc() or desc()), and provide and API for specifying null ordering when possible, and when not, do the work around ORDER BY x IS NOT NULL, x ?
Out of scope:
- this does not consider NaNs in float columns
- we'll need to add an API for this for window functions as well, since they support ordering.
Is the agreement to default to NULLS LAST always (weather is
asc()ordesc()), and provide and API for specifying null ordering when possible
Yes, that is my understanding
and when not, do the work around
ORDER BY x IS NOT NULL, x?
Since the default sort order is always ascending, and since false comes before true in ascending sort order, it would be ORDER BY x IS NULL, x