ibis icon indicating copy to clipboard operation
ibis copied to clipboard

feat(api): standardize the position of null values in sorted columns and provide an api for specifying null order

Open cpcloud opened this issue 2 years ago • 4 comments

Discussed in https://github.com/ibis-project/ibis/discussions/7246

Originally posted by ianmcook September 28, 2023 Currently Ibis does not standardize the position of nulls in ordered output. If a user calls t.order_by(_.x) or t.order_by(_.x.desc()) and the column x has a mix of null and non-null values, then whether the null values appear at the top (first) or the bottom (last) of the results depends entirely on the default behavior of the Ibis backend you are using. Also, Ibis currently does not expose an option to allow users to control whether nulls sort first or last.

The behavior of the backend engines is not consistent, but almost all of them do expose an option to control whether nulls sort first or last:

backend default sort position of nulls user can specify nulls first or last
bigquery first for ascending, last for descending yes
clickhouse last yes
dask last yes
datafusion last for ascending, first for descending yes
druid ? ?
duckdb last (since 0.8) yes
impala last for ascending, first for descending yes
mssql first for ascending, last for descending no
mysql first for ascending, last for descending no
oracle last for ascending, first for descending yes
pandas last yes
polars first yes
postgres last for ascending, first for descending yes
pyspark first for ascending, last for descending yes
snowflake last for ascending, first for descending yes
sqlite first for ascending, last for descending yes
trino last yes

For the backends that do not directly expose the option to control nulls first or nulls last in SQL (mssql, mysql), there is a simple workaround to control this: do ORDER BY x IS NOT NULL, x to get nulls first and do ORDER BY x IS NULL, x to get nulls last.

Note that this does not consider NaNs in float columns, which are another story and should probably be discussed separately.

Should Ibis standardize null sort order behavior across all the backends?

cpcloud avatar Sep 29 '23 10:09 cpcloud

Also from #7246:

I think we'll default to NULLS LAST, and provide an API for specifying null ordering in table.order_by.

gforsyth avatar Sep 29 '23 11:09 gforsyth

Something to keep in mind for the future is that we'll need to add an API for this for window functions as well, since they support ordering.

cpcloud avatar Sep 29 '23 14:09 cpcloud

From what I'm reading here and for reference for whoever tackles this:

Is the agreement to default to NULLS LAST always (weather is asc() or desc()), and provide and API for specifying null ordering when possible, and when not, do the work around ORDER BY x IS NOT NULL, x ?

Out of scope:

  • this does not consider NaNs in float columns
  • we'll need to add an API for this for window functions as well, since they support ordering.

ncclementi avatar May 21 '24 19:05 ncclementi

Is the agreement to default to NULLS LAST always (weather is asc() or desc()), and provide and API for specifying null ordering when possible

Yes, that is my understanding

and when not, do the work around ORDER BY x IS NOT NULL, x ?

Since the default sort order is always ascending, and since false comes before true in ascending sort order, it would be ORDER BY x IS NULL, x

ianmcook avatar May 21 '24 20:05 ianmcook