django icon indicating copy to clipboard operation
django copied to clipboard

Refs #28333 -- Added partial support for filter against window functions.

Open charettes opened this issue 1 year ago • 4 comments

Adds support for joint predicates against window functions through subquery wrapping while maintaining errors for disjointed filter attempts.

The implementation requires that filtered against window aliases are not masked by the usage of values() and errors out if it's the case.

The "qualify" wording was used to refer to predicates against window functions as it's the name of a specialized Snowflake database extension to achieve the same results without the subquery push down.

While not complete the implementation should cover most of the common use cases for filtering against window functions without requiring the complex subquery pushdown and predicate re-aliasing machinery to deal with disjointed cases.

A proper disjointed filtering implementation should likely be deferred until proper QUALIFY support lands or the ORM gains a proper subquery pushdown interface.

charettes avatar Aug 06 '22 01:08 charettes

One thing that needs to be tested more thoroughly is

  1. Properly handle column name collision
  2. Ensure references to non-selected fields kick off selection of them without requiring select_related usage.

This might require the introduction of logic analogous to relabel_aliases but for columns as the subquery wrapping merges all the column aliases together so there is no way to differentiate colliding column name by distinct table aliases anymore in the outer query filter.

charettes avatar Aug 07 '22 22:08 charettes

I believe that the only scenario that still requires adjustments is when doing a filter of the form filter(window=F("field")) as the "field" reference needs to be re-pointed to "col{idx}" to work properly.

charettes avatar Aug 09 '22 06:08 charettes

Alright, the latest issue should be resolved and tested as well. Tomorrow I'll work on splitting this in three commits

  1. Add Expression.replace_expression which can entirely supersede replace_references
  2. Move SQLCompiler.as_sql(with_col_aliases) logic to .get_select
  3. Implemented filter against jointed window function predicates

charettes avatar Aug 09 '22 07:08 charettes

For reference, the latest changes are causing two type of failures on Oracle

  1. expressions_case.tests.CaseExpressionTests.test_annotate_with_full_when failure is unrelated; it's a regression introduced by #15930.
  2. All the unmasking double-subquery pushdown tests (values, alias, etc) are failing with ORA-00933: SQL command not properly ended which hints at some form of syntax error

@felixxm let me know if that rings a bell

charettes avatar Aug 09 '22 14:08 charettes

For reference, the latest changes are causing two type of failures on Oracle

1. `expressions_case.tests.CaseExpressionTests.test_annotate_with_full_when` failure is unrelated; it's a regression introduced by [Fixed #33895 -- Fixed Case() crash with filters that match everything in When(). #15930](https://github.com/django/django/pull/15930).

2. All the unmasking double-subquery pushdown tests (`values`, `alias`, etc) are failing with `ORA-00933: SQL command not properly ended` which hints at some form of syntax error

@felixxm let me know if that rings a bell

It should be enough to remove AS:

diff --git a/django/db/models/sql/compiler.py b/django/db/models/sql/compiler.py
index 8b83175dcd..996e90fecf 100644
--- a/django/db/models/sql/compiler.py
+++ b/django/db/models/sql/compiler.py
@@ -643,7 +643,7 @@ class SQLCompiler:
                 ", ".join(cols),
                 "FROM (",
                 *result,
-                ") AS",
+                ") ",
                 self.connection.ops.quote_name("qualify_mask"),
             ]
         return result, list(inner_params) + qualify_params

felixxm avatar Aug 10 '22 06:08 felixxm

@felixxm thanks, I made the commit split and dropped the AS. Lets see what Oracle thinks.

charettes avatar Aug 10 '22 12:08 charettes

@charettes Thanks :+1: I moved the first commit with two extra small cleanups to the #15945.

felixxm avatar Aug 11 '22 05:08 felixxm

We should update docs (e.g. Window.filterable) and add release notes.

Will do!

charettes avatar Aug 11 '22 06:08 charettes

Merged the first two commits in 35911078fa40eb35859832987fedada76963c01e and 8c3046daade8d9b019928f96e53629b03060fe73.

felixxm avatar Aug 11 '22 10:08 felixxm

@felixxm good news is that adding support for filtering against annotated conditional expressions (e.g. Case and friends) happened to pave the way for direct expression filtering support and also make it easier to lift even more restriction with regards to heterogeneous predicates (disjointed filters mixing references to window functions, aggregates, and columns)

charettes avatar Aug 12 '22 02:08 charettes

Thanks for both review and merges @felixxm !

charettes avatar Aug 15 '22 14:08 charettes

Most of these tests are failing on CockroachDB due to ordering differences in the results, e.g.

======================================================================
FAIL: test_exclude (expressions_window.tests.WindowFunctionTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/tim/code/django/tests/expressions_window/tests.py", line 1065, in test_exclude
    self.assertQuerysetEqual(
  File "/home/tim/code/django/django/test/testcases.py", line 1328, in assertQuerysetEqual
    return self.assertEqual(list(items), values, msg=msg)
AssertionError: Lists differ: ['Jones', 'Jenson', 'Williams', 'Miller', 'Smith'] != ['Jenson', 'Jones', 'Williams', 'Miller', 'Smith']

First differing element 0:
'Jones'
'Jenson'

- ['Jones', 'Jenson', 'Williams', 'Miller', 'Smith']
?  ---------

+ ['Jenson', 'Jones', 'Williams', 'Miller', 'Smith']
?           +++++++++


----------------------------------------------------------------------
======================================================================
FAIL: test_filter (expressions_window.tests.WindowFunctionTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/tim/code/django/tests/expressions_window/tests.py", line 930, in test_filter
    self.assertQuerysetEqual(
  File "/home/tim/code/django/django/test/testcases.py", line 1328, in assertQuerysetEqual
    return self.assertEqual(list(items), values, msg=msg)
AssertionError: Lists differ: ['Jones', 'Jenson', 'Williams', 'Miller', 'Smith'] != ['Jenson', 'Jones', 'Williams', 'Miller', 'Smith']

First differing element 0:
'Jones'
'Jenson'

- ['Jones', 'Jenson', 'Williams', 'Miller', 'Smith']
?  ---------

+ ['Jenson', 'Jones', 'Williams', 'Miller', 'Smith']
?           +++++++++
 SELECT *
FROM   (SELECT "expressions_window_employee"."id"                        AS
               "col1",
               "expressions_window_employee"."name"                      AS
               "col2",
               "expressions_window_employee"."salary"                    AS
               "col3",
               "expressions_window_employee"."department"                AS
               "col4",
               "expressions_window_employee"."hire_date"                 AS
               "col5",
               "expressions_window_employee"."age"                       AS
               "col6",
               "expressions_window_employee"."classification_id"         AS
               "col7",
               "expressions_window_employee"."bonus"                     AS
               "col8",
               Rank()
                 OVER (
                   partition BY "expressions_window_employee"."department"
                   ORDER BY "expressions_window_employee"."salary" DESC) AS
               "department_salary_rank",
               ( Avg("expressions_window_employee"."age")
                   OVER (
                     partition BY "expressions_window_employee"."department")
                 - "expressions_window_employee"."age" )                 AS
                      "department_avg_age_diff"
        FROM   "expressions_window_employee"
        ORDER  BY "expressions_window_employee"."department" ASC,
                  "expressions_window_employee"."name" ASC) "qualify"
WHERE  "department_avg_age_diff" > 0.0;

Should the assertions ignore ordering? (Previously done for an expression_window test in 9100c664db5cca7512947e23d588cfcb937a7a92.)

timgraham avatar Aug 22 '22 13:08 timgraham

@timgraham I think there might be something wrong with CockroachDB here. The queryset is very explicitly ordered by (department, name)

https://github.com/django/django/blob/e9fd2b572410b1236da0d3d0933014138d89f44e/tests/expressions_window/tests.py#L915-L934 (see line 922)

And the SQL you provided has an explicit ORDER BY "expressions_window_employee"."department" ASC, "expressions_window_employee"."name" ASC statement.

Both Jenson and Jones are in the same Accounting department and "Jensen" < "Jones"

https://github.com/django/django/blob/e9fd2b572410b1236da0d3d0933014138d89f44e/tests/expressions_window/tests.py#L66-L74

charettes avatar Aug 22 '22 19:08 charettes

Correct, but I'm not sure if the SQL standard says that the order by in the inner query is guaranteed to carry over to the results of the outer query?

timgraham avatar Aug 22 '22 20:08 timgraham

That must be it. It seems the standard is ambiguous on the subject but all database supported by Django have implemented it for simple derived tables.

What has been our previous position on such issues? We could add a feature flag for it but none of the core Django backends would require it or we could systematically add the ORDER BY to both the inner and outer queries.

charettes avatar Aug 22 '22 20:08 charettes

@timgraham after thinking more about it the problem is really complex to solve for backends that don't carry over ordering of single derived tables like CockroachDB.

The main issue is that for the ORDER BY clause to be in the outer most query it needs to be selected by the inner query which could have an impact on the group by, aggregation and final select mask if values is used to limit the fields selected and is not a superset of the columns specified in the ORDER BY.

What I suggest we do here is potentially add a supports_derived_table_ordering feature and raise a NotSupportedError in cases where order_by is used with window functions filtering and backends don't support it. Some of the tests could then be adjusted to drop the ordering guarantees (e.g. use ordered=False) and an explicit test with ordering could be added.

Thoughts on the above @timgraham and @felixxm ?

charettes avatar Aug 26 '22 17:08 charettes

I think I'll raise the issue with CockroachDB and see if they might consider changing the behavior since they often like to mimic PostgreSQL.

timgraham avatar Aug 26 '22 18:08 timgraham

As mentioned above, the SQL standard (and the Postgres docs) don't guarantee ordering across subqueries, but PG often does preserve the order in practice. It seems potentially brittle to rely on that behavior, yet at the same time, it seems like something that PG is unlikely to change.

As far as CRDB, we introduced the propagate_input_ordering session variable to force that behavior. There is a performance penalty, though I don't know exactly how much.

rafiss avatar Aug 26 '22 22:08 rafiss

As mentioned above, the SQL standard (and the Postgres docs) don't guarantee ordering across subqueries

I assume you are referring to the F851 feature? I don't have access to the latest SQL standard myself.

but PG often does preserve the order in practice. It seems potentially brittle to rely on that behavior, yet at the same time, it seems like something that PG is unlikely to change

It seems to be the case for not only PG though; Oracle, MySQL, and SQLite seem to behave this way with regards to derived tables from subqueries ordering propagation.

charettes avatar Aug 26 '22 22:08 charettes

Here is some more discussion on the matter, with better sources: https://dba.stackexchange.com/a/184167

Quote from Postgres docs:

If sorting is not chosen, the rows will be returned in an unspecified order. The actual order in that case will depend on the scan and join plan types and the order on disk, but it must not be relied on. A particular output ordering can only be guaranteed if the sort step is explicitly chosen.

rafiss avatar Sep 07 '22 16:09 rafiss

If sorting is not chosen, the rows will be returned in an unspecified order. The actual order in that case will depend on the scan and join plan types and the order on disk, but it must not be relied on. A particular output ordering can only be guaranteed if the sort step is explicitly chosen.

This statement from the Postgres docs has nothing to do with derived table ordering though, it's a general statement about what happens if you don't specify an ORDER BY clause at all as pointed out in the comment section

The quote you added for Postgres actually applies to a different case: queries with no ORDER BY at all.

The comment section also points at this section of the documentation

Alternatively, supplying the input values from a sorted subquery will usually work. For example: SELECT xmlagg(x) FROM (SELECT x FROM test ORDER BY y DESC) AS tab; Beware that this approach can fail if the outer query level contains additional processing, such as a join, because that might cause the subquery's output to be reordered before the aggregate is computed.

The approach taken here will never perform additional processing in the outer query except for filtering via WHERE against window functions.


If we consider this undefined behaviour a solution could be to also order by at the outer most query level as well. Not too hard to do if deemed necessary.

charettes avatar Sep 07 '22 17:09 charettes

Spent some time working on it in https://github.com/django/django/pull/16040. Still not convinced this should be merged given this is not an issue for any officially supported backends.

charettes avatar Sep 09 '22 04:09 charettes