django-rest-framework-filters icon indicating copy to clipboard operation
django-rest-framework-filters copied to clipboard

Investigate using subqueries to filter across relationships

Open rpkilby opened this issue 9 years ago • 2 comments

Just to rehash, the goal of this issue is to investigate the differences between the following two queries:

Blog.objects.filter(entry__in=Entry.objects.filter(headline__contains="Lennon"))

Blog.objects.filter(
    pk__in=Entry.objects \
                .filter(headline__contains="Lennon") \
                .values('blog')
)

In the latter query, by filtering on the pk of the blogs with the blog_ids of the filtered entries, we avoid a having to JOIN and having to specify DISTINCT.


As detailed here, there are nuances to filtering across to-many relationships. In general, most users would want to find blogs with entries about Lennon during 2008, not blogs that have entries about Lennon and that also have entries during 2008.

# combined join statements, generally the desired case
Blog.objects.filter(entry__headline__contains='Lennon', entry__pub_date__year=2008)

# chained join statements, the latter, less desirable case
Blog.objects.filter(entry__headline__contains='Lennon').filter(entry__pub_date__year=2008)

# Nested join is equivalent to the first case
Blog.objects.filter(
    entry=Entry.objects.filter(
        headline__contains='Lennon',
        pub_date__year=2008,
    ),
)

# It's also possible to use a nested subquery-only approach
Blog.objects.filter(
    id__in=Entry.objects.filter(
        headline__contains='Lennon',
        pub_date__year=2008,
    ).values('blog'),
)

In general, the combined join statements syntax (1st case above) is not possible to construct with the current implementation of django-filter. To make this possible, the filter API would need to return Q objects instead of invoking filter() and exclude() directly. Additionally, exclude() does not follow the same rules as filter(), and combined joins will behave like chained excludes.

  • Need to compare the sql/results of each approach. What queries are interchangeable?
  • Test matrix:
    • filter vs exclude
    • chained vs combined
    • join statements vs nested join vs nested subquery

Either way, it will be necessary to use either the nested join or nested subquery approach. This makes it possible to filter the default queryset (permissions handling).

Todo:

  • are chained and combined subqueries equivalent?

    Blog.objects.filter(
        entry__in=Entry.objects.filter(
            headline__contains='Lennon',
            pub_date__year=2008,
        ),
    )
    
    Blog.objects.filter(
        entry=Entry.objects \
            .filter(headline__contains='Lennon') \
            .filter(pub_date__year=2008)
    )
    

rpkilby avatar Jul 31 '16 18:07 rpkilby

Results:

q1 is the current behavior, but generally incorrect and has poor performance (n table joins for n filter statements). This also requires a distinct() call in to deduplicate the results.

q2 is unfeasible to generate, but correct.

q3/q4 and q5/q6 should be equivalent, with the exception that q3 & q4 rely on a join. That said, q3 & q4 are easier to generate than the subquery approach. The subquery approach is more difficult since it doesn't automatically account for to_fields.


filter() queries:

join statements nested join nested subquery
chained q1 q3 q5
combined q2 q4 q6

q1 = Blog.objects \
    .filter(entry__headline__contains='Lennon') \
    .filter(entry__pub_date__year=2008)

q2 = Blog.objects.filter(
    entry__headline__contains='Lennon',
    entry__pub_date__year=2008,
)

q3 = Blog.objects.filter(
    entry__in=Entry.objects \
        .filter(headline__contains='Lennon') \
        .filter(pub_date__year=2008)
)

q4 = Blog.objects.filter(
    entry__in=Entry.objects.filter(
        headline__contains='Lennon',
        pub_date__year=2008,
    ),
)

q5 = Blog.objects.filter(
    id__in=Entry.objects \
        .filter(headline__contains='Lennon') \
        .filter(pub_date__year=2008) \
        .values('blog')
)

q6 = Blog.objects.filter(
    id__in=Entry.objects.filter(
        headline__contains='Lennon',
        pub_date__year=2008,
    ).values('blog'),
)

q1

SELECT "testapp_blog"."id",
       "testapp_blog"."name"
FROM "testapp_blog"
INNER JOIN "testapp_entry" ON ("testapp_blog"."id" = "testapp_entry"."blog_id")
INNER JOIN "testapp_entry" T3 ON ("testapp_blog"."id" = T3."blog_id")
WHERE ("testapp_entry"."headline" LIKE %Lennon% ESCAPE '\'
       AND T3."pub_date" BETWEEN 2008-01-01 AND 2008-12-31) 

q2

SELECT "testapp_blog"."id",
       "testapp_blog"."name"
FROM "testapp_blog"
INNER JOIN "testapp_entry" ON ("testapp_blog"."id" = "testapp_entry"."blog_id")
WHERE ("testapp_entry"."headline" LIKE %Lennon% ESCAPE '\'
       AND "testapp_entry"."pub_date" BETWEEN 2008-01-01 AND 2008-12-31) 

q3 & q4

SELECT "testapp_blog"."id",
       "testapp_blog"."name"
FROM "testapp_blog"
INNER JOIN "testapp_entry" ON ("testapp_blog"."id" = "testapp_entry"."blog_id")
WHERE "testapp_entry"."id" IN
    (SELECT U0."id"
     FROM "testapp_entry" U0
     WHERE (U0."headline" LIKE %Lennon% ESCAPE '\'
            AND U0."pub_date" BETWEEN 2008-01-01 AND 2008-12-31)) 

q5 & q6

SELECT "testapp_blog"."id",
       "testapp_blog"."name"
FROM "testapp_blog"
WHERE "testapp_blog"."id" IN
    (SELECT U0."blog_id"
     FROM "testapp_entry" U0
     WHERE (U0."headline" LIKE %Lennon% ESCAPE '\'
            AND U0."pub_date" BETWEEN 2008-01-01 AND 2008-12-31)) 

rpkilby avatar Nov 13 '16 01:11 rpkilby

The upcoming v1 will change how queries are constructed, going from queries like q1 to queries like q3.

However, this issue is now more about the differences q3 and q5, and that will not be resolved in the release. For now, removing from the milestone.

The change from q3 to q5 would also make the distinct argument pointless.

rpkilby avatar Oct 26 '17 13:10 rpkilby