QuerySet.select_related(5) causes additional queries accessing DBRef of DBRef?

Open cldmkr opened this issue 3 years ago • 1 comments

Hello,

I want to load and deference a whole query/model as efficient as possible into memory, without subsequent calls to the database. Using QuerySet.select_related(max_depth=[n]) I think it is not deserializing the expected [n] levels, because as far as I see accessing a DBRef of a DBRef still causes additional queries to the database.

Example showing the issue:

Assuming a datamodel with 4 classes, like: 'A' <-1:n-- 'B' <-1:n-- 'C' <-1:n-- 'D' Means 1:n relationship where D references C, which references B, which references A.

Selecting an object of class 'D' using .select_related(max_depth=5) and accessing its DBRef fields via <D>.c.b.a.name, in my understanding all data should be there already and no additional requests to the database should be necessary.

Following test code generates test data, accesses the data in 3 ways, and uses monitoring.CommandListener to count the database accesses.

import time
from datetime import datetime
from mongoengine import *
from pymongo import monitoring


class CommandLogger(monitoring.CommandListener):
    def __init__(self):
        self.counter = 0

    def started(self, event):
        self.counter += 1

    def succeeded(self, event):
        pass

    def failed(self, event):
        pass

cl = CommandLogger()
monitoring.register(cl)



class A(Document):
    name = StringField(max_length=200)
    def __str__(self):
        return self.name

class B(Document):
    name = StringField(max_length=200)
    a = ReferenceField(A, required=True)
    def __str__(self):
        return self.name

class C(Document):
    name = StringField(max_length=200)
    b = ReferenceField(B, required=True)
    def __str__(self):
        return self.name

class D(Document):
    name = StringField(max_length=200)
    c = ReferenceField(C, required=True)
    def __str__(self):
        return self.name


def generate_documents(count):
    for a in range(1, count+1):
        doc_a = A(name='a_'+str(a))
        doc_a.save()
        for b in range(1, count+1):
            doc_b = B(name=doc_a.name + '_b_'+str(b), a=doc_a)
            doc_b.save()
            for c in range(1, count+1):
                doc_c = C(name=doc_b.name + '_c_'+str(c), b=doc_b)
                doc_c.save()
                for d in range(1, count+1):
                    doc_d = D(name=doc_c.name + '_d_'+str(d), c=doc_c)
                    doc_d.save()


# MAIN (for testing)
if __name__ == '__main__':
    # connect to database
    name = 'test_local'
    host = '127.0.0.1'
    port = 27017

    db = connect(name=name, host=host, port=port)

    print('Test preparations - generate documents')
    generate_documents(10)

    starttime = datetime.now()
    cl.counter = 0
    print('Test 1a - access: [ x.c.name for x in D.objects() ]')
    data = D.objects()
    [ x.c.name for x in data ]
    print('Test 1a - Duration: {} / calls: {}'.format(datetime.now() - starttime, cl.counter))

    starttime = datetime.now()
    cl.counter = 0
    print('Test 1b - access: [ x.c.b.a.name for x in D.objects() ]')
    data = D.objects()
    [ x.c.b.a.name for x in data ]
    print('Test 1b - Duration: {} / calls: {}'.format(datetime.now() - starttime, cl.counter))

    starttime = datetime.now()
    cl.counter = 0
    print('Test 2a - access: [ x.c.name for x in D.objects().select_related() ]')
    data = D.objects().select_related()
    [ x.c.name for x in data ]
    print('Test 2a - Duration: {} / calls: {}'.format(datetime.now() - starttime, cl.counter))

    starttime = datetime.now()
    cl.counter = 0
    print('Test 2b - access: [ x.c.b.a.name for x in D.objects().select_related() ]')
    data = D.objects().select_related()
    [ x.c.b.a.name for x in data ]
    print('Test 2b - Duration: {} / calls: {}'.format(datetime.now() - starttime, cl.counter))

    starttime = datetime.now()
    cl.counter = 0
    print('Test 3a - access: [ x.c.name for x in D.objects().select_related(max_depth=30) ]')
    data = D.objects().select_related(max_depth=30)
    [ x.c.name for x in data ]
    print('Test 3a - Duration: {} / calls: {}'.format(datetime.now() - starttime, cl.counter))

    starttime = datetime.now()
    cl.counter = 0
    print('Test 3b - access: [ x.c.b.a.name for x in D.objects().select_related(max_depth=30) ]')
    data = D.objects().select_related(max_depth=30)
    [ x.c.b.a.name for x in data ]
    print('Test 3b - Duration: {} / calls: {}'.format(datetime.now() - starttime, cl.counter))

Testing with above model and code and n=10 (=10xA, 100xB, 1000xC, 10000xD), I see following result:

[ x.c.b.a for x in D.objects() ] -> 30k reads -> makes sense as each document and each reference gets deserialized separately
[ x.c.b.a for x in D.objects().select_related() ] -> 2k reads -> D & C are loaded initially, access to B & A cause another 2k readys, makes sense.
[ x.c.b.a for x in D.objects().select_related(max_depth=30) ] -> Would expect no additional reads, but same result as without specifying max_depth. Seems like parameter is ignored and only direct reference is deserialized initially?

### python3 test.py
Test preparations - cleanup document types A,B,C,D
Test preparations - generate documents
Test 1a - access: [ x.c.name for x in D.objects() ]
Test 1a - Duration: 0:00:06.602399 / calls: 10002
Test 1b - access: [ x.c.b.a.name for x in D.objects() ]
Test 1b - Duration: 0:00:16.283449 / calls: 30002
Test 2a - access: [ x.c.name for x in D.objects().select_related() ]
Test 2a - Duration: 0:00:00.485834 / calls: 4
Test 2b - access: [ x.c.b.a.name for x in D.objects().select_related() ]
Test 2b - Duration: 0:00:01.442558 / calls: 2004
Test 3a - access: [ x.c.name for x in D.objects().select_related(max_depth=30) ]
Test 3a - Duration: 0:00:00.512354 / calls: 4
Test 3b - access: [ x.c.b.a.name for x in D.objects().select_related(max_depth=30) ]
Test 3b - Duration: 0:00:01.533210 / calls: 2004
###

--> Test 3b shouldn't have caused additional 2k requests, or?

Appreciate verification or any hint on how to achieve the full load to memory including all referenced objects up to the root, so that no additional DB accesses are performed.

Additional info: I also tested using mongoengine.context_managers.query_counter, but looks like the maximum reported query count 1269 ? for using n <= 4 values are ok, n > 4 generates more than 1269 requests, so all queries show a count of 1269. An additional bug?

Used versions:

Python 3.8.10
mongoengine==0.23.1
pymongo==3.12.1
MongoDB 3.6.8 Community

and also tested with same result using:

mongoengine==0.24.0
pymongo==4.0

Feb 22 '22 19:02 cldmkr

Could be that it is the same issue as issue #1988 and closed PR "select_related ReferenceFields down to max_depth #1226)".

Feb 24 '22 12:02 cldmkr