osf.io
osf.io copied to clipboard
[ENG-561] GUID Query Optimizations
I would love to hear suggestions of more avenues I can explore for optimization.
Purpose
Make changes to ensure that retrieval of Guids is optimal and doesn't make unnecessary calls to the database
Changes
- Modified the access of a user's guid to use the cached property rather than query the database
Most instances (excepting the changes I made) of .guids
being referenced rather than _id
were for objects that used different models for their _id
field or a combination of the guid id and another id. I replaced a single instance of .guids.first()._id
with ._id
. I have listed some of these below so others may see instances that I looked into but didn't think were appropriate to change.
- /osf/metadata/serializers.py
- Can’t be changed because a Files
_id
is inherited by ObjectIDMixin (as opposed to OptionalGuidMixin)
- Can’t be changed because a Files
- /osf/models/collection.py
-
.guids.first()
can’t be changed because it’s being used to filterCollectionSubmission
objects which don’t solely use the guid id as_id
-
- /osf/models/files.py
- Can’t be changed because a Files
_id
is inherited by ObjectIDMixin (as opposed to OptionalGuidMixin)
- Can’t be changed because a Files
- /addons/wiki/models.py
- Can’t be changed because
Comment.root_target
must be filtered on the Guid object of theWikiPage
, not theWikiPage._id
.
- Can’t be changed because
QA Notes
Tested the change in the Admin app locally and it looks fine. QA ensuring the functionality of user_search in the Admin app would be useful.
Documentation
N/A
Side Effects
N/A
Ticket
https://openscience.atlassian.net/browse/ENG-561
I think you can improve the cached query here with something like
@cached_property
def _id(self):
try:
guid = self.guids.values_list('_id', flat=True).first()
except IndexError:
return None
Which is:
SELECT "osf_guid"."_id"
FROM "osf_guid"
WHERE ("osf_guid"."object_id" = 357137 AND "osf_guid"."content_type_id" = 28)
ORDER BY "osf_guid"."created" DESC
LIMIT 1
Vs the original query of:
SELECT "osf_guid"."modified",
"osf_guid"."id",
"osf_guid"."_id",
"osf_guid"."content_type_id",
"osf_guid"."object_id",
"osf_guid"."created"
FROM "osf_guid"
WHERE ("osf_guid"."object_id" = 357141 AND "osf_guid"."content_type_id" = 28)
ORDER BY "osf_guid"."created" DESC
LIMIT 1
AFTER
Benchmarking this I find the times marginally improved, I was using inv shell --print-sql
Also is this line still being used? it should be removed if not.
I think the query you provided is more efficient but it looks like it doesn't cache properly. So when I run uday = OSFUser.objects.get(username='[email protected]')
followed by uday._id
in the shell, a query to the database is run by uday._id
.
I'm not 100% sure why but I'm seeing different results on uday.__dict__._prefetched_objects_cache
depending on the implementation. With the original implementation, it looks like:
'_prefetched_objects_cache': {'guids': <QuerySet [<id:nzm8g, referent:(<OSFUser(u'[email protected]') with guid u'nzm8g'>)>]>},
and with your suggested implementation it looks like:
'_prefetched_objects_cache': {'guids': <QuerySet [<id:nzm8g, referent:(<OSFUser(u'[email protected]') with guid None>)>]>},
.
It doesn't look like guids aren't getting cached on any objects with the values_list
implementation.
I do think that if we can get your implementation to cache properly, we should use it.
For :
'_prefetched_objects_cache': {'guids': <QuerySet [<id:nzm8g, referent:(<OSFUser(u'[email protected]') with guid None>)>]>}
it looks like <OSFUser(u'[email protected]')
isn't saved because it's guid is none, can you save it and refresh the cache? I'm basically just looking at .__id_cache
that seems to be working.
Sorry, I copied+pasted your code and didn't acknowledge that I was trying to run things without returning the guid value. It works as expected and should help at least a little!
Tests are failing because the number of queries exceeds the number specified in specific test's assert statements.
After looking into this a bit more, I believe that the values_list
query that John proposed is more efficient in isolation but is less efficient within the context of the GuidMixin.
When retrieving classes that inherit from the GuidMixin (i.e. OSFUser.objects.all()
), the guids
of that object are 'included' and retrieved as well. The existing _id
property took advantage of this because cls.guids.first()._id
did not need to make any additional calls to the database. The use of values_list
in the _id
property requires a call to the database the first time _id
is run (after that, _id
is cached properly) because the use of values_list
initiates a new queryset. As a result, values_list
leads to many more additional calls to the database and shouldn't be used in this case (it is an example of an 'n+1 query' which I hadn't heard of before).