django-import-export
django-import-export copied to clipboard
Performance: slow export with ForeignKeys id
I tried to investigate slow export on my model. I found out, that if I write dehydrate
function on ModelResource
like this:
def dehydrate_user_profile(self, search_report):
return search_report.user_profile_id
The export rapidly speeds up. I think, that django-import-export
has bad method of getting the id
values for ForeignKeys.
The bad performance is similar to this (notice the .
instead of _
)
def dehydrate_user_profile(self, search_report):
return search_report.user_profile.id
Hey @PetrDlouhy ! Thanks for the investigation. How much of an increase would you say? Would you like to submit a PR with some of these changes?
@andrewgy8 In my case the increase was extreme - when I try with just 1300 records it is almost instant vs. 20 seconds.
I investigated a little bit further. The problem is, that on line 90
in fields.py
in get_value()
function the value of foreign key is fetched by:
value = getattr(value, attr, None)
If the value is not already cached or not fetch by the queryset with select_related
, it will trigger new DB query for every row of the export.
That value is further processed by Widget.render()
where the value of certain key is fetched from the related object. This is unnecessary, if the key we want is primary key of the ForeignKey relation.
I am not sure, how to fix this properly.
One method might be to detect, that we want to fetch pk
/id
key and bypass the get_value()->Widget.render()
mechanism.
Downside is that it might change behaviour if one of these function was overridden by user.
Second method is to automatically include select_related
to the queryset.
Upside is that it can greatly improve performance not only for primary keys, but also for related fields.
Downside is that it might not work for some user modified querysets properly. Also in some cases it can lead to slowdown, because the select_related
would increase the query complexity and/or fetch unwanted data (which might be solved by using only
).
Any ideas how to improve this?
Antoher option, if you are doing special things inside a derived custom ForeignKeyWidget is to inject a chached queryset when the before_import method fires for the rousource.
def before_import(self, dataset, using_transactions, dry_run, **kwargs): self.fields['my_column_name'].widget.cached_dataset1 = SomeModel1.objects.all() self.fields['my_column_name'].widget.cached_dataset2 = SomeModel2.objects.all()
This is also not an optimal solution as it pulls in all data for the cached dataset, which migth be inefficient when they are large models.
A hotfix for now, overriding get_export_queryset
:
def get_export_queryset(self, request):
return (
super(YourModelAdmin, self)
.get_export_queryset(request)
.select_related()
)
@jfilter 's answer works great when you're exporting from the admin. If you'll be exporting from the shell or your views as well, you could override get_queryset
directly on your Resource
, which should cover all cases:
def get_queryset(self):
return super().get_queryset().selected_related('RELATED_MODEL1', 'RELATED_MODEL2', ...)
Maybe amending the Getting started section of the docs to suggest users do this if they're doing related lookups could be all that's needed?
Alternatively, an idea could be to mirror ModelAdmin's list_select_related
in Resource
, maybe as follows:
class ModelResource(Resource, metaclass=ModelDeclarativeMetaclass):
...
def get_queryset(self):
"""
Returns a queryset of all objects for this model. Override this if you
want to limit the returned queryset.
"""
qs = self._meta.model.objects.all()
select_related = getattr(self, 'export_select_related`, None)
if select_related:
qs = qs.select_related(*select_related)
return qs
class MyResource(resources.ModelResource):
export_selected_related = ('RELATED_MODEL1', 'RELATED_MODEL2', ...)
Can do a proper PR later if this looks sensible.
@mikaraunio Thanks for this update. I feel that updating the 'Getting started' docs would be a great way to move this forward. If you are able to submit a PR for this documentation update that would be appreciated.
See also #1715
@PetrDlouhy Is now a good time to resolve this issue?
Some steps to reproduce with test app.
Create a sample dataset (books with authors)
- Add the following to
core/admin.py
(BookResource
)
def before_import_row(self, row, **kwargs):
author_name = row["author_name"]
obj, created = Author.objects.get_or_create(name=author_name, defaults={"name": author_name})
row["author"] = obj.id
- Import the attached file (1294-books.tsv) via Admin UI
Test Export
Add the following to core/settings.py
. (this will log SQL output)
LOGGING = {
"version": 1,
"disable_existing_loggers": False,
"handlers": {
"console": {"level": "DEBUG", "class": "logging.StreamHandler"},
},
"loggers": {
"django.db.backends": {"level": "DEBUG", "handlers": ["console"]},
"django.server": {
"handlers": ["console"],
"level": "WARNING",
"propagate": False,
},
}
}
Export books
Now run an export of books and note that the SQL output contains a SELECT for each author (and category). See attached.
When the fix in #1717 is applied, the log does not show the queries for Author select.
Adding the following to BookResource
also prevents the additional Author SELECTs being run:
def filter_export(self, queryset, **kwargs):
return queryset.select_related("author")