django-import-export icon indicating copy to clipboard operation
django-import-export copied to clipboard

Performance: slow export with ForeignKeys id

Open PetrDlouhy opened this issue 4 years ago • 7 comments

I tried to investigate slow export on my model. I found out, that if I write dehydrate function on ModelResource like this:

    def dehydrate_user_profile(self, search_report):
        return search_report.user_profile_id

The export rapidly speeds up. I think, that django-import-export has bad method of getting the id values for ForeignKeys.

The bad performance is similar to this (notice the . instead of _)

    def dehydrate_user_profile(self, search_report):
        return search_report.user_profile.id

PetrDlouhy avatar Jul 19 '19 18:07 PetrDlouhy

Hey @PetrDlouhy ! Thanks for the investigation. How much of an increase would you say? Would you like to submit a PR with some of these changes?

andrewgy8 avatar Jul 24 '19 06:07 andrewgy8

@andrewgy8 In my case the increase was extreme - when I try with just 1300 records it is almost instant vs. 20 seconds.

I investigated a little bit further. The problem is, that on line 90 in fields.py in get_value() function the value of foreign key is fetched by:

value = getattr(value, attr, None)

If the value is not already cached or not fetch by the queryset with select_related, it will trigger new DB query for every row of the export. That value is further processed by Widget.render() where the value of certain key is fetched from the related object. This is unnecessary, if the key we want is primary key of the ForeignKey relation.

I am not sure, how to fix this properly. One method might be to detect, that we want to fetch pk/id key and bypass the get_value()->Widget.render() mechanism. Downside is that it might change behaviour if one of these function was overridden by user.

Second method is to automatically include select_related to the queryset. Upside is that it can greatly improve performance not only for primary keys, but also for related fields. Downside is that it might not work for some user modified querysets properly. Also in some cases it can lead to slowdown, because the select_related would increase the query complexity and/or fetch unwanted data (which might be solved by using only).

PetrDlouhy avatar Jul 24 '19 09:07 PetrDlouhy

Any ideas how to improve this?

ghost avatar Dec 18 '19 20:12 ghost

Antoher option, if you are doing special things inside a derived custom ForeignKeyWidget is to inject a chached queryset when the before_import method fires for the rousource.

def before_import(self, dataset, using_transactions, dry_run, **kwargs): self.fields['my_column_name'].widget.cached_dataset1 = SomeModel1.objects.all() self.fields['my_column_name'].widget.cached_dataset2 = SomeModel2.objects.all()

This is also not an optimal solution as it pulls in all data for the cached dataset, which migth be inefficient when they are large models.

JohnieBraaf avatar Jul 13 '21 07:07 JohnieBraaf

A hotfix for now, overriding get_export_queryset:

  def get_export_queryset(self, request):
      return (
          super(YourModelAdmin, self)
          .get_export_queryset(request)
          .select_related()
      )

jfilter avatar Sep 01 '21 11:09 jfilter

@jfilter 's answer works great when you're exporting from the admin. If you'll be exporting from the shell or your views as well, you could override get_queryset directly on your Resource, which should cover all cases:

def get_queryset(self):
    return super().get_queryset().selected_related('RELATED_MODEL1', 'RELATED_MODEL2', ...)

Maybe amending the Getting started section of the docs to suggest users do this if they're doing related lookups could be all that's needed?

Alternatively, an idea could be to mirror ModelAdmin's list_select_related in Resource, maybe as follows:

class ModelResource(Resource, metaclass=ModelDeclarativeMetaclass):
    ...
    def get_queryset(self):
        """
        Returns a queryset of all objects for this model. Override this if you
        want to limit the returned queryset.
        """
        qs = self._meta.model.objects.all()
        select_related = getattr(self, 'export_select_related`, None)
        if select_related:
            qs = qs.select_related(*select_related)
        return qs

class MyResource(resources.ModelResource):
    export_selected_related = ('RELATED_MODEL1', 'RELATED_MODEL2', ...)

Can do a proper PR later if this looks sensible.

mikaraunio avatar Feb 17 '22 00:02 mikaraunio

@mikaraunio Thanks for this update. I feel that updating the 'Getting started' docs would be a great way to move this forward. If you are able to submit a PR for this documentation update that would be appreciated.

matthewhegarty avatar Feb 17 '22 08:02 matthewhegarty

See also #1715

matthewhegarty avatar Dec 14 '23 19:12 matthewhegarty

@PetrDlouhy Is now a good time to resolve this issue?

matthewhegarty avatar Dec 14 '23 19:12 matthewhegarty

Some steps to reproduce with test app.

Create a sample dataset (books with authors)

  1. Add the following to core/admin.py (BookResource)
def before_import_row(self, row, **kwargs):
    author_name = row["author_name"]
    obj, created = Author.objects.get_or_create(name=author_name, defaults={"name": author_name})
    row["author"] = obj.id
  1. Import the attached file (1294-books.tsv) via Admin UI

Test Export

Add the following to core/settings.py. (this will log SQL output)

LOGGING = {
    "version": 1,
    "disable_existing_loggers": False,
    "handlers": {
        "console": {"level": "DEBUG", "class": "logging.StreamHandler"},
    },
    "loggers": {
        "django.db.backends": {"level": "DEBUG", "handlers": ["console"]},
        "django.server": {
            "handlers": ["console"],
            "level": "WARNING",
            "propagate": False,
        },
    }
}

Export books

Now run an export of books and note that the SQL output contains a SELECT for each author (and category). See attached.

1294-books.tsv.txt

sql_output.txt

When the fix in #1717 is applied, the log does not show the queries for Author select.

Adding the following to BookResource also prevents the additional Author SELECTs being run:

    def filter_export(self, queryset, **kwargs):
        return queryset.select_related("author")

matthewhegarty avatar Dec 20 '23 12:12 matthewhegarty