django-import-export

can we replace tablib dependency with pyexcel

Open jnoortheen opened this issue 4 years ago • 13 comments

The downside of using tablib is that all data is processed in memory, which is not good for large datasets. Can we replace the dependency with pyexcel?

  • it supports Django models natively.
  • it has stream APIs to support large files
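
To make the memory concern concrete, here is a minimal sketch of the difference between loading a whole file and streaming it row by row. It uses only the standard-library csv module and does not show the tablib or pyexcel APIs; the function names `load_all_rows`, `iter_rows`, and `import_rows` are hypothetical.

```python
import csv
import io
from typing import Dict, Iterable, Iterator, List


def load_all_rows(stream: io.TextIOBase) -> List[Dict[str, str]]:
    """In-memory style: every row is materialised before any is processed."""
    return list(csv.DictReader(stream))


def iter_rows(stream: io.TextIOBase) -> Iterator[Dict[str, str]]:
    """Streaming style: rows are yielded one at a time as they are read."""
    for row in csv.DictReader(stream):
        yield row


def import_rows(rows: Iterable[Dict[str, str]]) -> int:
    """Hypothetical import step: counts rows instead of saving model instances."""
    count = 0
    for _row in rows:
        count += 1
    return count


SAMPLE = "id,name\n1,Alice\n2,Bob\n"

# Both approaches process the same rows; only peak memory use differs.
in_memory_count = import_rows(load_all_rows(io.StringIO(SAMPLE)))
streamed_count = import_rows(iter_rows(io.StringIO(SAMPLE)))
```

With the generator version, memory use stays roughly constant as the file grows, whereas the list version grows with the number of rows.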

jnoortheen avatar Jul 20 '20 07:07 jnoortheen

Does it support all the formats that we currently have (json, xls, xlsx, csv, etc.)? I'm unfamiliar with the proposed lib.

Maybe @claudep could weigh in on the comparison?

andrewgy8 avatar Jul 20 '20 10:07 andrewgy8

pyexcel looks a bit more actively maintained than tablib. I cannot state anything about functionality comparison, but according to the docs, several formats are also supported.

claudep avatar Jul 20 '20 10:07 claudep

@andrewgy8 yes it supports those formats with separate libraries installed.

jnoortheen avatar Jul 20 '20 11:07 jnoortheen

Cool. The next steps I propose would be to:

  • ask the kind folks at pyexcel what they think?
  • see how many places we use tablib, and if it can be easily replaced.
  • do a spike to see if it's possible to replace the functionality in django-import-export.
  • test the performance results of importing/exporting using pyexcel.

Then, when that's all done, we need to consider what the rollout process would look like. Do we allow our users to use either tablib or pyexcel? Or are we opinionated about the libraries?

Can I assign this issue to you, @jnoortheen?

andrewgy8 avatar Jul 20 '20 11:07 andrewgy8

There is already this library: https://github.com/pyexcel-webwares/django-excel.

But it doesn't have anything like the Resource class ...

Yes, I can work on it this weekend.

jnoortheen avatar Jul 20 '20 11:07 jnoortheen

You could also consider Pandas as a drop-in replacement for Tablib, i.e. it just reads or writes a DataFrame in the appropriate format, and Import-Export iterates over the rows in the DataFrame using itertuples() to do the actual import. Similarly, it is trivial to build a DataFrame from a queryset using df = pd.DataFrame.from_records(queryset.values().iterator()). The use of .values() yields dictionaries rather than model instances, and .iterator() ensures that Django doesn't cache the queryset, which helps with memory usage. Given that Pandas has better memory usage than Tablib, that would probably be an overall win to start with.

But the even bigger win would be that we are passing DataFrames into before_import rather than Tablib datasets, because then we open up all sorts of efficient data manipulations for mapping values or applying functions, many of which can be vectorized which gives order of magnitude improvements over iterating over them in Python to apply a function row by row.

#1080 might be relevant, but I think using Pandas as a drop-in replacement for tablib is a great idea, provided that we don't think people will struggle with installing Pandas on some operating systems. It works great on Linux, and on Anaconda on all operating systems, but it depends on Numpy and I am not sure if binary packages are available for plain Python on OSX, say.

If we do think that Pandas is too heavy as a dependency, then a plugin is probably still a good idea.

rhunwicks avatar Jul 29 '20 01:07 rhunwicks

@rhunwicks I think a backend architecture would work here, using any one of the available dependencies. Pandas can still be difficult to install on some operating systems, and its runtime dependencies can be hard to find. A pure Python package is truly portable.

jnoortheen avatar Jul 29 '20 07:07 jnoortheen

I am happy to work on a Pandas version in parallel, if that will help with the API design for making the dataset read/write pluggable?

rhunwicks avatar Jul 29 '20 12:07 rhunwicks

I tried to take this on, but it seems a large refactoring is needed, and it would break some of my own custom Resource subclasses. As of now I don't work with large files, so this is not as pressing as it was before. Anyone interested, please create an issue.

jnoortheen avatar Aug 20 '20 15:08 jnoortheen

Thanks @jnoortheen for the transparency. And I agree, it's a major change that will require multiple changes.

However, if someone wants to get started, they could start with one piece of functionality and move it to a pluggable backend. No need to have settings or anything. If I get some time soon, I will try to come up with a little working example, as this sounds like a fun challenge. 😍
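
A minimal sketch of what moving one piece of functionality (here, export) to a pluggable backend could look like, with no settings involved. The registry, decorator, and function names are hypothetical and are not taken from django-import-export or tablib; only the standard-library csv and json modules are used.

```python
import csv
import io
import json
from typing import Callable, Dict, Iterable

Row = Dict[str, str]
Exporter = Callable[[Iterable[Row]], str]

# Hypothetical registry mapping a format name to an export function.
_EXPORTERS: Dict[str, Exporter] = {}


def register_exporter(name: str) -> Callable[[Exporter], Exporter]:
    """Decorator that registers an export backend under a format name."""
    def decorator(func: Exporter) -> Exporter:
        _EXPORTERS[name] = func
        return func
    return decorator


@register_exporter("csv")
def export_csv(rows: Iterable[Row]) -> str:
    rows = list(rows)
    buffer = io.StringIO()
    if rows:
        # Column order is taken from the first row.
        writer = csv.DictWriter(
            buffer, fieldnames=list(rows[0].keys()), lineterminator="\n"
        )
        writer.writeheader()
        writer.writerows(rows)
    return buffer.getvalue()


@register_exporter("json")
def export_json(rows: Iterable[Row]) -> str:
    return json.dumps(list(rows))


def export(rows: Iterable[Row], fmt: str) -> str:
    """Dispatch to whichever backend is registered for the requested format."""
    try:
        exporter = _EXPORTERS[fmt]
    except KeyError:
        raise ValueError(f"No backend registered for format {fmt!r}") from None
    return exporter(rows)
```

Under this design, a tablib, pyexcel, or Pandas backend would be one more registered function, and callers would only ever go through `export`.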

andrewgy8 avatar Aug 21 '20 06:08 andrewgy8

Dear @andrewgy8

Cool package, but really too many dependencies. Could the optional formats be split out into separate pluggable pip packages? Just keep the Python 3 standard-library csv and json as the required formats :)

pip install django-import-export
Successfully installed diff-match-patch-20200713 django-import-export-2.4.0 markuppy-1.14 odfpy-1.4.1 tablib-3.0.0 xlrd-1.2.0 xlwt-1.3.0

tmc9031 avatar Dec 06 '20 14:12 tmc9031

I think we could have some issues now because xlrd no longer supports xlsx files. I don't know if this will definitely be an issue, but it has just hit me on another project.

matthewhegarty avatar Dec 12 '20 20:12 matthewhegarty

> cool pip, BUT really too many dependency ...

v4 supports optional dependencies
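
For readers unfamiliar with the pattern, here is one way a package can treat format libraries as optional: check at runtime which ones are importable and offer only those formats. This is a sketch under assumptions, not the actual django-import-export or tablib code; the mapping and the function name `available_formats` are hypothetical.

```python
import importlib.util
from typing import Dict, List, Optional

# Assumed mapping from format name to the third-party module it needs.
# None means the standard library is enough. Illustrative only.
FORMAT_REQUIREMENTS: Dict[str, Optional[str]] = {
    "csv": None,
    "json": None,
    "xls": "xlrd",
    "xlsx": "openpyxl",
    "yaml": "yaml",
}


def available_formats() -> List[str]:
    """Return the formats whose required module can be imported."""
    return [
        fmt
        for fmt, module in FORMAT_REQUIREMENTS.items()
        if module is None or importlib.util.find_spec(module) is not None
    ]
```

csv and json are always listed because they need nothing beyond the standard library; the other formats appear only when their library is installed.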

matthewhegarty avatar Oct 18 '23 09:10 matthewhegarty

Closing - see #445 for further discussion on abstracting tablib

matthewhegarty avatar Feb 29 '24 08:02 matthewhegarty