django-import-export
Can we replace the tablib dependency with pyexcel?
The downside of using tablib is that all data is processed in memory, which is not good for large datasets. Can we replace the dependency with pyexcel?
- It supports Django models natively.
- It has streaming APIs to support large files.
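To illustrate the difference between in-memory and streamed processing, here is a minimal sketch using only Python's standard csv module. It does not use the tablib or pyexcel APIs, and the helper names load_all_rows and stream_rows are hypothetical.

```python
import csv
import io
from typing import Iterator, List, TextIO


def load_all_rows(fileobj: TextIO) -> List[dict]:
    """Materialise every row at once: memory use grows with file size."""
    return list(csv.DictReader(fileobj))


def stream_rows(fileobj: TextIO) -> Iterator[dict]:
    """Yield one row at a time: only the current row is held in memory."""
    for row in csv.DictReader(fileobj):
        yield row


if __name__ == "__main__":
    sample = "id,name\n1,alice\n2,bob\n"
    for row in stream_rows(io.StringIO(sample)):
        print(row["id"], row["name"])
```

With the streaming variant, an import loop can process and discard each row before reading the next one, which is the property that matters for large files.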
Does it support all the formats that we currently have (json, xls, xlsx, csv, etc.)? I'm unfamiliar with the proposed library.
Maybe @claudep could weigh in on the comparison?
pyexcel looks a bit more actively maintained than tablib. I cannot say anything about a functionality comparison, but according to the docs, several formats are also supported.
@andrewgy8 Yes, it supports those formats when the separate libraries are installed.
Cool. The next steps I propose would be to:
- ask the kind folks at pyexcel what they think.
- see how many places we use tablib, and whether it can be easily replaced.
- do a spike to see if it's possible to replace the functionality in django-import-export.
- test the performance results of importing/exporting using pyexcel.
Then, when that's all done, we need to consider what the rollout process would look like. Do we allow our users to use either tablib or pyexcel? Or are we opinionated about the libraries?
Can I assign this issue to you, @jnoortheen?
There is already this library: https://github.com/pyexcel-webwares/django-excel.
But it doesn't have anything like the Resource class ...
Yes, I can work on it this weekend.
You could also consider Pandas as a drop-in replacement for Tablib, i.e. it just reads or writes a DataFrame in the appropriate format, and Import-Export iterates over the rows in the DataFrame using itertuples() to do the actual import. Similarly, it is trivial to build a DataFrame from a queryset using df = pd.DataFrame.from_records(queryset.iterator()). The use of .iterator() ensures that Django doesn't cache the queryset and helps with memory usage. Given that Pandas has better memory usage than Tablib, that would probably be an overall win to start with.
But the even bigger win would be that we are passing DataFrames into before_import rather than Tablib datasets, because then we open up all sorts of efficient data manipulations for mapping values or applying functions. Many of these can be vectorized, which gives order-of-magnitude improvements over iterating in Python to apply a function row by row.
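A rough sketch of that idea follows. It is not part of django-import-export. Since DataFrame.from_records is documented to take tuples or dicts, the sketch assumes the rows arrive as dicts, as queryset.values().iterator() would produce, and uses a hypothetical generator fake_queryset_rows in place of a real queryset so that it runs without Django.

```python
import pandas as pd


def fake_queryset_rows():
    """Hypothetical stand-in for queryset.values().iterator(): one dict per row."""
    yield {"id": 1, "status": "a"}
    yield {"id": 2, "status": "i"}
    yield {"id": 3, "status": "a"}


# Build the DataFrame from an iterator of dict rows.
df = pd.DataFrame.from_records(fake_queryset_rows())

# Vectorised value mapping over the whole column, instead of a Python loop per row.
df["status"] = df["status"].map({"a": "active", "i": "inactive"})

# Row-by-row iteration, as the import step itself would do.
imported = [(row.id, row.status) for row in df.itertuples(index=False)]
```

The column-wide map is the kind of transformation a before_import hook could apply if it received a DataFrame.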
#1080 might be relevant, but I think using Pandas as a drop-in replacement for tablib is a great idea, provided that we don't think people will struggle with installing Pandas on some operating systems. It works great on Linux, and on Anaconda on all operating systems, but it depends on Numpy and I am not sure if binary packages are available for plain Python on OSX, say.
If we do think that Pandas is too heavy as a dependency, then a plugin is probably still a good idea.
@rhunwicks I think a backend architecture would work here, using any one of the available dependencies. Pandas is still difficult to install on some operating systems, where its runtime dependencies can be hard to find. A pure Python package is truly portable.
I am happy to work on a Pandas version in parallel, if that will help with the API design for making the dataset read/write pluggable?
I tried to take this on, but it seems a large refactoring is needed, and it would break some of my own custom Resource subclasses. As of now I don't work with large files, so this is not as pressing as it was before. Anyone interested, please create an issue.
Thanks @jnoortheen for the transparency. And I agree, it's a major change that will require work in multiple places.
However, if someone wants to get started, they could start with one piece of functionality and move it to a pluggable backend. No need to have settings or anything. If I get some time soon, I will try to come up with a little working example, as this sounds like a fun challenge. 😍
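As a starting point for discussion, a pluggable backend could look roughly like the sketch below. Everything in it is hypothetical (DatasetBackend, CsvBackend, JsonBackend, BACKENDS and export_rows are illustrative names, not existing django-import-export APIs), and it uses only the standard library.

```python
import csv
import io
import json
from typing import Dict, Iterable, Iterator, List, Protocol


class DatasetBackend(Protocol):
    """Hypothetical interface that each format backend would implement."""

    def read(self, text: str) -> Iterator[dict]: ...

    def write(self, rows: Iterable[dict], headers: List[str]) -> str: ...


class CsvBackend:
    def read(self, text: str) -> Iterator[dict]:
        # DictReader is lazy, so rows are parsed one at a time.
        return iter(csv.DictReader(io.StringIO(text)))

    def write(self, rows: Iterable[dict], headers: List[str]) -> str:
        buffer = io.StringIO()
        writer = csv.DictWriter(buffer, fieldnames=headers, lineterminator="\n")
        writer.writeheader()
        writer.writerows(rows)
        return buffer.getvalue()


class JsonBackend:
    def read(self, text: str) -> Iterator[dict]:
        return iter(json.loads(text))

    def write(self, rows: Iterable[dict], headers: List[str]) -> str:
        # Keep only the requested headers, in the requested order.
        return json.dumps([{h: row[h] for h in headers} for row in rows])


# Registry keyed by format name; no settings involved.
BACKENDS: Dict[str, DatasetBackend] = {"csv": CsvBackend(), "json": JsonBackend()}


def export_rows(rows: Iterable[dict], headers: List[str], fmt: str) -> str:
    """Dispatch an export to whichever backend handles the format."""
    return BACKENDS[fmt].write(rows, headers)
```

A tablib-, pyexcel- or Pandas-based backend would then only need to provide the same two methods.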
Dear @andrewgy8
Cool package, but it really has too many dependencies. Could the optional formats be split out into separate pluggable pip packages? Just keep the formats covered by the Python 3 core library, csv and json, as the required ones :)
pip install django-import-export
Successfully installed diff-match-patch-20200713 django-import-export-2.4.0 markuppy-1.14 odfpy-1.4.1 tablib-3.0.0 xlrd-1.2.0 xlwt-1.3.0
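One way to keep csv and json as the core formats while treating the others as optional is to detect at runtime which format libraries are installed. The sketch below is only an illustration: the OPTIONAL_FORMATS mapping and the available_formats helper are hypothetical, not how django-import-export or tablib actually select formats.

```python
from importlib.util import find_spec

# Formats served by the standard library are always available.
CORE_FORMATS = ["csv", "json"]

# Hypothetical mapping of optional formats to the module each one would need.
OPTIONAL_FORMATS = {"xls": "xlrd", "xlsx": "openpyxl", "ods": "odf"}


def available_formats():
    """Return the core formats plus any optional ones whose module is importable."""
    formats = list(CORE_FORMATS)
    for fmt, module_name in OPTIONAL_FORMATS.items():
        # find_spec returns None when a top-level module is not installed.
        if find_spec(module_name) is not None:
            formats.append(fmt)
    return formats
```

With this approach, a bare install would offer only csv and json, and installing an extra library would enable the matching format.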
I think we could have some issues now because xlrd no longer supports xlsx files. I don't know if this will definitely be an issue, but it has just hit me on another project.
Closing - see #445 for further discussion on abstracting tablib