datatable
datatable copied to clipboard
Upgrade to xlrd 2.0.0 + openpyxl
Related to #2632
Description
xlrd dropped support for anything but xls
at 2.0.0 https://github.com/python-excel/xlrd/pull/371. Using the old version may cause security vulnerabilities and potential incorrect parsing. It is also a problem for people that have installed pandas in their environment (with, more likely, openpyxl and xlrd>2.0).
Implementation
- Use openpyxl for
xlsx
, the recommended alternative. - openpyxl was added as extra requirements and xlrd was upgraded to >=2.0.0.
I would like to know if this is desirable before writing any unitests.
Thanks!
I implemented some tests in the same fashion as the ones for jay. For them to run on CI, openpyxl has to be installed during the appveyor build (here and in the subsequent pip-installs, I guess).
@carrascomj thanks for your contribution! Nice to know we can't use xlrd
for .xlsx
anymore. Let me review it and also see why some tests are failing on jenkins.
@samukweku If you only adjust the title to [ENH] ...
, it will still be complicated to find PR's like that later. More effective solution is to assign a label improve
. Then all the improvements could be seen as https://github.com/h2oai/datatable/issues?q=%22improve%22+label%3Aimprove
@carrascomj do you have any ideas if the issue mentioned here has ever been fixed? Have you had a chance to test openpyxl
on some larger files?
@carrascomj do you have any ideas if the issue mentioned here has ever been fixed? Have you had a chance to test
openpyxl
on some larger files?
Ups, I had not seen that issue nor the PR associated with it, sorry. In terms of the mentioned 10x performance decrease, it was a problem with an edge-case: the xlsx contained a link to another file (see https://github.com/pandas-dev/pandas/pull/35029#issuecomment-734888490).
With that said, the code in this PR is far from optimal. I will run some benchmarks and see how far openpyxl can go. However, I agree that a custom parser goes more in line with the perfomance (like it was suggested before) expected from datatable.
I tested the performance with master (4 columns, 200000 rows: int, float, datetime, str
) and there is a 2x performance decrease, which is consistent with reported decreases in pandas.
The changes on 977af71 do not change performance but simplify the code.
---------------------------------------------------- benchmark: 2 tests ------------------------------------
Name (time in s) Min Max Mean StdDev Median IQR Outliers OPS Rounds Iterations
------------------------------------------------------------------------------------------------------------
openpyxl_this 19.3181 20.0605 19.6741 0.3141 19.7758 0.5201 2;0 0.0508 5 1
xlrd_upstream 10.6452 10.8720 10.7272 0.0898 10.7248 0.1125 1;0 0.0932 5 1
------------------------------------------------------------------------------------------------------------
Thanks for testing @carrascomj. "2x performance decrease" seems quite significant...