datatable icon indicating copy to clipboard operation
datatable copied to clipboard

Upgrade to xlrd 2.0.0 + openpyxl

Open carrascomj opened this issue 3 years ago • 6 comments

Related to #2632

Description

xlrd dropped support for anything but xls at 2.0.0 https://github.com/python-excel/xlrd/pull/371. Using the old version may cause security vulnerabilities and potential incorrect parsing. It is also a problem for people that have installed pandas in their environment (with, more likely, openpyxl and xlrd>2.0).

Implementation

  • Use openpyxl for xlsx, the recommended alternative.
  • openpyxl was added as extra requirements and xlrd was upgraded to >=2.0.0.

I would like to know if this is desirable before writing any unitests.

Thanks!

carrascomj avatar Oct 28 '21 22:10 carrascomj

I implemented some tests in the same fashion as the ones for jay. For them to run on CI, openpyxl has to be installed during the appveyor build (here and in the subsequent pip-installs, I guess).

carrascomj avatar Oct 30 '21 13:10 carrascomj

@carrascomj thanks for your contribution! Nice to know we can't use xlrd for .xlsx anymore. Let me review it and also see why some tests are failing on jenkins.

@samukweku If you only adjust the title to [ENH] ..., it will still be complicated to find PR's like that later. More effective solution is to assign a label improve. Then all the improvements could be seen as https://github.com/h2oai/datatable/issues?q=%22improve%22+label%3Aimprove

oleksiyskononenko avatar Nov 01 '21 19:11 oleksiyskononenko

@carrascomj do you have any ideas if the issue mentioned here has ever been fixed? Have you had a chance to test openpyxl on some larger files?

oleksiyskononenko avatar Nov 04 '21 01:11 oleksiyskononenko

@carrascomj do you have any ideas if the issue mentioned here has ever been fixed? Have you had a chance to test openpyxl on some larger files?

Ups, I had not seen that issue nor the PR associated with it, sorry. In terms of the mentioned 10x performance decrease, it was a problem with an edge-case: the xlsx contained a link to another file (see https://github.com/pandas-dev/pandas/pull/35029#issuecomment-734888490).

With that said, the code in this PR is far from optimal. I will run some benchmarks and see how far openpyxl can go. However, I agree that a custom parser goes more in line with the perfomance (like it was suggested before) expected from datatable.

carrascomj avatar Nov 04 '21 11:11 carrascomj

I tested the performance with master (4 columns, 200000 rows: int, float, datetime, str) and there is a 2x performance decrease, which is consistent with reported decreases in pandas.

The changes on 977af71 do not change performance but simplify the code.

---------------------------------------------------- benchmark: 2 tests ------------------------------------
Name (time in s)        Min      Max     Mean  StdDev   Median     IQR  Outliers     OPS  Rounds  Iterations
------------------------------------------------------------------------------------------------------------
openpyxl_this       19.3181  20.0605  19.6741  0.3141  19.7758  0.5201       2;0  0.0508       5           1
xlrd_upstream       10.6452  10.8720  10.7272  0.0898  10.7248  0.1125       1;0  0.0932       5           1
------------------------------------------------------------------------------------------------------------

carrascomj avatar Nov 04 '21 21:11 carrascomj

Thanks for testing @carrascomj. "2x performance decrease" seems quite significant...

oleksiyskononenko avatar Nov 09 '21 18:11 oleksiyskononenko