
XLSX no longer processing due to xlrd

Open bushong1 opened this issue 2 years ago • 6 comments

So it looks like the messytables dependency uses xlrd for Excel file processing. The latest xlrd no longer supports XLSX files due to, as I understand it, security concerns. messytables appears to be a dead project, with no activity in the last 2 years. This Stack Overflow post says that xlrd should be swapped out for openpyxl, but with messytables unmaintained, that seems unlikely to happen. Is any effort being made to support XLSX files?
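For illustration, here is a rough sketch of what routing by extension could look like: `.xlsx` goes to openpyxl, and legacy `.xls` stays on xlrd (which only reads `.xls` from version 2.0 on). This is not how messytables or Datapusher is structured; `pick_reader` and `iter_rows` are hypothetical names, and the sketch assumes both third-party packages are installed and that only the first sheet matters.

```python
from pathlib import Path


def pick_reader(filename):
    """Return the library assumed to handle this file extension (hypothetical helper)."""
    suffix = Path(filename).suffix.lower()
    if suffix == ".xlsx":
        return "openpyxl"
    if suffix == ".xls":
        return "xlrd"
    raise ValueError(f"unsupported spreadsheet extension: {suffix!r}")


def iter_rows(path):
    """Yield the rows of the first sheet as lists of cell values."""
    reader = pick_reader(path)
    if reader == "openpyxl":
        import openpyxl  # third-party; imported lazily so .xls-only setups still work

        # read_only=True streams rows instead of loading the whole sheet.
        workbook = openpyxl.load_workbook(path, read_only=True, data_only=True)
        try:
            sheet = workbook.worksheets[0]
            for row in sheet.iter_rows(values_only=True):
                yield list(row)
        finally:
            workbook.close()
    else:
        import xlrd  # third-party; 2.0+ reads .xls only

        book = xlrd.open_workbook(path)
        sheet = book.sheet_by_index(0)
        for index in range(sheet.nrows):
            yield sheet.row_values(index)
```

Something like `for row in iter_rows("data.xlsx"): ...` would then work for either format, as long as the matching library is available.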

bushong1 avatar Aug 31 '21 12:08 bushong1

@bushong1 there might be some work from our side to fix this, but it has not yet been confirmed. Also, I'd consider replacing Datapusher with Aircan, but you'd need to create a new DAG for XLSX loading.

anuveyatsu avatar Jan 26 '22 07:01 anuveyatsu

When I try to upload an XLSX file, the state remains "pending" forever, which is odd.

fishbone1 avatar Feb 02 '22 14:02 fishbone1

It seems to me that the option is to replace the messytables dependency with its successor, frictionless.

categulario avatar Mar 23 '22 20:03 categulario

We're also seeing that some .ods files aren't processed well by messytables, essentially causing OOM errors and consuming >4G of memory (among other reasons, it extracts the zipfile into memory and potentially duplicates cells in rows many times to fill a large empty spreadsheet).
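To make the two memory problems concrete, here is a minimal stdlib-only sketch of the alternative: stream `content.xml` out of the .ods archive instead of extracting it into memory, and treat `table:number-columns-repeated` as a count rather than expanding trailing empty cells. It is not messytables code; `iter_ods_rows` is a hypothetical name, and the sketch assumes a plain spreadsheet (it ignores `table:number-rows-repeated`, covered cells, and cell types, and still expands blank runs that sit between filled cells).

```python
import zipfile
import xml.etree.ElementTree as ET

TABLE_NS = "urn:oasis:names:tc:opendocument:xmlns:table:1.0"
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
ROW = f"{{{TABLE_NS}}}table-row"
CELL = f"{{{TABLE_NS}}}table-cell"
PARA = f"{{{TEXT_NS}}}p"
COLS_REPEATED = f"{{{TABLE_NS}}}number-columns-repeated"


def iter_ods_rows(path_or_file):
    """Yield each row of an .ods file as a list of strings, one row at a time."""
    with zipfile.ZipFile(path_or_file) as archive:
        # archive.open() decompresses incrementally; nothing is extracted up front.
        with archive.open("content.xml") as stream:
            for _event, elem in ET.iterparse(stream, events=("end",)):
                if elem.tag != ROW:
                    continue
                row = []
                pending_blanks = 0  # empty cells seen since the last filled cell
                for cell in elem.findall(CELL):
                    text = "\n".join("".join(p.itertext()) for p in cell.findall(PARA))
                    repeat = int(cell.get(COLS_REPEATED, "1"))
                    if text == "":
                        # Count blanks instead of materializing them.
                        pending_blanks += repeat
                        continue
                    row.extend([""] * pending_blanks)
                    pending_blanks = 0
                    row.extend([text] * repeat)
                # Trailing blanks (often thousands of padding columns) are dropped.
                yield row
                elem.clear()  # release the row's cells once it has been consumed
```

Because this is a generator, a caller can process or write out each row as it arrives rather than holding the whole sheet in memory.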

EricSoroos avatar Mar 29 '22 11:03 EricSoroos