sxl Is it possible to read files by chunk (read file from flask stream)

Open MohamedLEGH opened this issue 4 years ago • 4 comments

Hello, thanks for your work. I want to parse an Excel file by chunk on my Flask web server (I want to read the first 1000 lines). I read the file from the HTTP stream, so I only receive it chunk by chunk. I tried to use the sxl library but I received an error: raise BadZipFile("File is not a zip file"). I think that is because the zipfile library needs the full file in order to read the headers stored at the end of the file. Does anyone know a workaround?

Nov 20 '20 13:11 MohamedLEGH

Hi, this library is great! There really is no other efficient way to read XLS(X) files.

Is it possible to read only a chunk of the file (like the first 4096 KB) and get the first couple of rows from it?

Aug 10 '22 20:08 arun-bedrock

Hi - glad you like it and thanks for the feedback :) Getting the first few rows from a standard xlsx file is easy (this is the example given in the README):

from sxl import Workbook
wb = Workbook("filepath")
ws = wb.sheets['sheet name'] # or, for example, wb.sheets[1]
for row in ws.rows:
    print(row)

This won't do any more work than it needs to and will be very fast even on huge files. That being said, I'm not sure how to address the original question (or, I think, yours). When you don't receive the zip file all at once (only in chunks), this approach won't work. I'm not sure how to solve that problem other than queuing up the chunks until you have the whole zip, then running it through this library.
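As a rough sketch of the "queue up the zip until you have everything" idea, the snippet below writes incoming byte chunks to a temporary file and only then treats it as a zip. The helper names (`spool_chunks_to_file`, `make_demo_zip`, `split_into_chunks`) are made up for illustration, and the chunks here are simulated; in a Flask view they could presumably come from something like `iter(lambda: request.stream.read(65536), b"")`, which is an assumption and not something tested in this thread.

```python
import io
import tempfile
import zipfile


def spool_chunks_to_file(chunks, suffix=".xlsx"):
    """Write an iterable of byte chunks to a temp file and return its path.

    Hypothetical helper: the caller is responsible for deleting the file.
    """
    # delete=False so the file survives after the handle is closed.
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        for chunk in chunks:
            tmp.write(chunk)
        return tmp.name


def make_demo_zip():
    """Build a tiny zip in memory as a stand-in for a real xlsx upload."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("xl/workbook.xml", "<workbook/>")
    return buf.getvalue()


def split_into_chunks(data, size):
    """Simulate a network stream by slicing bytes into fixed-size chunks."""
    return [data[i:i + size] for i in range(0, len(data), size)]
```

Once every chunk has been written, the returned path can be handed to `Workbook(path)` exactly as in the example above, and wrapping `ws.rows` in `itertools.islice(ws.rows, 1000)` would stop after the first 1000 rows.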

Aug 11 '22 22:08 ktr

Did some profiling btw: reading a header from a 10 MB XLS takes 3.26 seconds with SXL vs 11 seconds with Openpyxl in read-only mode and 23 seconds with PyExcel. It's significantly faster! Thanks again for creating this library!

Thanks, I'm using that approach and it works great! Do you know why it's not possible to work with a chunk of an Excel file (like the first X bytes) and get a good guess at what the first couple of rows might be?

Aug 11 '22 22:08 arun-bedrock

Glad to hear it :) You can't read the (e.g.,) first 50 bytes to determine the first few rows because xlsx files are zip files (e.g., open them up with 7-zip and review the contents) and zip files don't store data the same way a "normal file" does. I.e., zip changes things around to be "efficient" to store, but that means it's a bit more complex to extract the information. Hope that helps and good luck!
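To make that concrete, here is a small sketch using only Python's standard `zipfile` module (not sxl itself). A zip archive keeps its index, the central directory, at the end of the file, so a prefix of the archive is missing it and cannot be opened. The function names are made up for illustration, and the archive is a toy stand-in for a real xlsx.

```python
import io
import zipfile


def build_zip_bytes():
    """Create a small compressed zip in memory (stand-in for an xlsx file)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("xl/worksheets/sheet1.xml", "<row>1</row>" * 1000)
    return buf.getvalue()


def can_open_as_zip(data):
    """Return True if zipfile can open the bytes and list their contents."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            zf.namelist()
        return True
    except zipfile.BadZipFile:
        # Raised when the end-of-archive record cannot be found.
        return False


full = build_zip_bytes()
partial = full[: len(full) // 2]  # simulate having only the first chunk

print(can_open_as_zip(full))
print(can_open_as_zip(partial))
```

The truncated copy fails with the same `BadZipFile` error reported in the original question, which is why the whole file has to arrive before the library can read it.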

Aug 12 '22 00:08 ktr
