PyExcelerate

Large dataframes containing strings generate invalid xlsx files

Open EdGaere opened this issue 2 years ago • 15 comments

For large dataframes containing strings with 500k rows and 60+ columns, pyexcelerate generates invalid .xlsx files that cannot be opened in Excel. Excel gives the error message: "We found a problem with some content in yourfile.xlsx. Do you want us to try to recover as much as we can?". Answering yes removes all invalid content, which yields an empty worksheet.

from string import ascii_lowercase
from uuid import uuid4

from numpy import random  # random.choice(..., size=...) is NumPy's API
from pandas import DataFrame
from pyexcelerate import Workbook

num_rows = 501000
num_cols = 64

# generate a dataframe of random single-character strings
column_names = [f"Col{idx}" for idx in range(num_cols)]
df = DataFrame(random.choice(list(ascii_lowercase), size=(num_rows, num_cols)), columns=column_names)

# write to file
output_filename = f"/tmp/{str(uuid4())}.xlsx"

wb = Workbook()
ws = wb.new_sheet("sheet name", data=df.values.tolist())
wb.save(output_filename)

The xlsx file is successfully written to disk and is 138 MB in size, but it triggers the above-mentioned error when opened in Excel. Further inspection when opening the file in Python reveals that the zip archive is invalid.
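For reference, the "invalid zip archive" finding can be checked directly from Python with the standard library (a minimal sketch; pass it whatever path wb.save() produced):

```python
import zipfile


def check_xlsx(path: str) -> bool:
    """Return True if the file looks like a structurally valid .xlsx archive."""
    # An .xlsx file is a zip archive; first check the zip header...
    if not zipfile.is_zipfile(path):
        return False
    with zipfile.ZipFile(path) as zf:
        # ...then verify every member's CRC. testzip() returns the name of
        # the first corrupt member, or None if the archive is intact.
        return zf.testzip() is None
```

A corrupt file produced by the repro above fails either the header check or the CRC pass.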

Interestingly, the problem only occurs for large dataframes filled with strings. The problem does not occur:

  • For smaller dataframes
  • For dataframes of the same size filled only with integers

Versions:

  • OS: macOS Catalina 10.15.7; Darwin Kernel Version 19.6.0
  • PyExcelerate==0.10.0
  • pandas==1.3.5

EdGaere avatar Mar 02 '22 15:03 EdGaere

Have you tried splitting across multiple worksheets? While technically allowed by the specification, Excel's behavior above 16384 columns is not guaranteed: https://support.microsoft.com/en-us/office/excel-specifications-and-limits-1672b34d-7043-467e-8e27-269d656771c3

kevmo314 avatar Mar 02 '22 15:03 kevmo314

Hey Kevin,

Thanks for the quick response! I am only writing 64 columns... so that is way below Excel's 16K limit.

In terms of rows, the problem occurs at around 500K rows, which is also below Excel's 1M limit.

Edward

EdGaere avatar Mar 02 '22 15:03 EdGaere

Ah apologies, I misread the message. That's a bit odd, I wonder if the worksheets support having that many inline strings. The relevant line is here: https://github.com/kz26/PyExcelerate/blob/dev/pyexcelerate/Writer.py#L78

If the zip file being generated is invalid, then we can probably open it up and see what's being generated; hopefully something sticks out as incorrect. If you can take a look, a contribution would be super welcome; otherwise I'll try to take a look sometime soon.

kevmo314 avatar Mar 02 '22 15:03 kevmo314

When writing the same dataframe with Pandas/openpyxl, there are no problems with the generated file. So it appears Excel can support the volume of inline strings.

Of course, openpyxl takes 3x longer and consumes 10x more memory.

EdGaere avatar Mar 02 '22 16:03 EdGaere

If I recall correctly, openpyxl creates a table of strings rather than inlining them, so instead of writing 500000*60 strings, openpyxl will only write the 60 or so unique ones. PyExcelerate relies on the deflate algorithm to reduce the file size; we profiled the string-table approach and maintaining that table took a lot longer than it was worth. So they're not quite directly comparable. I suspect the issue is probably somewhere in the produced worksheet XML.

kevmo314 avatar Mar 02 '22 16:03 kevmo314

Understood. I just realised that in the example I provided there are only 26 unique string values (string.ascii_lowercase) across all 500,000 * 64 cells.

df = DataFrame(random.choice(list(ascii_lowercase), size=(num_rows, num_cols)),
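The distinct-value count can be confirmed at a smaller scale with the same construction as the repro (a quick sketch; the 1000x64 shape is arbitrary):

```python
from string import ascii_lowercase

from numpy import random
from pandas import DataFrame

# Same construction as the repro, just fewer rows
df = DataFrame(random.choice(list(ascii_lowercase), size=(1000, 64)))

# Flatten all cells and count distinct values
unique_values = set(df.values.ravel())
print(len(unique_values))  # at most 26, one per lowercase letter
```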

EdGaere avatar Mar 02 '22 16:03 EdGaere

Is there a fix or workaround for this yet? I am also facing the same issue. In my case, I have around 1.4M rows. So I am splitting them into two. I am writing the first 750K in one sheet and the rest in the second sheet. But it generates invalid excel and when MSExcel tries to restore the data, it shows both sheets completely empty.

[image]
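For the record, the splitting approach described above can be sketched as follows (chunk size and sheet names are arbitrary choices, not part of the library's API); per the report, it still produces an invalid file once the total volume is large enough:

```python
def chunk_bounds(n_rows: int, rows_per_sheet: int = 750_000):
    """Yield (start, stop) row ranges, one per worksheet."""
    for start in range(0, n_rows, rows_per_sheet):
        yield start, min(start + rows_per_sheet, n_rows)


def write_chunked(df, path, rows_per_sheet=750_000):
    # Split a large dataframe across worksheets to stay below Excel's
    # ~1,048,576-rows-per-sheet limit.
    from pyexcelerate import Workbook  # assumes pyexcelerate is installed

    wb = Workbook()
    for i, (start, stop) in enumerate(chunk_bounds(len(df), rows_per_sheet), 1):
        wb.new_sheet(f"Sheet{i}", data=df.values[start:stop].tolist())
    wb.save(path)
```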

bsam-parsionate avatar May 02 '22 07:05 bsam-parsionate

I also have a large dataset that is generating invalid xlsx files; however, I'm able to open and repair the files. Excel lists the repaired records as: "Repaired Records: String properties from /xl/worksheets/sheet1.xml part".

jonathancyu avatar Jun 13 '22 16:06 jonathancyu

I am having a similar problem. I am generating a very large file with about 8M rows and about 10 columns with a mix of short text strings (~10 chars) and small integers. The resulting file is about 430 MB. Trying to read the file in using openpyxl gives:

zipfile.BadZipFile: File is not a zip file

Versions:

  • Python 3.9.9
  • pyexcelerate: 0.10.0
  • openpyxl: 3.0.10

ejhorow avatar Jun 24 '22 15:06 ejhorow

> Is there a fix or workaround for this yet? I am also facing the same issue. In my case, I have around 1.4M rows. So I am splitting them into two. I am writing the first 750K in one sheet and the rest in the second sheet. But it generates invalid excel and when MSExcel tries to restore the data, it shows both sheets completely empty.
>
> [image]

This seems like an issue with the xml generation. I'm happy to review a pull request and it seems that a fix would be welcome as it affects quite a few people.

kevmo314 avatar Jun 27 '22 16:06 kevmo314

Changing line 78 of Writer.py to with zf.open("xl/worksheets/sheet%s.xml" % (index), mode="w", force_zip64=True) as f: allows those xml files to be written. Unfortunately, opening the resulting file gives the error "Microsoft Excel is waiting for another application to complete an OLE action". You can get around this by killing and reopening Excel, opening a blank sheet, and navigating to the file from the Document Recovery pane.

jonathancyu avatar Jun 27 '22 23:06 jonathancyu

I created a PR to add some logging, and add the force_zip64 change above. https://github.com/kz26/PyExcelerate/pull/178

If you're running into this issue, give that PR a shot and see if there's any additional debug information that gets produced. I suspect an error is being swallowed.

kevmo314 avatar Jul 01 '22 14:07 kevmo314

This change works for me

with zf.open("xl/worksheets/sheet%s.xml" % (index), mode="w", force_zip64=True) as f:

And I can open the file in excel as well.

The issue seems to be a default 2 GiB per-member limit in Python's zipfile library. From the documentation:

> When writing a file, if the file size is not known in advance but may exceed 2 GiB, pass force_zip64=True to ensure that the header format is capable of supporting large files. If the file size is known in advance, construct a ZipInfo object with file_size set, and use that as the name parameter.

Meanwhile, without force_zip64=True I get this error: file size unexpectedly exceeded ZIP64 limit
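The documented behavior can be exercised directly with the standard library. This sketch streams data into a single zip member the same way Writer.py does, with the zip64-capable header forced up front (function and parameter names here are illustrative, not PyExcelerate's API):

```python
import zipfile


def write_member_streamed(zip_path: str, member: str, chunks) -> None:
    # Open a member for streaming writes. force_zip64=True emits a header
    # that can describe members larger than 2 GiB even though the final
    # size is not known in advance (mirrors the patched Writer.py line).
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        with zf.open(member, mode="w", force_zip64=True) as f:
            for chunk in chunks:
                f.write(chunk)
```

Without force_zip64=True, the same streaming write raises RuntimeError once a member grows past the 2 GiB threshold.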

Nishad290 avatar Jul 05 '22 11:07 Nishad290

I solved it by making the same change: with zf.open("xl/worksheets/sheet%s.xml" % (index), mode="w", force_zip64=True) as f:

After the file is created, I rename it to .zip and then repair it with this command on Linux:

zip -F myfileexcel.zip --out myfileexcel.xlsx

It's not the best practice, but I believe someone will improve this.

stgcorpbr avatar Jul 21 '22 13:07 stgcorpbr