
Error reading xlsx file (MIN_INFLATE_RATIO exceeded)

Open Ftagn92 opened this issue 4 years ago • 28 comments

Hello, I'm reading some xlsx files from an S3 bucket with Spark 2.4.4 and the crealytics 2.11:0.13.1 artifact. One of them raises an error despite having correct content (tested with LibreOffice).

It seems to hit a compression limit, and the error message suggests changing a zip library parameter.

Is there a way to call ZipSecureFile.setMinInflateRatio() through the artifact?

adv2 = spark.read.format("com.crealytics.spark.excel") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("test.xlsx")

Error message :

Traceback (most recent call last):
  File "/tmp/aws-adv-daily.py", line 304, in main
    "header", "true").option("inferSchema", "true").load("test.xlsx")
  File "/usr/spark-2.4.4/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 166, in load
    return self._df(self._jreader.load(path))
  File "/usr/spark-2.4.4/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/usr/spark-2.4.4/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/usr/spark-2.4.4/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o37.load.
: java.io.IOException: Zip bomb detected! The file would exceed the max. ratio of compressed file size to the size of the expanded data.
This may indicate that the file is used to inflate memory usage and thus could pose a security risk.
You can adjust this limit via ZipSecureFile.setMinInflateRatio() if you need to work with files which exceed this limit.
Uncompressed size: 106496, Raw/compressed size: 859, ratio: 0.008066
Limits: MIN_INFLATE_RATIO: 0.010000, Entry: xl/styles.xml
	at shadeio.poi.openxml4j.util.ZipArchiveThresholdInputStream.checkThreshold(ZipArchiveThresholdInputStream.java:131)
	at shadeio.poi.openxml4j.util.ZipArchiveThresholdInputStream.read(ZipArchiveThresholdInputStream.java:81)
	at shadeio.poi.util.IOUtils.toByteArray(IOUtils.java:152)
	at shadeio.poi.util.IOUtils.toByteArray(IOUtils.java:121)
	at shadeio.poi.util.IOUtils.toByteArray(IOUtils.java:108)
	at shadeio.poi.openxml4j.util.ZipArchiveFakeEntry.<init>(ZipArchiveFakeEntry.java:47)
	at shadeio.poi.openxml4j.util.ZipInputStreamZipEntrySource.<init>(ZipInputStreamZipEntrySource.java:53)
	at shadeio.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:106)
	at shadeio.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:301)
	at shadeio.poi.xssf.usermodel.XSSFWorkbookFactory.createWorkbook(XSSFWorkbookFactory.java:134)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at shadeio.poi.ss.usermodel.WorkbookFactory.createWorkbook(WorkbookFactory.java:339)
	at shadeio.poi.ss.usermodel.WorkbookFactory.createXSSFWorkbook(WorkbookFactory.java:314)
	at shadeio.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:232)
	at shadeio.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:198)
	at com.crealytics.spark.excel.DefaultWorkbookReader$$anonfun$openWorkbook$1.apply(WorkbookReader.scala:50)
	at com.crealytics.spark.excel.DefaultWorkbookReader$$anonfun$openWorkbook$1.apply(WorkbookReader.scala:50)
	at scala.Option.fold(Option.scala:158)
	at com.crealytics.spark.excel.DefaultWorkbookReader.openWorkbook(WorkbookReader.scala:50)
	at com.crealytics.spark.excel.WorkbookReader$class.withWorkbook(WorkbookReader.scala:14)
	at com.crealytics.spark.excel.DefaultWorkbookReader.withWorkbook(WorkbookReader.scala:46)
	at com.crealytics.spark.excel.ExcelRelation.excerpt$lzycompute(ExcelRelation.scala:30)
	at com.crealytics.spark.excel.ExcelRelation.excerpt(ExcelRelation.scala:30)
	at com.crealytics.spark.excel.ExcelRelation.headerColumns$lzycompute(ExcelRelation.scala:104)
	at com.crealytics.spark.excel.ExcelRelation.headerColumns(ExcelRelation.scala:103)
	at com.crealytics.spark.excel.ExcelRelation$$anonfun$inferSchema$1.apply(ExcelRelation.scala:168)
	at com.crealytics.spark.excel.ExcelRelation$$anonfun$inferSchema$1.apply(ExcelRelation.scala:167)
	at scala.Option.getOrElse(Option.scala:121)
	at com.crealytics.spark.excel.ExcelRelation.inferSchema(ExcelRelation.scala:167)
	at com.crealytics.spark.excel.ExcelRelation.<init>(ExcelRelation.scala:34)
	at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:40)
	at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:18)
	at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:12)
	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318)
	at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)

Thanks for your help

Ftagn92 avatar Apr 06 '20 17:04 Ftagn92

Interesting.. Have you tried to set this ratio?

ZipSecureFile.setMinInflateRatio(0)
adv2 = spark.read
  .format("com.crealytics.spark.excel")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("test.xlsx")


EnverOsmanov avatar Apr 06 '20 18:04 EnverOsmanov

Yes, I tried that first, but the command was unrecognized. Is there an import I have to add first?

Ftagn92 avatar Apr 07 '20 02:04 Ftagn92

Interesting.. Have you tried to set this ratio?

ZipSecureFile.setMinInflateRatio(0)
adv2 = spark.read
  .format("com.crealytics.spark.excel")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("test.xlsx")


We are also facing the same issue: we need to read a huge Excel file with 20 worksheets of 700K rows each. However, we are using pyspark, and ZipSecureFile.setMinInflateRatio(0) is not available in Python. Let me know if anyone has found a solution to this in pyspark (not Scala)?

nuarc avatar Apr 28 '20 20:04 nuarc

In addition, I have tried the following, but the ratio is not getting set.

    ZipSecureFile.setMinInflateRatio(0)

    val data = spark.read.format("com.crealytics.spark.excel")
      .option("header", value = true)
      .option("inferSchema", "true")
      .option("sheetName", "sheetname1")
      .option("maxRowsInMemory", 4)
      .load("./data/sample.xlsx")
    data.printSchema()

Error:

Caused by: java.io.IOException: Zip bomb detected! The file would exceed the max. ratio of compressed file size to the size of the expanded data.
This may indicate that the file is used to inflate memory usage and thus could pose a security risk.
You can adjust this limit via ZipSecureFile.setMinInflateRatio() if you need to work with files which exceed this limit.
Uncompressed size: 973759, Raw/compressed size: 9728, ratio: 0.009990
Limits: MIN_INFLATE_RATIO: 0.010000, Entry: xl/styles.xml

nuarc avatar Apr 30 '20 07:04 nuarc

You might have more luck searching for/reporting this issue upstream, e.g. in Apache POI: https://bz.apache.org/bugzilla/buglist.cgi?product=POI

nightscape avatar Apr 30 '20 09:04 nightscape

Hello,

If you happen to find a solution, I would really love to know it, as I'm facing the same issue here ;(

Zulpitch avatar Apr 30 '20 13:04 Zulpitch

I solved it in Spark Scala (in Python I couldn't set minInflateRatio to 0 on ZipSecureFile). I was making the mistake of using the native POI version of ZipSecureFile, when the class this library actually leverages is the shaded one under shadeio. The import is a bit confusing.

DON'T use this: import org.apache.poi.openxml4j.util.ZipSecureFile
USE this: import shadeio.poi.openxml4j.util.ZipSecureFile

However, on large Excel files I still find spark-excel much slower than pandas (Python), so we are proceeding with a pyspark implementation: parse the large Excel file in pandas, convert the pandas DataFrame to a Spark DataFrame, and proceed with computations as necessary.

nuarc avatar May 01 '20 20:05 nuarc

Hey @nuarc, great you figured it out! The shadeio thing comes from the fact that we need to shade (i.e. rename packages from) some libraries in order to prevent dependency hell with libraries that come packaged with Spark or other libraries. I haven't done much performance tuning on spark-excel, I think many people here are using this library much more intensively than we do ourselves :wink:

nightscape avatar May 01 '20 23:05 nightscape

Hello,

I figured out why we can't use the ZipSecureFile in Python.

https://docs.databricks.com/release-notes/cluster-images/2.0.2-db4.html

As that link shows, there are lots of libraries that aren't pre-installed for Python.

I've done the test in Scala and it works, but not in Python. It's complete nonsense to me.

I don't know how to make it work in Python, but I'll figure it out soon, I guess... Thank you for your answers, they helped me a lot!

Zulpitch avatar May 04 '20 13:05 Zulpitch

Hello,

I've made a Python workaround. It's not the proper way, but I can't invest the time to replace crealytics, parse the Excel XML content myself, or rewrite my whole pipeline in Scala (xlsx files are just XML in zip containers).

import zipfile, os, shutil

# [... Spark initialization and other processing ...]

try:
    # Normal files are processed here
    df = spark.read.format("com.crealytics.spark.excel") \
        .option("header", "true").option("inferSchema", "true").load(myfile_xlsx)
except Exception:
    # Triggered by "zip bomb" files (fortunately they are not actual bombs;
    # they are produced by on-premise systems)
    with zipfile.ZipFile(myfile_xlsx, 'r') as zipObj:
        # Unzip and collect the xlsx content
        files = zipObj.namelist()
        zipObj.extractall('temp')
    os.remove(myfile_xlsx)
    with zipfile.ZipFile(myfile_xlsx, "w", zipfile.ZIP_DEFLATED) as z:
        # Recreate the xlsx file with a lower compression level
        for file in files:
            z.write("temp/" + file, file, compresslevel=2)
    shutil.rmtree("temp")
    # Now crealytics can read it
    df = spark.read.format("com.crealytics.spark.excel") \
        .option("header", "true").option("inferSchema", "true").load(myfile_xlsx)

By default, the xlsx compression level is around 6. Lowering it to 2 or 3 did the trick for me (levels range from 0 = no compression to 9 = highest compression).

Btw, I'm interested in whether the Scala submodules are callable from pyspark (like UDF functions).

Hope this helps

EDIT: Note that compresslevel is new in Python 3.7.

Ftagn92 avatar May 05 '20 12:05 Ftagn92
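The rezip trick above can be exercised outside Spark. A minimal standalone sketch (function and file names are illustrative, not from the thread) that rewrites an arbitrary zip archive at a lower DEFLATE level and verifies the contents survive intact:

```python
import os
import shutil
import zipfile

def rezip_lower_compression(path, compresslevel=2, workdir="temp_rezip"):
    """Rewrite a zip archive in place at a lower DEFLATE level.
    Less aggressive compression means a larger compressed size, which
    raises the compressed/uncompressed ratio (the thread reports this
    was enough to clear POI's MIN_INFLATE_RATIO check)."""
    with zipfile.ZipFile(path, "r") as zf:
        names = zf.namelist()
        zf.extractall(workdir)
    os.remove(path)
    with zipfile.ZipFile(path, "w", zipfile.ZIP_DEFLATED) as zf:
        for name in names:
            zf.write(os.path.join(workdir, name), name, compresslevel=compresslevel)
    shutil.rmtree(workdir)

# Demo: build a highly compressible archive (like a big styles.xml),
# rezip it, and verify the contents round-trip unchanged.
payload = "<style/>" * 10000
with zipfile.ZipFile("demo.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("xl/styles.xml", payload)

rezip_lower_compression("demo.zip")

with zipfile.ZipFile("demo.zip") as zf:
    restored = zf.read("xl/styles.xml").decode()
assert restored == payload
```

This only recompresses; the XML content is untouched, so the workbook stays valid for any reader.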

You're a god. Thank you for this code, I'll test it when I have time.

Zulpitch avatar May 05 '20 12:05 Zulpitch

Hello @Ftagn92, I am getting an error that the file is not a zip file. I have an .xlsx file.

rohitthapliyal180 avatar Jul 06 '21 14:07 rohitthapliyal180

@rohitthapliyal180 could it be an .xls file that just has been renamed to .xlsx? Try renaming it to .xls and see if Excel can open it.

nightscape avatar Jul 07 '21 08:07 nightscape

Also, you can try the streaming version of the reader by adding the "maxRowsInMemory" option.

EnverOsmanov avatar Jul 07 '21 11:07 EnverOsmanov

Hi @rohitthapliyal180 @Ftagn92 and @Zulpitch

I know it's been a long time already. If it is possible for you to share or step to create one excel file with this "MIN_INFLATE_RATIO exceeded" issue. I would like to add it to the test-suite.

Appreciate your help

quanghgx avatar Jul 08 '21 13:07 quanghgx

This can possibly be fixed by this - https://stackoverflow.com/questions/46796874/java-io-ioexception-failed-to-read-zip-entry-source.

Note that spark-excel shades the POI classes. org.apache.poi gets shaded to shadeio.poi

Adding something like this to your code:

shadeio.poi.openxml4j.util.ZipSecureFile.setMinInflateRatio(0)

pjfanning avatar Sep 18 '21 12:09 pjfanning
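For pyspark users, the shaded class may be reachable without any Scala cell: py4j exposes driver-side JVM classes through the gateway. A hedged sketch (untested here; it assumes the crealytics jar, with its shaded `shadeio` packages, is on the driver classpath):

```python
# Call the static setter on the shaded POI class via the py4j gateway.
# setMinInflateRatio takes a double; py4j converts the Python float.
spark.sparkContext._jvm.shadeio.poi.openxml4j.util.ZipSecureFile.setMinInflateRatio(0.0)

df = (spark.read.format("com.crealytics.spark.excel")
      .option("header", "true")
      .option("inferSchema", "true")
      .load("test.xlsx"))
```

Note this only changes the setting in the driver JVM; if the workbook is opened on executors (e.g. some streaming or partitioned read paths), it may not take effect there.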

I tried the import:

import shadeio.poi.openxml4j.util.ZipSecureFile

But I got a "module not found" error.

Please let me know if there's any other solution in pyspark for reading large excel files.

Projyots avatar Nov 25 '21 18:11 Projyots

Hey,

If you have 5 mins, I can post the solution I found to avoid that problem. Give me 5 mins to start my professional PC ;)

Zulpitch avatar Nov 25 '21 18:11 Zulpitch

Hey @Projyots

Ok, so first, you have to understand that

import shadeio.poi.openxml4j.util.ZipSecureFile

is not available in Python, at least not natively, even with the import. To make it work, I used the %scala magic command, because far more libraries are natively available in Scala than in Python.

What I did was import the .xlsx file in Scala, export it as CSV, then read it back in Python. I'm sure you can do much better than I did, for example by using Delta tables, but I don't want to update my solution, as it works well for now:

// Imports
import org.apache.spark.sql._
import shadeio.poi.openxml4j.util.ZipSecureFile

// This library isn't available in Python. If we don't set the ratio to 0, it shows the "Zip Bomb Detected" exception.
ZipSecureFile.setMinInflateRatio(0);

// Set the udb root value
val udbRoot = "/mnt/nrb"

// Set the file paths
val blendEuFilePath = udbRoot + "/NaturalRubber/Referentials/T_blocage_EUR.xlsx"
val blendEuCsvFilePath = udbRoot + "/NaturalRubber/Referentials/Blending EU.csv"

// Import the excel file
val blendEuDf = spark.read.format("com.crealytics.spark.excel")
  .option("dataAddress", "ventilation!")
  .option("Header", "false")
  .load(blendEuFilePath)

// Export the dataframe as CSV to reimport it in Python
blendEuDf.repartition(1).write.mode("overwrite")
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("delimiter", ";")
  .save(blendEuCsvFilePath + ".dir")

// Extract the csv file
val listFiles = dbutils.fs.ls(blendEuCsvFilePath + ".dir/")
for (subFiles <- listFiles)
  if (subFiles.name.takeRight(4) == ".csv"){
    dbutils.fs.cp(blendEuCsvFilePath + ".dir/" + subFiles.name, blendEuCsvFilePath)
  }

dbutils.fs.rm(blendEuCsvFilePath + ".dir/", recurse=true)

If your file is big, you can also add .option("maxRowsInMemory", 100) when you import the .xlsx; it speeds up building the dataframe.

If you've done the same as me, just read your file back from the CSV with: spark.read.csv(blendEuCsvFilePath, sep=";", inferSchema=True, header=True)

I really insist that this solution isn't optimal, but if you don't have anything else, it can give you a workaround.

Hope this helps !

Zulpitch avatar Nov 25 '21 18:11 Zulpitch

Hi @rohitthapliyal180 @Ftagn92 and @Zulpitch

I know it's been a long time already. If it is possible for you to share or step to create one excel file with this "MIN_INFLATE_RATIO exceeded" issue. I would like to add it to the test-suite.

Appreciate your help

Sorry for not answering faster, I didn't see your question before... :(

Zulpitch avatar Nov 25 '21 18:11 Zulpitch

Hi @rohitthapliyal180 @Ftagn92 and @Zulpitch

I know it's been a long time already. If it is possible for you to share or step to create one excel file with this "MIN_INFLATE_RATIO exceeded" issue. I would like to add it to the test-suite.

Appreciate your help

Hi @quanghgx, sure. I can't create one by myself; it's a real file with real data. Can I send it to you by private mail or whatever?

Ftagn92 avatar Dec 10 '21 12:12 Ftagn92

Hi @Ftagn92 ,

Appreciate your offer.

However, we will put it, or part of it, into a unit test anyway. To be honest, it is too much for me to receive your non-public data. Please take some time to clear out the sensitive data first.

Thanks once again, Ftagn92 and have a nice weekend.

quanghgx avatar Dec 10 '21 13:12 quanghgx

Hi @Ftagn92 ,

Appreciate your offer.

However, we will put it, or part of it, into unit-test anyway. To be honest, it is too much for me to receive your nonpublic data. Please help take some time to clear sensitive data.

Thanks once again, Ftagn92 and have a nice weekend.

I understand. Unfortunately, any manipulation of the xlsx file (with Excel or any compatible application like LibreOffice) produces a new file that raises no error (as in my workaround posted on May 5th). There is no solution other than using the real file directly; I have no process to create or recreate a file below the MIN_INFLATE_RATIO.

Thanks anyway

Ftagn92 avatar Dec 10 '21 13:12 Ftagn92

Hi, @Zulpitch et al,

I tried the same method, but it's too slow with the Scala dataframe.

New approach:

1. Read the sheet in pandas
2. Convert the sheet to a CSV file
3. Read the CSV into a Spark dataframe

To read through pandas on ADLS:

1. import tempfile
2. Copy the files from ADLS to a temp directory
3. Read and process them in the temp dir
4. Copy the processed files back to ADLS
5. Continue with the rest of the script

Projyots avatar Jan 15 '22 12:01 Projyots
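A condensed sketch of the pandas-to-CSV bridge described above (untested; the paths and sheet selection are placeholders, pandas needs an Excel engine such as openpyxl installed, and openpyxl's plain zipfile-based reader does not enforce POI's inflate-ratio check, which is why this route works):

```python
import os
import shutil
import tempfile

import pandas as pd

adls_xlsx = "/mnt/data/large_file.xlsx"  # placeholder ADLS mount paths
adls_csv = "/mnt/data/large_file.csv"

with tempfile.TemporaryDirectory() as tmp:
    # 1. Copy from ADLS to a local temp dir
    local_xlsx = os.path.join(tmp, "large_file.xlsx")
    shutil.copy(adls_xlsx, local_xlsx)

    # 2./3. Parse with pandas and convert the sheet to CSV
    pdf = pd.read_excel(local_xlsx, sheet_name=0)
    local_csv = os.path.join(tmp, "large_file.csv")
    pdf.to_csv(local_csv, index=False)

    # 4. Copy the processed file back to ADLS *before* the temp dir vanishes,
    # since Spark reads lazily and must still find the file at action time
    shutil.copy(local_csv, adls_csv)

# 5. Continue in Spark from the CSV
df = spark.read.csv(adls_csv, header=True, inferSchema=True)
```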

@rohitthapliyal180 could it be an .xls file that just has been renamed to .xlsx? Try renaming it to .xls and see if Excel can open it.

This worked for me, thanks!

Avinnaash avatar Apr 07 '22 03:04 Avinnaash

If you have an issue with import shadeio.poi.openxml4j.util.ZipSecureFile, try import org.apache.poi.openxml4j.util.ZipSecureFile instead, then use ZipSecureFile.setMinInflateRatio(0);

I fixed my zip bomb issue with that.

PricklyBramblewood avatar Apr 24 '23 19:04 PricklyBramblewood

Many creative workarounds for pySpark in this thread!

I would be interested in a more native solution for pySpark, such as incorporating the MinInflateRatio setting into the read options, like maxByteArraySize.

Does this idea make sense? Would it help if I contributed a patch?

coryvisi avatar Jul 06 '23 14:07 coryvisi

@coryvisi an extra option in the ExcelOptions class would be useful. See https://github.com/crealytics/spark-excel/pull/628

pjfanning avatar Jul 06 '23 14:07 pjfanning