spark-excel
Error reading xlsx file (MIN_INFLATE_RATIO exceeded)
Hello, I read some xlsx files from an S3 bucket with Spark 2.4.4 and the crealytics 2.11:0.13.1 artifact. One of them raises an error despite having correct content (tested with LibreOffice).
It seems to hit a compression-ratio limit, and the error message suggests changing a zip-library parameter.
Is there a way to change this ZipSecureFile.setMinInflateRatio() through the artifact?
adv2 = spark.read.format("com.crealytics.spark.excel") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("test.xlsx")
Error message:
Traceback (most recent call last):
File "/tmp/aws-adv-daily.py", line 304, in main
"header", "true").option("inferSchema", "true").load("test.xlsx")
File "/usr/spark-2.4.4/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 166, in load
return self._df(self._jreader.load(path))
File "/usr/spark-2.4.4/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/usr/spark-2.4.4/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/usr/spark-2.4.4/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o37.load.
: java.io.IOException: Zip bomb detected! The file would exceed the max. ratio of compressed file size to the size of the expanded data.
This may indicate that the file is used to inflate memory usage and thus could pose a security risk.
You can adjust this limit via ZipSecureFile.setMinInflateRatio() if you need to work with files which exceed this limit.
Uncompressed size: 106496, Raw/compressed size: 859, ratio: 0.008066
Limits: MIN_INFLATE_RATIO: 0.010000, Entry: xl/styles.xml
at shadeio.poi.openxml4j.util.ZipArchiveThresholdInputStream.checkThreshold(ZipArchiveThresholdInputStream.java:131)
at shadeio.poi.openxml4j.util.ZipArchiveThresholdInputStream.read(ZipArchiveThresholdInputStream.java:81)
at shadeio.poi.util.IOUtils.toByteArray(IOUtils.java:152)
at shadeio.poi.util.IOUtils.toByteArray(IOUtils.java:121)
at shadeio.poi.util.IOUtils.toByteArray(IOUtils.java:108)
at shadeio.poi.openxml4j.util.ZipArchiveFakeEntry.<init>(ZipArchiveFakeEntry.java:47)
at shadeio.poi.openxml4j.util.ZipInputStreamZipEntrySource.<init>(ZipInputStreamZipEntrySource.java:53)
at shadeio.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:106)
at shadeio.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:301)
at shadeio.poi.xssf.usermodel.XSSFWorkbookFactory.createWorkbook(XSSFWorkbookFactory.java:134)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at shadeio.poi.ss.usermodel.WorkbookFactory.createWorkbook(WorkbookFactory.java:339)
at shadeio.poi.ss.usermodel.WorkbookFactory.createXSSFWorkbook(WorkbookFactory.java:314)
at shadeio.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:232)
at shadeio.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:198)
at com.crealytics.spark.excel.DefaultWorkbookReader$$anonfun$openWorkbook$1.apply(WorkbookReader.scala:50)
at com.crealytics.spark.excel.DefaultWorkbookReader$$anonfun$openWorkbook$1.apply(WorkbookReader.scala:50)
at scala.Option.fold(Option.scala:158)
at com.crealytics.spark.excel.DefaultWorkbookReader.openWorkbook(WorkbookReader.scala:50)
at com.crealytics.spark.excel.WorkbookReader$class.withWorkbook(WorkbookReader.scala:14)
at com.crealytics.spark.excel.DefaultWorkbookReader.withWorkbook(WorkbookReader.scala:46)
at com.crealytics.spark.excel.ExcelRelation.excerpt$lzycompute(ExcelRelation.scala:30)
at com.crealytics.spark.excel.ExcelRelation.excerpt(ExcelRelation.scala:30)
at com.crealytics.spark.excel.ExcelRelation.headerColumns$lzycompute(ExcelRelation.scala:104)
at com.crealytics.spark.excel.ExcelRelation.headerColumns(ExcelRelation.scala:103)
at com.crealytics.spark.excel.ExcelRelation$$anonfun$inferSchema$1.apply(ExcelRelation.scala:168)
at com.crealytics.spark.excel.ExcelRelation$$anonfun$inferSchema$1.apply(ExcelRelation.scala:167)
at scala.Option.getOrElse(Option.scala:121)
at com.crealytics.spark.excel.ExcelRelation.inferSchema(ExcelRelation.scala:167)
at com.crealytics.spark.excel.ExcelRelation.<init>(ExcelRelation.scala:34)
at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:40)
at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:18)
at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:12)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Thanks for your help
Interesting.. Have you tried to set this ratio?
ZipSecureFile.setMinInflateRatio(0)
adv2 = spark.read \
    .format("com.crealytics.spark.excel") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("test.xlsx")
Yes, I tried this first, but the command was unrecognized. Is there an import I have to add first?
We are also facing the same issue: we need to read a huge Excel file, 20 worksheets of 700K each. However, we are using pyspark, and ZipSecureFile.setMinInflateRatio(0) is not available in Python. Do let me know if anyone has found a solution to this in pyspark (not Scala)?
In addition, I have tried the following, but the ratio is not getting set:
ZipSecureFile.setMinInflateRatio(0)
val data = spark.read.format("com.crealytics.spark.excel")
.option("header", value = true)
.option("inferSchema", "true")
.option("sheetName", "sheetname1")
.option("maxRowsInMemory", 4)
.load("./data/sample.xlsx")
data.printSchema()
Error:
Caused by: java.io.IOException: Zip bomb detected! The file would exceed the max. ratio of compressed file size to the size of the expanded data.
This may indicate that the file is used to inflate memory usage and thus could pose a security risk.
You can adjust this limit via ZipSecureFile.setMinInflateRatio() if you need to work with files which exceed this limit.
Uncompressed size: 973759, Raw/compressed size: 9728, ratio: 0.009990
Limits: MIN_INFLATE_RATIO: 0.010000, Entry: xl/styles.xml
You might have more luck searching for/reporting this issue upstream, e.g. in Apache POI: https://bz.apache.org/bugzilla/buglist.cgi?product=POI
Hello,
If you happen to find a solution, I would really love to know it, as I face the same issue here ;(
I solved it in Spark Scala (in Python, I couldn't set minInflateRatio on ZipSecureFile to 0). I was making the mistake of using the native POI ZipSecureFile; the version the library actually leverages is the shaded one from shadeio, which is a bit confusing during import.
DON'T USE THIS --> import org.apache.poi.openxml4j.util.ZipSecureFile
USE THIS --> import shadeio.poi.openxml4j.util.ZipSecureFile
However, on large Excel files I still find spark-excel much slower than pandas (Python), so we are proceeding with a pyspark implementation: parse the large Excel file in pandas, convert the pandas DataFrame to a Spark DataFrame, and proceed with computations as necessary.
Hey @nuarc, great you figured it out! The shadeio thing comes from the fact that we need to shade (i.e. rename packages from) some libraries in order to prevent dependency hell with libraries that come packaged with Spark or other libraries. I haven't done much performance tuning on spark-excel, I think many people here are using this library much more intensively than we do ourselves :wink:
Hello,
I figured out why we can't use ZipSecureFile in Python.
https://docs.databricks.com/release-notes/cluster-images/2.0.2-db4.html
As that link shows, there are lots of libraries that aren't pre-installed for Python.
I've done the test in Scala and it works, but not in Python. It's complete nonsense to me.
I don't know how to make it work in Python, but I'll figure it out soon, I guess... Thank you for your answers, it helped me a lot!
Hello,
I've made a Python workaround. It's not the proper way, but I can't invest the time to replace crealytics, parse the Excel XML content myself, or rewrite my whole pipeline in Scala (xlsx files are just XML in zip files).
import zipfile, os, shutil

# [...spark initialization and other processing...]

try:
    # normal files are processed here
    df = spark.read.format("com.crealytics.spark.excel") \
        .option("header", "true").option("inferSchema", "true").load(myfile_xlsx)
except Exception:
    # triggered by "zip bomb" files (fortunately they are not actual zip bombs;
    # they are produced by on-premise systems)
    with zipfile.ZipFile(myfile_xlsx, 'r') as zipObj:
        # unzip and collect the xlsx content
        files = zipObj.namelist()
        zipObj.extractall('temp')
    os.remove(myfile_xlsx)
    with zipfile.ZipFile(myfile_xlsx, "w", zipfile.ZIP_DEFLATED) as z:
        # recreate the xlsx file with a lower compression level
        for file in files:
            z.write("temp/" + file, file, compresslevel=2)
    shutil.rmtree("temp")
    # Now crealytics works
    df = spark.read.format("com.crealytics.spark.excel") \
        .option("header", "true").option("inferSchema", "true").load(myfile_xlsx)
By default, the xlsx compression level is around 6. Lowering it to 2 or 3 did the trick for me (from 0 = no compression to 9 = highest compression).
By the way, I'm interested in Scala submodules if they are callable from pyspark (like UDF functions).
Hope this helps
EDIT: Note that compresslevel is a new parameter in Python 3.7.
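To see why the re-zip trick works, here is a self-contained sketch (standard library only, with synthetic data standing in for a real workbook) that measures the per-entry compressed/uncompressed ratio POI checks against MIN_INFLATE_RATIO:

```python
import io
import zipfile

def entry_ratio(buf):
    """Compressed-to-uncompressed size ratio of the first zip entry.

    POI rejects entries whose ratio falls below MIN_INFLATE_RATIO (0.01 by
    default), which is what happens with a highly repetitive xl/styles.xml.
    """
    with zipfile.ZipFile(buf) as z:
        info = z.infolist()[0]
        return info.compress_size / info.file_size

def make_zip(data, level):
    """Zip `data` as a single entry at the given deflate level (Python 3.7+)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED, compresslevel=level) as z:
        z.writestr("xl/styles.xml", data)
    buf.seek(0)
    return buf

# Repetitive payload, standing in for a styles sheet full of identical records.
data = b'<xf numFmtId="0" fontId="0"/>' * 10000

ratio_high = entry_ratio(make_zip(data, 9))  # aggressive compression, tiny ratio
ratio_low = entry_ratio(make_zip(data, 2))   # the workaround's level, larger ratio
```

Re-zipping at a lower level makes each entry's ratio larger, which is how the rewritten file gets back over POI's threshold.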
You're a god. Thank you for this code, I'll test it when I have time.
Hello @Ftagn92, I am getting an error that the file is not a zip file. I have an .xlsx file.
@rohitthapliyal180 could it be an .xls file that has just been renamed to .xlsx? Try renaming it to .xls and see if Excel can open it.
Also, you can try the streaming version of the reader by adding the "maxRowsInMemory" option.
Hi @rohitthapliyal180, @Ftagn92 and @Zulpitch,
I know it's been a long time already. If possible, could you share an Excel file (or the steps to create one) that exhibits this "MIN_INFLATE_RATIO exceeded" issue? I would like to add it to the test suite.
Appreciate your help.
This can possibly be fixed by this - https://stackoverflow.com/questions/46796874/java-io-ioexception-failed-to-read-zip-entry-source.
Note that spark-excel shades the POI classes. org.apache.poi gets shaded to shadeio.poi
Adding something like this to your code:
shadeio.poi.openxml4j.util.ZipSecureFile.setMinInflateRatio(0)
I tried to import:
import shadeio.poi.openxml4j.util.ZipSecureFile
but got "module not found".
Please let me know if there's any other solution in pyspark for reading large Excel files.
Hey,
if you have 5 minutes, I can post the solution I found to avoid that problem. Give me 5 minutes to start my professional PC ;)
Hey @Projyots
OK, so first you have to understand that
import shadeio.poi.openxml4j.util.ZipSecureFile
is not available with Python. At least not natively, even with the import.
To make it work, I used the magic:
%scala
because there are far more libraries native to Scala than to Python.
What I did was import the .xlsx file in Scala, export it as CSV, then import it back in Python. I'm sure you can do much better than I did, for example by using Delta tables, but I don't want to update my solution, as it works well for now:
// Imports
import org.apache.spark.sql._
import shadeio.poi.openxml4j.util.ZipSecureFile
// This library isn't available in Python. If we don't set the ratio to 0, it shows the "Zip Bomb Detected" exception.
ZipSecureFile.setMinInflateRatio(0);
// Set the udb root value
val udbRoot = "/mnt/nrb"
// Set the file paths
val blendEuFilePath = udbRoot + "/NaturalRubber/Referentials/T_blocage_EUR.xlsx"
val blendEuCsvFilePath = udbRoot + "/NaturalRubber/Referentials/Blending EU.csv"
// Import the excel file
val blendEuDf = spark.read.format("com.crealytics.spark.excel").option("dataAddress", "ventilation!").option("Header", "false").load(blendEuFilePath)
// Export the dataframe as CSV to reimport it in Python
blendEuDf.repartition(1).write.mode("overwrite").format("com.databricks.spark.csv").option("header", "true").option("delimiter",";").save(blendEuCsvFilePath + ".dir")
// Extract the csv file
val listFiles = dbutils.fs.ls(blendEuCsvFilePath + ".dir/")
for (subFiles <- listFiles)
if (subFiles.name.takeRight(4) == ".csv"){
dbutils.fs.cp(blendEuCsvFilePath + ".dir/" + subFiles.name, blendEuCsvFilePath)
}
dbutils.fs.rm(blendEuCsvFilePath + ".dir/", recurse=true)
If your file is big, you can also use
.option("maxRowsInMemory", 100)
when you import the .xlsx; it'll help you get the dataframe faster.
If you've done the same as me, just import your file back from the CSV with:
spark.read.csv(blendEuCsvFilePath, sep=";", inferSchema=True, header=True)
I really insist that this solution isn't optimal, but if you don't have anything else, it can give you a workaround.
Hope this helps !
Sorry for not answering faster, I didn't see your question before... :(
Hi quandhgx, sure. I can't create one myself; it's a real file with real data. Can I send it to you by private mail or something?
Hi @Ftagn92 ,
Appreciate your offer.
However, we will put it, or part of it, into a unit test anyway. To be honest, it is too much for me to receive your non-public data. Please take some time to clear out the sensitive data first.
Thanks once again, Ftagn92 and have a nice weekend.
I understand. Unfortunately, any manipulation of the xlsx file (with Excel or any compatible application like LibreOffice) results in a new file with no error (as in my workaround posted on May 5th). There is no solution other than using the real file directly; I have no process to create or recreate a file below the MIN_INFLATE_RATIO.
Thanks anyway
Hi @Zulpitch et al,
I tried the same method, but it's too slow with the Scala dataframe.
New approach:
1. Read the file in pandas
2. Convert the sheet to a CSV file
3. Read the CSV into a Spark dataframe
To read through pandas in ADLS:
1. import tempfile
2. Copy the files from ADLS to a temp directory
3. Read and process them in the temp dir
4. Copy the processed files back to ADLS
5. Continue with the rest of the script
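The pandas detour described above can be condensed into a small helper. This is a sketch under assumptions: pandas plus an Excel engine (e.g. openpyxl) are installed on the driver, and the sheet fits in driver memory, since pandas reads it all at once.

```python
def excel_via_pandas(spark, path, sheet_name=0):
    """Parse a workbook with pandas (which does not enforce POI's
    MIN_INFLATE_RATIO check) and hand the result to Spark.

    Sketch only: assumes pandas and an Excel engine such as openpyxl are
    installed, and loads the whole sheet into driver memory.
    """
    import pandas as pd  # imported lazily so the helper can be defined without pandas
    pdf = pd.read_excel(path, sheet_name=sheet_name)
    # createDataFrame accepts a pandas DataFrame directly
    return spark.createDataFrame(pdf)
```

This skips POI entirely, so no ZipSecureFile tweak is needed, at the cost of losing Spark-side streaming of the workbook.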
@rohitthapliyal180 could it be an .xls file that just has been renamed to .xlsx? Try renaming it to .xls and see if Excel can open it.
This worked for me, thanks!
If you have an issue with import shadeio.poi.openxml4j.util.ZipSecureFile,
try import org.apache.poi.openxml4j.util.ZipSecureFile,
then use ZipSecureFile.setMinInflateRatio(0);
I fixed my zip bomb issue with that.
Many creative workarounds for pySpark in this thread!
I would be interested in a more native solution for pySpark, such as incorporating the MinInflateRatio setting into the read options, like maxByteArraySize.
Does this idea make sense? Would it help if I contributed a patch?
@coryvisi an extra option in the ExcelOptions class would be useful. See https://github.com/crealytics/spark-excel/pull/628