spark-excel
"org.apache.poi.ss.formula.FormulaParseException: Unused input" when saving dataframe
Expected Behavior
The dataframe should be saved to S3.
Current Behavior
Saving the dataframe to S3 fails with the following exception:
```
[2019-07-30 10:09:17,405] ERROR .jobserver.JobManagerActor [] [akka://JobServer/user/context-supervisor/session-context] - Got Throwable
org.apache.poi.ss.formula.FormulaParseException: Unused input [MACHINE (CODE SAMPLE 3000X1,0) BENDING SHEETS, PARTS BARRIER, TABLE PAD LIGHT: 689866399 MS-REF.: IOFIO899683923682 SH 327470] after attempting to parse the formula [ 5 MACHINE (CODE SAMPLE 3000X1,0) BENDING SHEETS, PARTS BARRIER, TABLE PAD LIGHT: 689866399 MS-REF.: IOFIO899683923682 SH 327470]
	at org.apache.poi.ss.formula.FormulaParser.parse(FormulaParser.java:2033)
	at org.apache.poi.ss.formula.FormulaParser.parse(FormulaParser.java:170)
	at org.apache.poi.xssf.usermodel.XSSFCell.setFormula(XSSFCell.java:566)
	at org.apache.poi.xssf.usermodel.XSSFCell.setCellFormula(XSSFCell.java:544)
	at com.norbitltd.spoiwo.natures.xlsx.Model2XlsxConversions$.convertCell(Model2XlsxConversions.scala:68)
	at com.norbitltd.spoiwo.natures.xlsx.Model2XlsxConversions$$anonfun$convertRow$4.apply(Model2XlsxConversions.scala:251)
	at com.norbitltd.spoiwo.natures.xlsx.Model2XlsxConversions$$anonfun$convertRow$4.apply(Model2XlsxConversions.scala:251)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at com.norbitltd.spoiwo.natures.xlsx.Model2XlsxConversions$.convertRow(Model2XlsxConversions.scala:251)
	at com.norbitltd.spoiwo.natures.xlsx.Model2XlsxConversions$$anonfun$convertSheet$1.apply(Model2XlsxConversions.scala:275)
	at com.norbitltd.spoiwo.natures.xlsx.Model2XlsxConversions$$anonfun$convertSheet$1.apply(Model2XlsxConversions.scala:275)
	at scala.collection.immutable.List.foreach(List.scala:381)
	at com.norbitltd.spoiwo.natures.xlsx.Model2XlsxConversions$.convertSheet(Model2XlsxConversions.scala:275)
	at com.norbitltd.spoiwo.natures.xlsx.Model2XlsxConversions$$anonfun$convertWorkbook$4.apply(Model2XlsxConversions.scala:470)
	at com.norbitltd.spoiwo.natures.xlsx.Model2XlsxConversions$$anonfun$convertWorkbook$4.apply(Model2XlsxConversions.scala:470)
	at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
	at com.norbitltd.spoiwo.natures.xlsx.Model2XlsxConversions$.convertWorkbook(Model2XlsxConversions.scala:470)
	at com.norbitltd.spoiwo.natures.xlsx.Model2XlsxConversions$XlsxWorkbook.convertAsXlsx(Model2XlsxConversions.scala:557)
	at com.norbitltd.spoiwo.natures.xlsx.Model2XlsxConversions$XlsxSheet.convertAsXlsx(Model2XlsxConversions.scala:534)
	at com.crealytics.spark.excel.ExcelFileSaver.save(ExcelFileSaver.scala:38)
	at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:66)
	at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
	at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
	at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
	at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:656)
	at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:656)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
	at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:656)
	at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:273)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:267)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:225)
	at spark.jobserver.JobManagerActor$$anonfun$getJobFuture$4.apply(JobManagerActor.scala:594)
	at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
	at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
```
Possible Solution
The code probably assumes that a cell contains a formula based on the cell content. If so, making this behavior configurable would be a possible solution, so that we can opt out of formula inference.
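To illustrate the kind of inference suspected here, a minimal sketch in plain Scala (the object, method names, and the `inferFormulas` flag are hypothetical illustrations, not spark-excel's actual code):

```scala
// Hypothetical sketch of content-based formula inference, as the reporter
// suspects happens somewhere in the spark-excel -> Spoiwo -> POI chain.
// All names here are illustrative only.
object FormulaInference {
  // A cell is treated as a formula if its text starts with '='.
  def looksLikeFormula(value: String): Boolean =
    value != null && value.trim.startsWith("=")

  // With an opt-out flag, such cells could instead be written as plain strings.
  def cellType(value: String, inferFormulas: Boolean): String =
    if (inferFormulas && looksLikeFormula(value)) "formula" else "string"
}
```

Under this sketch, `= 5 MACHINE PCS TIPPED 846` would be handed to POI's formula parser, which then fails because the text is not a valid formula.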
Steps to Reproduce (for bugs)
Copy and paste the following value into a CSV file, load it into a Spark dataframe, and try to save it:
5 MACHINE (CODE SAMPLE 3000X1,0) BENDING SHEETS, PARTS BARRIER, TABLE PAD LIGHT: 689866399 MS-REF.: IOFIO899683923682 SH 327470
Code used to save the dataframe:
```scala
dfToSave
  .coalesce(1)
  .write.format("com.crealytics.spark.excel")
  .option("useHeader", "true")
  .mode("overwrite")
  .save(s3path)
```
Context
A user fails to export files from our service.
Your Environment
```scala
scalaVersion := "2.11.11"
sparkVersion := "2.2.0"

libraryDependencies ++= Seq(
  "spark.jobserver" %% "job-server-api" % "0.8.0" % "provided",
  "spark.jobserver" %% "job-server-extras" % "0.8.0" % "provided",
  "org.apache.spark" %% "spark-sql" % sparkVersion % "provided",
  "com.crealytics" %% "spark-excel" % "0.11.0" % "provided",
  "org.apache.hadoop" % "hadoop-aws" % "2.7.3" % "provided",
  "com.amazonaws" % "aws-java-sdk" % "1.7.4" % "provided"
)
```
Spark standalone mode. Running in AWS inside docker image.
Can you try 0.12.0?
Hi! I just tried with 0.12.0 and the same issue occurs. I forgot to mention that the "problematic" cell starts with an equals sign (spark-excel probably infers that it is a formula in that case). Example cell content that fails to be exported: `= 5 MACHINE PCS TIPPED 846`
We don't do anything special with cells that start with `=`, so that logic has to be either in Spoiwo or POI.
As a workaround, you could try prepending a `'` to the entire column or to the problematic entries.
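That workaround can be sketched as a small escaping helper, applied to the affected column before writing (a sketch only; the helper name is an assumption, not part of spark-excel):

```scala
// Workaround sketch: prepend Excel's classic "treat as text" prefix (a
// single quote) to any value that starts with '=', so it is written as a
// literal string rather than parsed as a formula.
object FormulaEscape {
  def escape(value: String): String =
    if (value != null && value.trim.startsWith("=")) "'" + value else value
}
```

In Spark, this could be applied per column via a UDF before writing, e.g. `df.withColumn("description", udf(FormulaEscape.escape _)(col("description")))`, where `description` stands in for whichever column contains the problematic values.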
Hi, I'm also getting this exception when a cell value starts with `=`. I'm using the version below:
"com.crealytics" % "spark-excel_2.12" % "0.13.5"
Is it possible to expose a configuration option to explicitly state that my processing doesn't require any formula evaluation, even when a cell starts with `=`?
Unfortunately, I can't invest time in this. If you can come up with a PR, I'd be happy to review it.
Sure @nightscape. When I have some time, I will try it out and let you know if I'm able to achieve this.
(edited, my bad. Sorry to santilytics)
Hi @endriju and @sasincj, I'm trying to build a test case for this issue. From your description, is this the content of the CSV file?
5 MACHINE (CODE SAMPLE 3000X1,0) BENDING SHEETS, PARTS BARRIER, TABLE PAD LIGHT: 689866399 MS-REF.: IOFIO899683923682 SH 327470
However, in another message you mentioned
= 5 MACHINE PCS TIPPED 846
so I think I might have missed something. If possible, could you please share the part of your CSV file, with sensitive data removed, that reproduces this issue? Sincerely,
Hi @quanghgx, I'm not sure why I've been mentioned here. The previous post was from almost a year ago, so I'm afraid I wouldn't remember the details even if I had been mentioned previously.
Oh, my bad. I mistagged your id. I am sorry.