
"org.apache.poi.ss.formula.FormulaParseException: Unused input" when saving dataframe

osevce opened this issue 5 years ago · 9 comments

Expected Behavior

The dataframe should be saved to S3.

Current Behavior

Saving the dataframe to S3 fails with the following exception:

[2019-07-30 10:09:17,405] ERROR .jobserver.JobManagerActor [] [akka://JobServer/user/context-supervisor/session-context] - Got Throwable
org.apache.poi.ss.formula.FormulaParseException: Unused input [MACHINE (CODE SAMPLE 3000X1,0) BENDING SHEETS, PARTS BARRIER, TABLE PAD LIGHT: 689866399 MS-REF.: IOFIO899683923682 SH 327470] after attempting to parse the formula [ 5 MACHINE (CODE SAMPLE 3000X1,0) BENDING SHEETS, PARTS BARRIER, TABLE PAD LIGHT: 689866399 MS-REF.: IOFIO899683923682 SH 327470]
  at org.apache.poi.ss.formula.FormulaParser.parse(FormulaParser.java:2033)
  at org.apache.poi.ss.formula.FormulaParser.parse(FormulaParser.java:170)
  at org.apache.poi.xssf.usermodel.XSSFCell.setFormula(XSSFCell.java:566)
  at org.apache.poi.xssf.usermodel.XSSFCell.setCellFormula(XSSFCell.java:544)
  at com.norbitltd.spoiwo.natures.xlsx.Model2XlsxConversions$.convertCell(Model2XlsxConversions.scala:68)
  at com.norbitltd.spoiwo.natures.xlsx.Model2XlsxConversions$$anonfun$convertRow$4.apply(Model2XlsxConversions.scala:251)
  at com.norbitltd.spoiwo.natures.xlsx.Model2XlsxConversions$$anonfun$convertRow$4.apply(Model2XlsxConversions.scala:251)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at com.norbitltd.spoiwo.natures.xlsx.Model2XlsxConversions$.convertRow(Model2XlsxConversions.scala:251)
  at com.norbitltd.spoiwo.natures.xlsx.Model2XlsxConversions$$anonfun$convertSheet$1.apply(Model2XlsxConversions.scala:275)
  at com.norbitltd.spoiwo.natures.xlsx.Model2XlsxConversions$$anonfun$convertSheet$1.apply(Model2XlsxConversions.scala:275)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at com.norbitltd.spoiwo.natures.xlsx.Model2XlsxConversions$.convertSheet(Model2XlsxConversions.scala:275)
  at com.norbitltd.spoiwo.natures.xlsx.Model2XlsxConversions$$anonfun$convertWorkbook$4.apply(Model2XlsxConversions.scala:470)
  at com.norbitltd.spoiwo.natures.xlsx.Model2XlsxConversions$$anonfun$convertWorkbook$4.apply(Model2XlsxConversions.scala:470)
  at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
  at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
  at com.norbitltd.spoiwo.natures.xlsx.Model2XlsxConversions$.convertWorkbook(Model2XlsxConversions.scala:470)
  at com.norbitltd.spoiwo.natures.xlsx.Model2XlsxConversions$XlsxWorkbook.convertAsXlsx(Model2XlsxConversions.scala:557)
  at com.norbitltd.spoiwo.natures.xlsx.Model2XlsxConversions$XlsxSheet.convertAsXlsx(Model2XlsxConversions.scala:534)
  at com.crealytics.spark.excel.ExcelFileSaver.save(ExcelFileSaver.scala:38)
  at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:66)
  at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
  at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
  at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
  at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:656)
  at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:656)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
  at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:656)
  at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:273)
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:267)
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:225)
  at spark.jobserver.JobManagerActor$$anonfun$getJobFuture$4.apply(JobManagerActor.scala:594)
  at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
  at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at java.lang.Thread.run(Thread.java:748)

Possible Solution

The code is probably assuming that a cell contains a formula based on the cell content. If so, making that behavior configurable would be a possible solution, so that we can opt out of formula inference.
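A sketch of the kind of inference the reporter suspects (an assumption on my part — the actual decision is made inside Spoiwo/POI and is not shown in this thread, and the `inferFormulas` flag is hypothetical): a writer that classifies "formula vs. text" purely from the leading `=` will hand free text to POI's formula parser, which then fails; an opt-out flag would route such values to the plain-text path instead.

```scala
// Hypothetical illustration of content-based formula inference.
// Real spark-excel/Spoiwo code may differ; this only shows why an
// opt-out configuration would help.
sealed trait CellKind
case object FormulaCell extends CellKind
case object TextCell extends CellKind

def inferKind(raw: String, inferFormulas: Boolean = true): CellKind =
  if (inferFormulas && raw.trim.startsWith("=")) FormulaCell // sent to the formula parser
  else TextCell                                              // written as literal text
```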

Steps to Reproduce (for bugs)

Copy and paste the value below into a CSV file, load it into a Spark dataframe, and try to save it:

5 MACHINE (CODE SAMPLE 3000X1,0) BENDING SHEETS, PARTS BARRIER, TABLE PAD LIGHT: 689866399 MS-REF.: IOFIO899683923682 SH 327470

Code used to save the dataframe:

dfToSave
  .coalesce(1)
  .write.format("com.crealytics.spark.excel")
  .option("useHeader", "true")
  .mode("overwrite")
  .save(s3path)

Context

Users fail to export files from our service.

Your Environment


scalaVersion := "2.11.11"
sparkVersion := "2.2.0"

libraryDependencies ++= Seq(
  "spark.jobserver" %% "job-server-api" % "0.8.0" % "provided",
  "spark.jobserver" %% "job-server-extras" % "0.8.0" % "provided",
  "org.apache.spark" %% "spark-sql" % sparkVersion % "provided",
  "com.crealytics" %% "spark-excel" % "0.11.0" % "provided",
  "org.apache.hadoop" % "hadoop-aws" % "2.7.3" % "provided",
  "com.amazonaws" % "aws-java-sdk" % "1.7.4" % "provided"
)

Spark standalone mode. Running in AWS inside docker image.

osevce avatar Jul 30 '19 10:07 osevce

Can you try 0.12.0?

nightscape avatar Jul 30 '19 12:07 nightscape

Hi! Just tried with 0.12.0; the same issue occurs. I forgot to mention that the "problematic" cell starts with an equals sign (crealytics probably infers that it's a formula in that case). Example cell content that fails to be exported: = 5 MACHINE PCS TIPPED 846

osevce avatar Jul 31 '19 16:07 osevce

We don't do anything special with cells that start with =, so that logic has to be either in Spoiwo or POI. As a workaround, you could try prepending a ' to the entire column or the problematic entries.
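The suggested workaround can be sketched as a small transformation applied to the affected values before writing (a sketch, not part of spark-excel; the helper name `escapeLeadingEquals` is mine):

```scala
// Prefix a single quote so the value is stored as literal text instead of
// being handed to POI's formula parser. Apply this to the problematic
// column (or every string column) before calling .write.
def escapeLeadingEquals(value: String): String =
  if (value != null && value.trim.startsWith("=")) "'" + value
  else value
```

With Spark's DataFrame API this could be applied via a UDF, e.g. `val escape = org.apache.spark.sql.functions.udf(escapeLeadingEquals _)` and then `df.withColumn("description", escape(col("description")))`, where `description` stands in for the problematic column name.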

nightscape avatar Jul 31 '19 19:07 nightscape

Hi, I'm also getting this exception when a cell value starts with =. I'm using the version mentioned below: "com.crealytics" % "spark-excel_2.12" % "0.13.5". Is it possible to expose a configuration option to explicitly state that my processing doesn't require any formula evaluation, even if a value starts with =?

sasincj avatar Sep 24 '20 12:09 sasincj

Unfortunately, I can't invest time in this... If you can come up with a PR, I'd be happy to review.

nightscape avatar Sep 24 '20 13:09 nightscape

Sure @nightscape. When I have some time, I will try it out and let you know if I'm able to achieve this.

sasincj avatar Sep 24 '20 13:09 sasincj

(edited, my bad. Sorry to santilytics)

Hi @endriju and @sasincj Trying to build a test-case for this issue. From your description, is this the content of the CSV file?

5 MACHINE (CODE SAMPLE 3000X1,0) BENDING SHEETS, PARTS BARRIER, TABLE PAD LIGHT: 689866399 MS-REF.: IOFIO899683923682 SH 327470

However, in another message, you mentioned

= 5 MACHINE PCS TIPPED 846

so I think I might have missed something. If possible, could you share the part of your CSV file (after removing sensitive data) that reproduces this issue? Sincerely,

quanghgx avatar Aug 24 '21 15:08 quanghgx

hi @quanghgx, not sure why I've been mentioned here, but the previous post was from almost a year ago, so I'm afraid I wouldn't remember even if I had been mentioned previously

santilytics avatar Aug 24 '21 16:08 santilytics

oh, my bad. I mistagged your id. I am sorry.

quanghgx avatar Aug 24 '21 16:08 quanghgx