spark-excel
[BUG] Issue while writing a DataFrame to xlsx format
Is there an existing issue for this?
- [X] I have searched the existing issues
Current Behavior
The query is issued against Snowflake and produces the data; I am simply writing the result to one directory in S3 in xlsx format. It fails for both the V2 API and the regular one. Spark version: 3.1.2, Scala: 2.12, spark-excel version: 3.2.1_0.17.1
Exception in thread "main" shadeio.poi.ss.formula.FormulaParseException: Parse error near char 0 '' in specified formula ''. Expected cell ref or constant literal
Expected Behavior
It should be able to write to Excel format without any problem, since Snowflake is producing the result as needed. Why does this have to be parsed again in Excel?
SQL snippet which is causing the problem:
SELECT
    CASE
        WHEN RA.CUSTOMER_MC_SUPPL > 99 THEN RA.CUSTOMER_MC_NAME || RA.CUSTOMER_MC_SUPPL
        WHEN RA.CUSTOMER_MC_SUPPL BETWEEN 10 AND 99 THEN RA.CUSTOMER_MC_NAME || '0' || RA.CUSTOMER_MC_SUPPL
        ELSE RA.CUSTOMER_MC_NAME || '00' || RA.CUSTOMER_MC_SUPPL
    END CUSTOMER,
    RA.CONTRACT_NUMBER,
Writing code:
val combinedDf = dfSrcQuery.toDF()
  .coalesce(1)
  .write
  .format("com.crealytics.spark.excel") // Or .format("excel") for V2 implementation
  .option("dataAddress", "'Sheet1'!A1")
  .option("header", "true")
  .option("dateFormat", "yy-mm-dd")
  .option("timestampFormat", "yyyy-MM-dd HH:mm")
  .mode("overwrite") // Optional, default: overwrite.
  .save("s3a://sp_csb_customer_rates_list_snf/sp_csb_customer_rates_list.xlsx")
Steps To Reproduce
No response
Environment
- Spark version: 3.1.2
- Spark-Excel version: 3.2.1_0.17.1
- OS: Linux
- Cluster environment: Kubernetes cluster (EKS)
Anything else?
Stack trace:
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:481)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:457)
at org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:106)
at org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:93)
at org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:91)
at org.apache.spark.sql.execution.QueryExecution.assertCommandExecuted(QueryExecution.scala:128)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:848)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:382)
at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:355)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:239)
at com.hlag.dmsa.bcm.params.sp_csb_customer_rates_list_snf$.delayedEndpoint$com$hlag$dmsa$bcm$params$sp_csb_customer_rates_list_snf$1(sp_csb_customer_rates_list_snf.scala:95)
at com.hlag.dmsa.bcm.params.sp_csb_customer_rates_list_snf$delayedInit$body.apply(sp_csb_customer_rates_list_snf.scala:6)
at scala.Function0.apply$mcV$sp(Function0.scala:39)
at scala.Function0.apply$mcV$sp$(Function0.scala:39)
at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:17)
at scala.App.$anonfun$main$1$adapted(App.scala:80)
at scala.collection.immutable.List.foreach(List.scala:431)
at scala.App.main(App.scala:80)
at scala.App.main$(App.scala:78)
at com.hlag.dmsa.bcm.params.sp_csb_customer_rates_list_snf$.main(sp_csb_customer_rates_list_snf.scala:6)
at com.hlag.dmsa.bcm.params.sp_csb_customer_rates_list_snf.main(sp_csb_customer_rates_list_snf.scala)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:955)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1043)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1052)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Some thoughts on this:
- The Spark version you run (3.1.2) does not match the Spark version your spark-excel build targets (3.2.1). Please try 3.1.2_0.17.1; see the dependency sketch below this list.
- The stack trace shows that the issue is in Catalyst, but the exception comes from POI. Is that the complete trace?
- If you just export as CSV, does that work as expected?
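For reference, a minimal sbt sketch for pinning the matching build; this assumes Scala 2.12 and the usual com.crealytics coordinates:

// Sketch: depend on the spark-excel build whose version prefix matches Spark 3.1.2.
// Verify the exact coordinates against the release you actually need.
libraryDependencies += "com.crealytics" %% "spark-excel" % "3.1.2_0.17.1"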
@christianknoepfle @nightscape My mistake, sorry, fat-fingered it; it's a typo. The versions I am using are Spark: 3.2.1, Scala: 2.12.13, Java: Azul 11, so this is compatible.
Full stack trace: The query runs against Snowflake, which returns some 90k rows, and the problem occurs when it tries to write in Excel format (.xlsx).
===================
Exception in thread "main" shadeio.poi.ss.formula.FormulaParseException: Parse error near char 0 '' in specified formula ''. Expected cell ref or constant literal
at shadeio.poi.ss.formula.FormulaParser.expected(FormulaParser.java:272)
at shadeio.poi.ss.formula.FormulaParser.parseSimpleFactor(FormulaParser.java:1558)
at shadeio.poi.ss.formula.FormulaParser.percentFactor(FormulaParser.java:1511)
at shadeio.poi.ss.formula.FormulaParser.powerFactor(FormulaParser.java:1498)
at shadeio.poi.ss.formula.FormulaParser.Term(FormulaParser.java:1872)
at shadeio.poi.ss.formula.FormulaParser.additiveExpression(FormulaParser.java:1999)
at shadeio.poi.ss.formula.FormulaParser.concatExpression(FormulaParser.java:1983)
at shadeio.poi.ss.formula.FormulaParser.comparisonExpression(FormulaParser.java:1940)
at shadeio.poi.ss.formula.FormulaParser.intersectionExpression(FormulaParser.java:1913)
at shadeio.poi.ss.formula.FormulaParser.unionExpression(FormulaParser.java:1894)
at shadeio.poi.ss.formula.FormulaParser.parse(FormulaParser.java:2041)
at shadeio.poi.ss.formula.FormulaParser.parse(FormulaParser.java:173)
at shadeio.poi.xssf.usermodel.XSSFCell.setFormula(XSSFCell.java:480)
at shadeio.poi.xssf.usermodel.XSSFCell.setCellFormulaImpl(XSSFCell.java:460)
at shadeio.poi.ss.usermodel.CellBase.setCellFormula(CellBase.java:124)
at shadeio.spoiwo.natures.xlsx.Model2XlsxConversions$.convertCell(Model2XlsxConversions.scala:49)
at shadeio.spoiwo.natures.xlsx.Model2XlsxConversions$.$anonfun$convertRow$6(Model2XlsxConversions.scala:143)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at shadeio.spoiwo.natures.xlsx.Model2XlsxConversions$.convertRow(Model2XlsxConversions.scala:143)
at shadeio.spoiwo.natures.xlsx.Model2XlsxConversions$.$anonfun$writeToExistingSheet$2(Model2XlsxConversions.scala:156)
at scala.collection.immutable.List.foreach(List.scala:431)
at shadeio.spoiwo.natures.xlsx.Model2XlsxConversions$.writeToExistingSheet(Model2XlsxConversions.scala:156)
at shadeio.spoiwo.natures.xlsx.Model2XlsxConversions$.$anonfun$writeToExistingWorkbook$1(Model2XlsxConversions.scala:323)
at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:38)
at shadeio.spoiwo.natures.xlsx.Model2XlsxConversions$.writeToExistingWorkbook(Model2XlsxConversions.scala:320)
at shadeio.spoiwo.natures.xlsx.Model2XlsxConversions$XlsxWorkbook.writeToExisting(Model2XlsxConversions.scala:420)
at com.crealytics.spark.excel.ExcelFileSaver.writeToWorkbook$1(ExcelFileSaver.scala:37)
at com.crealytics.spark.excel.ExcelFileSaver.save(ExcelFileSaver.scala:45)
at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:54)
at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:110)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:110)
at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:106)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:481)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:457)
at org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:106)
at org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:93)
at org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:91)
at org.apache.spark.sql.execution.QueryExecution.assertCommandExecuted(QueryExecution.scala:128)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:848)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:382)
at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:355)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:239)
at com.hlag.dmsa.bcm.params.sp_csb_customer_rates_list_snf$.delayedEndpoint$com$hlag$dmsa$bcm$params$sp_csb_customer_rates_list_snf$1(sp_csb_customer_rates_list_snf.scala:95)
at com.hlag.dmsa.bcm.params.sp_csb_customer_rates_list_snf$delayedInit$body.apply(sp_csb_customer_rates_list_snf.scala:6)
at scala.Function0.apply$mcV$sp(Function0.scala:39)
at scala.Function0.apply$mcV$sp$(Function0.scala:39)
at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:17)
at scala.App.$anonfun$main$1$adapted(App.scala:80)
at scala.collection.immutable.List.foreach(List.scala:431)
at scala.App.main(App.scala:80)
at scala.App.main$(App.scala:78)
at com.hlag.dmsa.bcm.params.sp_csb_customer_rates_list_snf$.main(sp_csb_customer_rates_list_snf.scala:6)
at com.hlag.dmsa.bcm.params.sp_csb_customer_rates_list_snf.main(sp_csb_customer_rates_list_snf.scala)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:955)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1043)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1052)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
22/05/31 09:06:11 INFO ShutdownHookManager: Shutdown hook called
22/05/31 09:06:11 INFO ShutdownHookManager: Deleting directory /tmp/spark-local-dir-dm/spark-585534ca-a39f-4d64-82e0-bea200e33a5f
22/05/31 09:06:11 INFO ShutdownHookManager: Deleting directory /tmp/spark-c9c44392-fa99-4b8d-bf9f-651fc5d60194
22/05/31 09:06:11 INFO MetricsSystemImpl: Stopping s3a-file-system metrics system...
22/05/31 09:06:11 INFO MetricsSystemImpl: s3a-file-system metrics system stopped.
22/05/31 09:06:11 INFO MetricsSystemImpl: s3a-file-system metrics system shutdown complete.
===================
If I export in a Spark-native format like CSV or TSV, there is no problem.
The question is: there is no logic on the Excel end, so why is it trying to apply any formula?
Hi, I followed the stack trace and tried to understand why POI assumes that the cell is not a string but a formula. Looking around a bit, I found this in spoiwo/model/Cell.scala:
case v: String => if (v.startsWith("=")) { FormulaCell(v.drop(1), indexOption, styleOption, styleInheritance)
So the logic is similar to Excel's: everything that starts with = is assumed to be a formula. As soon as the cell gets a formula type, shadeio.spoiwo.natures.xlsx.Model2XlsxConversions$.convertCell(Model2XlsxConversions.scala:49) tries to convert/parse it and then fails to do so.
So I assume that in one of your rows the data starts with a =.
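If that theory holds, a minimal repro might look like the sketch below; this is an illustration, not taken from the original job (the column name and output path are hypothetical):

// Minimal repro sketch, assuming the leading-'=' theory above is correct.
// A bare "=" becomes an empty formula after v.drop(1), which should trigger
// exactly the "Parse error near char 0 '' in specified formula ''" message.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("excel-formula-repro").getOrCreate()
import spark.implicits._

val df = Seq("=").toDF("value") // hypothetical single-cell DataFrame

df.coalesce(1)
  .write
  .format("com.crealytics.spark.excel") // V1 path, which goes through spoiwo
  .option("header", "true")
  .mode("overwrite")
  .save("/tmp/formula_repro.xlsx") // hypothetical local path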
To validate this "theory", you could also prepend the resulting string with some ' or another character before feeding it to spark-excel.
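A hedged sketch of such a sanitization pass over the string columns (treating every StringType column is my assumption, not something from the report):

// Sketch: prepend an apostrophe to any string value starting with '=' so
// spoiwo no longer classifies it as a formula. Assumes the offending values
// can only appear in StringType columns.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, concat, lit, when}
import org.apache.spark.sql.types.StringType

def escapeLeadingEquals(df: DataFrame): DataFrame =
  df.schema.fields
    .filter(_.dataType == StringType)
    .foldLeft(df) { (acc, field) =>
      acc.withColumn(
        field.name,
        when(col(field.name).startsWith("="), concat(lit("'"), col(field.name)))
          .otherwise(col(field.name))
      )
    }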
The V2 data source does not use the spoiwo library, so the issue with cell values that have a leading = should not happen with the V2 data source.
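A hedged sketch of switching the report's write to the V2 path; only the format string changes, though option support can differ between V1 and V2, and the V2 writer may treat the path as a directory of part files:

// Sketch: same write as in the report, but via the V2 data source
// (format("excel")), which bypasses spoiwo's formula handling.
dfSrcQuery.toDF()
  .coalesce(1)
  .write
  .format("excel") // V2 implementation
  .option("header", "true")
  .mode("overwrite")
  .save("s3a://sp_csb_customer_rates_list_snf/sp_csb_customer_rates_list") // V2 may write a directory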