spark-excel
spark-excel copied to clipboard
Numbers are parsed to numbers when enforceSchema is set to true and type being String
Is there an existing issue for this?
- [X] I have searched the existing issues
Current Behavior
one column value can be large numbers like this 1010675 note:
- enforceSchema is set to true;
- the schema type was specified as String;
- usePlainNumberFormat = true
- this is similar to , but still different to an existing issue(https://github.com/crealytics/spark-excel/issues/126). I worry the title is different and the problem might be ignored, so raised here with explict problem title.
the current read/parse result is: 1010675.0
Expected Behavior
1010675 string not transformed
Steps To Reproduce
No response
Environment
- Spark version: 3.2.2
- Spark-Excel version: 3.2.2_0.18.5
- OS:
- Cluster environment
Anything else?
No response
Hi @FlavioShneider, can you post the exact code, and in the best case also the Excel file to reproduce the issue?
Input Test1.xlsx
Code The code is written in the Azure Databricks (10.4 lts) notebooks style Scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.{DataFrame, DataFrameReader}
val schema = s"`productId` STRING, " +
s"`price` DOUBLE "
def inferOrEnforceSchema(
schema: Option[String],
reader: DataFrameReader,
path: String
): DataFrame = {
val df = schema match {
case None => reader.option("inferSchema", "true").load(path)
case Some(schema) => reader.option("enforceSchema", "true").schema(schema).load(path)
}
df
}
val excelReader = spark.read
.format("excel")
.option(
"dataAddress",
"0!A3"
)
.option(
"usePlainNumberFormat",
true
)
.option(
"treatEmptyValuesAsNulls",
true
)
.option("excerptSize", 30)
.option("header", "true")
val result = inferOrEnforceSchema(Some(schema), excelReader, datasetPath)
display(result)
Output export (4).csv
I think I know what's happening: The productId
is specified as a STRING
in the schema, but stored as a number in the Excel file, so we run into this line where we call a method to convert the number to a String.
I'm not sure how we could infer the formatting such that the String looks like an integer at the end.
If you're familiar with Scala debugging, you could try to set a breakpoint at one of the lines above and see what additional information is available.
Otherwise, you could also just make the column a Text in Excel if that's an option.
Could this behavior be triggered if inferSchema=true and a column has mixed values (some string and some numeric, some produced by formulas)?
In looking at the code, it looks like DataFormatter is used when usePlainNumberFormat is set to true, which should be a safe conversion, but I'm seeing behavior that converts number values to scientific notation when there are formulas and mixed values in the column.
Spark 3.1.1 Spark-Excel version: 0.18.7