spark-excel icon indicating copy to clipboard operation
spark-excel copied to clipboard

Numbers are parsed to numbers when enforceSchema is set to true and type being String

Open FlavioShneider opened this issue 1 year ago • 4 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Current Behavior

one column value can be large numbers like this 1010675 note:

  • enforceSchema is set to true;
  • the schema type was specified as String;
  • usePlainNumberFormat = true
  • this is similar to , but still different to an existing issue(https://github.com/crealytics/spark-excel/issues/126). I worry the title is different and the problem might be ignored, so raised here with explict problem title.

the current read/parse result is: 1010675.0

Expected Behavior

1010675 string not transformed

Steps To Reproduce

No response

Environment

- Spark version: 3.2.2
- Spark-Excel version: 3.2.2_0.18.5
- OS:
- Cluster environment

Anything else?

No response

FlavioShneider avatar Jun 12 '23 07:06 FlavioShneider

Hi @FlavioShneider, can you post the exact code, and in the best case also the Excel file to reproduce the issue?

nightscape avatar Jun 12 '23 08:06 nightscape

Input Test1.xlsx

Code The code is written in the Azure Databricks (10.4 lts) notebooks style Scala

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.{DataFrame, DataFrameReader}


val schema = s"`productId` STRING, " +
    s"`price` DOUBLE "

def inferOrEnforceSchema(
  schema: Option[String],
  reader: DataFrameReader,
  path: String
): DataFrame = {
	val df = schema match {
	  case None         => reader.option("inferSchema", "true").load(path)
	  case Some(schema) => reader.option("enforceSchema", "true").schema(schema).load(path)
	}
	df
}

val excelReader = spark.read
  .format("excel")
  .option(
    "dataAddress",
    "0!A3"
  ) 
  .option(
    "usePlainNumberFormat",
    true
  )
  .option(
    "treatEmptyValuesAsNulls",
    true
  ) 
  .option("excerptSize", 30)
  .option("header", "true")

val result = inferOrEnforceSchema(Some(schema), excelReader, datasetPath)

display(result)

Output export (4).csv

FlavioShneider avatar Jun 12 '23 14:06 FlavioShneider

I think I know what's happening: The productId is specified as a STRING in the schema, but stored as a number in the Excel file, so we run into this line where we call a method to convert the number to a String. I'm not sure how we could infer the formatting such that the String looks like an integer at the end. If you're familiar with Scala debugging, you could try to set a breakpoint at one of the lines above and see what additional information is available. Otherwise, you could also just make the column a Text in Excel if that's an option.

nightscape avatar Jun 14 '23 07:06 nightscape

Could this behavior be triggered if inferSchema=true and a column has mixed values (some string and some numeric, some produced by formulas)?

In looking at the code, it looks like DataFormatter is used when usePlainNumberFormat is set to true, which should be a safe conversion, but I'm seeing behavior that converts number values to scientific notation when there are formulas and mixed values in the column.

Spark 3.1.1 Spark-Excel version: 0.18.7

coryvisi avatar Jul 18 '23 16:07 coryvisi