spark-excel
Large Numbers Being Captured As Scientific Notation
Excel displays large numbers such as 12345678901234 as 1.23457E+13 in cells by default; however, the formula bar still shows the correct value of 12345678901234. Would it be possible to add a feature to specify columns that should use the value from the formula bar instead of the value displayed in the cell?
What column type does the corresponding column have in the parsed dataframe? The code for converting from Excel cells to DataFrame values is here: https://github.com/crealytics/spark-excel/blob/master/src/main/scala/com/crealytics/spark/excel/DataColumn.scala#L26-L90 If you have recommendations for improvements or, better still, can come up with a PR, I'd be happy to review :)
I'm seeing this as well when I have an Excel column of string values and one of the strings is a numeric value, but the format isn't set for the cell, so it's being formatted in scientific notation. Could there be an option to use the datatype from the dataframe's schema to determine the formatting instead of the cell format?
Can you try manually specifying the schema? https://github.com/crealytics/spark-excel#create-a-dataframe-from-an-excel-file-using-custom-schema
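For reference, a minimal sketch of a custom-schema read along the lines of the linked README example; the column names and file path are placeholders, not taken from this thread:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().getOrCreate()

// Placeholder schema: "comments" is the free-form text column discussed here,
// declared explicitly as StringType; "id" is just an illustrative numeric column.
val schema = StructType(Seq(
  StructField("id", LongType, nullable = true),
  StructField("comments", StringType, nullable = true)
))

val df = spark.read
  .format("com.crealytics.spark.excel")
  .option("header", "true")
  .option("inferSchema", "false") // rely on the explicit schema instead of inferring
  .schema(schema)
  .load("/path/to/file.xlsx")
```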
I get the same result specifying the schema.
So you specified a DoubleType for that column and it still comes out in scientific notation?
I have it as a StringType. It's a column of freeform text, some of which are long strings of numbers. Unfortunately, the cells with numbers are left with a general format in Excel, so they're being treated as numeric when they're formatted.
If you know how to set up an IDE and debug a program, you could set breakpoints somewhere here https://github.com/crealytics/spark-excel/blob/master/src/main/scala/com/crealytics/spark/excel/DataColumn.scala#L33-L44 and check if there's another method to call that would give you the desired format. Unfortunately, I can't invest any time in this...
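To make that suggestion concrete, here is a rough, hypothetical sketch (not spark-excel's actual code path, and assuming a recent Apache POI where Cell.getCellType returns a CellType enum) of the difference between the formatted cell value and a plain rendering of the raw numeric value:

```scala
import org.apache.poi.ss.usermodel.{Cell, CellType, DataFormatter}

// DataFormatter.formatCellValue follows the cell's display format, so a cell
// formatted as scientific (e.g. "0.00E+000") comes back as "1.23E+013"-style text.
// Reading the raw numeric value and printing it via BigDecimal.toPlainString
// avoids scientific notation for large numbers.
def plainStringValue(cell: Cell, formatter: DataFormatter): String =
  cell.getCellType match {
    case CellType.NUMERIC =>
      java.math.BigDecimal.valueOf(cell.getNumericCellValue).toPlainString
    case _ =>
      formatter.formatCellValue(cell)
  }
```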
Any updates on this?
Hi @mohammadreza490, with the existing description in this thread, I can certainly create a test case.
However, if you don't mind, please attach an example that you have (minimal, with sensitive information removed), and we will add it to the test suite.
Sincerely,
Hi @mohammadreza490, sorry for bothering you again. Is this request for reading or for writing Excel? I am trying to create a test case for this.
Hi @quanghgx,
I faced this issue a long time ago, but if I remember correctly, I was trying to read a column that had large numbers; the format was scientific notation, the cell types weren't numeric anymore, and the values were stored as strings.
Please let me know if I can help any further with this issue.
Thanks
Hey @quanghgx! If I'm not mistaken, I just ran into this issue while trying to write Excel files with v2. The styles.xml in the xlsx sets the format to <numFmt numFmtId="164" formatCode="0.00E+000"/>. I tried setting the usePlainNumberFormat option to true but it didn't have any effect (I suspect it only applies to reads?).
For a quick test I updated this test case to write the dataframe: "plain number format when usePlainNumberFormat=true and inferSchema=true"
Please let me know if I should report this as a separate issue or if you can have a look at this.
Thanks!
Thanks to @cristichircu, the fix for the Decimal format is in PR https://github.com/crealytics/spark-excel/pull/552/
The fix for writing decimal format is available in https://github.com/crealytics/spark-excel/releases/tag/v0.17.0. It can be activated by setting the "usePlainNumberFormat" option to true.
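A minimal write sketch of that option, assuming spark-excel v0.17.0+; the DataFrame and output path are placeholders:

```scala
// df is an existing DataFrame containing large/decimal numeric columns.
df.write
  .format("com.crealytics.spark.excel") // or "excel" with the V2 data source
  .option("header", "true")
  .option("usePlainNumberFormat", "true") // write numbers without scientific notation
  .mode("overwrite")
  .save("/path/to/output.xlsx")
```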
From what I understand from the previous comments the issue here is with reading numeric columns, not writing them. Assigning back to @quanghgx so he can decide on what to do with this.
Hey @nightscape @mohammadreza490, I am trying to read data from a column that has multiple datatypes, both integer and decimal. When I pass inferSchema as false, the decimal values are rounded off to only a few digits (e.g. 284.259235897532 becomes 284.2592359). If I pass inferSchema as true, integers are converted into decimals. How can I resolve this issue?
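For reference, a rough sketch combining the suggestions from earlier in this thread (an explicit schema plus usePlainNumberFormat on read); the column name and file path are placeholders, and whether this resolves the rounding is not verified here:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().getOrCreate()

// Declare the mixed integer/decimal column explicitly instead of relying on inferSchema.
val schema = StructType(Seq(
  StructField("measurement", DoubleType, nullable = true)
))

val df = spark.read
  .format("com.crealytics.spark.excel")
  .option("header", "true")
  .option("usePlainNumberFormat", "true") // read values without rounding/scientific formatting
  .schema(schema)
  .load("/path/to/file.xlsx")
```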