spark-excel icon indicating copy to clipboard operation
spark-excel copied to clipboard

Large Numbers Being Captured As Scientific Notation

Open CHRI4177 opened this issue 5 years ago • 15 comments

Excel displays large numbers such as 12345678901234 as 1.23457E+13 in cells by default, however in the formula bar it still displays the correct value of 12345678901234. Would it be possible to add a feature to specify columns that should use the value in the formula bar instead of the value displayed in the cell.

CHRI4177 avatar May 30 '19 19:05 CHRI4177

What column type does the corresponding column have in the parsed dataframe? The code for converting from Excel cells to DataFrame values is here: https://github.com/crealytics/spark-excel/blob/master/src/main/scala/com/crealytics/spark/excel/DataColumn.scala#L26-L90 If you have recommendations for improvements or better still can come up with a PR, I'd be happy to review :)

nightscape avatar May 30 '19 19:05 nightscape

I'm seeing this as well when I have an Excel column of string values and one of the strings is a numeric value, but the format isn't set for the cell, so it's being formatted in scientific notation. Could there be an option to use the datatype from the dataframe's schema to determine the formatting instead of the cell format?

champagst avatar Jun 07 '19 11:06 champagst

Can you try manually specifying the schema? https://github.com/crealytics/spark-excel#create-a-dataframe-from-an-excel-file-using-custom-schema

nightscape avatar Jun 07 '19 20:06 nightscape

I get the same result specifying the schema.

champagst avatar Jun 10 '19 11:06 champagst

So you specified a DoubleType for that column and it still comes out in scientific notation?

nightscape avatar Jun 10 '19 18:06 nightscape

I have it as a StringType. It's a column of freeform text, some of which are long strings of numbers. Unfortunately, the cells with numbers are left with a general format in Excel, so they're being treated as numeric when they're being formatted.

champagst avatar Jun 10 '19 21:06 champagst

If you know how to set up an IDE an debug a program, you could set breakpoints somewhere here https://github.com/crealytics/spark-excel/blob/master/src/main/scala/com/crealytics/spark/excel/DataColumn.scala#L33-L44 and check if there's another method to call that would give you the desired format. Unfortunately, I can't invest any time in this...

nightscape avatar Jun 11 '19 12:06 nightscape

any updates on this?

mohammadreza490 avatar Jul 29 '21 15:07 mohammadreza490

Hi @mohammadreza490 With the existing description in this thread, I can create a test-case for sure.

However, if you don't mind, please attach with an example that you have (minimal, remove sensitive information), and we will add that to the test-suite.

Sincerely,

quanghgx avatar Jul 30 '21 01:07 quanghgx

Hi @mohammadreza490 , Sorry for bothering you again. Is this request for reading or for writing excel? I am trying to create a test case for this.

quanghgx avatar Jan 15 '22 12:01 quanghgx

Hi @quanghgx,

I faced this issue a long time ago but if I remember correctly, I was trying to read a column that had large numbers and the format was in scientific notation and the cell types weren’t numeric anymore and they were stored as strings.

please let me know if I can help with this issue anymore.

thanks

mohammadreza490 avatar Jan 15 '22 13:01 mohammadreza490

Hey @quanghgx! If I'm not mistaking I just ran into this issue while trying to write excel files with v2. The styles.xml in xlsx sets the format to: <numFmt numFmtId="164" formatCode="0.00E+000"/>. I tried setting the usePlainNumberFormat option to true but it didn't have any effect (I suspect in only applies to reads?). For a quick test I updated this test case to write the dataframe: "plain number format when usePlainNumberFormat=true and inferSchema=true" Please let me know if I should report this as a separate issue or if you can have a look at this. Thanks!

cristichircu avatar Feb 17 '22 13:02 cristichircu

Thanks to @cristichircu , the fix for Decimal format is in MR: https://github.com/crealytics/spark-excel/pull/552/

quanghgx avatar Feb 20 '22 13:02 quanghgx

The fix for writing decimal format is available in https://github.com/crealytics/spark-excel/releases/tag/v0.17.0. It can be activated by setting the "usePlainNumberFormat" option to true.

From what I understand from the previous comments the issue here is with reading numeric columns, not writing them. Assigning back to @quanghgx so he can decide on what to do with this.

cristichircu avatar Apr 19 '22 08:04 cristichircu

Hey @nightscape @mohammadreza490 I am trying to read data from a column which have multiple datatypes, both integer, and decimal. The decimal values are rounded off to some digits only. I am passing inferSchema as false (Eg: 284.259235897532 to 284.2592359 ), If I pass inferSchema as true, integers are being converted into decimals. How to resolve this issue?

shanmukha-albanero avatar Mar 16 '23 08:03 shanmukha-albanero