readxl icon indicating copy to clipboard operation
readxl copied to clipboard

Better guessing of desired precision when coercing double to character

Open arnyeinstein opened this issue 7 years ago • 12 comments

Hi I have the following code: ` t <-read_excel("NI-Supply-Use-Tables-Multipliers-2012_1.xlsx", sheet ="SIC07 Classifications (SIC)", skip = 2, col_types = c("text","text","text","text"))

SUT <- select(t, SUTcode,Description) %>% filter(SUTcode > 1) %>% print(n=100) ` The excel file column "SUTcode" contains a mixture of numbers and strings, I therefore set the format in Excel explicitly to "text". When I read the file, the SUTcode 60.1 is written as 69.099999999999994 (same happens for some other SUTcodes). I tried to correct this by entering the code explicitly as a text in Excel (e.g. '60.1 and not 60.1), but the SUTcodes still appear with trailing 9s. NI-Supply-Use-Tables-Multipliers-2012_1.xlsx

No idea what is going wrong. Cheers Renger

My setup (Excel 2016) R version 3.4.0 (2017-04-21) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 10 x64 (build 14393)

Matrix products: default

locale: [1] LC_COLLATE=German_Switzerland.1252 LC_CTYPE=German_Switzerland.1252
[3] LC_MONETARY=German_Switzerland.1252 LC_NUMERIC=C
[5] LC_TIME=German_Switzerland.1252

attached base packages: [1] stats graphics grDevices utils datasets methods base

other attached packages: [1] readxl_1.0.0 dplyr_0.5.0 purrr_0.2.2 readr_1.1.0
[5] tidyr_0.6.1 tibble_1.3.0 ggplot2_2.2.1 tidyverse_1.1.1

loaded via a namespace (and not attached): [1] Rcpp_0.12.10 compiler_3.4.0 cellranger_1.1.0 plyr_1.8.4
[5] forcats_0.2.0 tools_3.4.0 jsonlite_1.4 lubridate_1.6.0 [9] gtable_0.2.0 nlme_3.1-131 lattice_0.20-35 psych_1.7.3.21
[13] DBI_0.6-1 parallel_3.4.0 haven_1.0.0 stringr_1.2.0
[17] httr_1.2.1 xml2_1.1.1 hms_0.3 grid_3.4.0
[21] R6_2.2.0 rematch_1.0.1 foreign_0.8-68 reshape2_1.4.2
[25] modelr_0.1.0 magrittr_1.5 scales_0.4.1 fortunes_1.5-4
[29] assertthat_0.2.0 mnormt_1.5-5 rvest_0.3.2 colorspace_1.3-2 [33] stringi_1.1.5 lazyeval_0.2.0 munsell_0.4.3 broom_0.4.2

arnyeinstein avatar May 09 '17 13:05 arnyeinstein

I am having the same issue. If I try to read an .xlsx doc with the cells below, I get a similar rounding of the text. I am attempting to read a list of specs where the numeric entry refers to a part number, so the rounding is an issue. I can read as numeric without it being modified, but then I lose the other information.

excel

read_excel("SampleReadExcel.xlsx", col_types = "text", col_names = FALSE) A tibble: 4 × 1 X__1 1 text1 2 text2 3 another text 4 342.39019999999999

read_excel("SampleReadExcel.xlsx", col_types = "numeric", col_names = FALSE) A tibble: 4 × 1 X__1 1 NA 2 NA 3 NA 4 342.3902

kraillerj avatar May 15 '17 19:05 kraillerj

I'm also having this problem. This looks like an issue with floating point representations. It could be that somewhere along the line in read_excel, the cells containing numbers are being read in as numbers (specifically, as floats) and then being converted to text. But base R certainly has no such troubles - as.character(5.1) does not yeild 5.0999999999999996 - so I'm not sure how the trouble could be arising.

I wouldn't put it past excel to do something daft like somehow store text cells containing numbers as floating points.

ftlbiped avatar May 30 '17 22:05 ftlbiped

@arnyeinstein Here's what I get when I read the xlsx you provided, using your code:

library(dplyr)
library(readxl)

df <- read_excel(
  "NI-Supply-Use-Tables-Multipliers-2012_1.xlsx",
  sheet = "SIC07 Classifications (SIC)", skip = 2,
  col_types = c("text","text","text","text")
)        

df <- df %>% 
  rename(SUTcode = `SUT code (=SIC 2007 industry & CPA 2008 product code)`)

select(df, SUTcode, Description) %>%
  filter(SUTcode > 1) %>%
  print(n = Inf)
#> # A tibble: 104 x 2
#>     SUTcode
#>       <chr>
#>   1    10.1
#>   2  10.2-3
#>   3    10.4
#>   4    10.5
#>   5    10.6
#>   6    10.7
#>   7    10.8
#>   8    10.9
#>   9 11.01-6
#>  10   11.07
#>  11      12
#>  12      13
#>  13      14
#>  14      15
#>  15      16
#>  16      17
#>  17      18
#>  18      19
#>  19     20A
#>  20     20B
#>  21     20C
#>  22    20.3
#>  23    20.4
#>  24    20.5
#>  25      21
#>  26      22
#>  27 23OTHER
#>  28  23.5-6
#>  29  24.1-3
#>  30  24.4-5
#>  31 25OTHER
#>  32    25.4
#>  33      26
#>  34      27
#>  35      28
#>  36      29
#>  37    30.1
#>  38    30.3
#>  39 30OTHER
#>  40      31
#>  41      32
#>  42   30.15
#>  43   33.16
#>  44 33OTHER
#>  45    35.1
#>  46  35.2-3
#>  47      36
#>  48      37
#>  49      38
#>  50      39
#>  51      41
#>  52      42
#>  53      43
#>  54      45
#>  55      46
#>  56      47
#>  57  49.1-2
#>  58  49.3-5
#>  59      50
#>  60      51
#>  61      52
#>  62      53
#>  63      55
#>  64      56
#>  65      58
#>  66      59
#>  67      60
#>  68      61
#>  69      62
#>  70      63
#>  71      64
#>  72  65.1-2
#>  73    65.3
#>  74      66
#>  75  68.1-2
#>  76 68.2IMP
#>  77    68.3
#>  78    69.1
#>  79    69.2
#>  80      70
#>  81      71
#>  82      72
#>  83      73
#>  84      74
#>  85      75
#>  86      77
#>  87      78
#>  88      79
#>  89      80
#>  90      81
#>  91      82
#>  92      84
#>  93      85
#>  94      86
#>  95      87
#>  96      88
#>  97      90
#>  98      91
#>  99      92
#> 100      93
#> 101      94
#> 102      95
#> 103      96
#> 104      97
#> # ... with 1 more variables: Description <chr>

I don't see a code "60.1" nor do I see the phenomenon you describe.

That being said, this will always be a potential problem when a numeric cell is holding information that is not really a number. Under the hood, Excel stores all numbers as floating point and it has to be converted to text.

It's much better to do as you say and make the cell text (vs numeric) on the Excel side. This, in fact, already appears to be true of this sheet. AFAICT the entire SUT code column has been stored as text. Which makes sense with what I'm seeing above.

jennybc avatar Jun 14 '17 21:06 jennybc

@jessekrailler I advise that you format that cell as text in Excel, rather than numeric. If the scale of your problem makes such manual work impractical, then you could import the column as a "list" type. Your text cells will be character, your numeric cells will be numeric. Then you could convert the numeric cell(s) to character with the precision you wish, allowing you to convert the entire column to character. If you provide an example sheet, I could demonstrate that.

jennybc avatar Jun 14 '17 21:06 jennybc

The conversion of Excel numbers to text happens at the C level and there's no way of knowing how many decimal places to show, i.e. how many decimal places were initially typed in by a human or are currently visible to the human eye in the Excel application. I don't see how readxl could determine to show 1 decimal place for @arnyeinstein's example and 4 for @jessekrailler's. This is a fundamental awkwardness of having data of mixed type in a single column, but importing into a data frame.

jennybc avatar Jun 14 '17 21:06 jennybc

Asking around, it seems in the medium-term, i.e. next release, I could try to improve readxl's automatic guessing of precision for a "number in a text column". There are ways to access similar logic as R and Excel are using. But for today, there are the solutions described above.

Making such a cell a 'text' cell in Excel will always deliver the best result, because it completely avoids the floating point to text conversion.

jennybc avatar Jun 15 '17 13:06 jennybc

Thanks for your help @jennybc! I tried formatting the cells as text prior to reading the file, but still got the same results (file attached). I will use your suggestion to read in as a list and modify from there. I had a feeling this was an underlying R/Excel issue, but I really appreciate your work around suggestion. SampleReadExcel.xlsx

kraillerj avatar Jun 15 '17 14:06 kraillerj

@jessekrailler I experimented with your file. It seems that the gratuitous decimal places are stored in the case of a numeric cell to which you then apply a text format. If you format a cell as text, then enter a number, we get the desired result. Likewise, we get what we want when the number is preceded by a single quote, which another way to force text.

readxl::read_excel("~/Downloads/SampleReadExcel-2.xlsx")
#> # A tibble: 5 x 2
#>                text1                                                X__1
#>                <chr>                                               <chr>
#> 1              text2                                                <NA>
#> 2       another text                                                <NA>
#> 3 342.39019999999999 entered as numeric, then made into text, I suppose?
#> 4           342.3902                            made text, then typed in
#> 5           342.3902   entered with a leading single quote to force text

jennybc avatar Jun 15 '17 15:06 jennybc

Im stuck in a similar prediciment where the files are coming from someone else, and seem to have the formatting issues mentioned by @jennybc. The columns in my source data contain a mix of numeric and text, and I've found that reading the columns as lists then converting to characters seems to honor the original excel format.

It's not a perfect solution, but it seems to work for my case:

df<- read_xlsx("Crappy_File.xlsx", col_types = "list") %>%
  mutate_all(funs(as.character(.))) 

Although its a bit late, a workaround for @arnyeinstein might look like:

t <-read_excel("NI-Supply-Use-Tables-Multipliers-2012_1.xlsx",
    sheet ="SIC07 Classifications (SIC)", skip = 2, col_types = "list") %>%
    mutate_all(funs(as.character(.))) %>%
    mutate_at("SUTcode", funs(as.numeric(.)))  

SUT <- select(t, SUTcode,Description) %>%
filter(SUTcode > 1) %>%
print(n=100)

andbarker avatar Feb 25 '18 00:02 andbarker

Somehow I ended up with a hard copy of The Recompiler which has:

A Crash Course on Floating Point Numbers by Federico Mena-Quintero

Gave a good bibliography re: classic references.

Grisu algorithm (used in readr) for printing

Florian Loitsch. 2010. Printing floating-point numbers quickly and accurately with integers. In Proceedings of the 31st ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '10). ACM, New York, NY, USA, 233-243. DOI: https://doi.org/10.1145/1806596.1806623

Florian Loitsch. 2010. Printing floating-point numbers quickly and accurately with integers. SIGPLAN Not. 45, 6 (June 2010), 233-243. DOI: https://doi.org/10.1145/1809028.1806623

https://www.cs.tufts.edu/~nr/cs257/archive/florian-loitsch/printf.pdf

Bellerophon for reading

William D. Clinger. 1990. How to read floating point numbers accurately. SIGPLAN Not. 25, 6 (June 1990), 92-101. DOI=http://dx.doi.org/10.1145/93548.93557

William D. Clinger. 1990. How to read floating point numbers accurately. In Proceedings of the ACM SIGPLAN 1990 conference on Programming language design and implementation (PLDI '90). ACM, New York, NY, USA, 92-101. DOI=http://dx.doi.org/10.1145/93542.93557

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.45.4152&rep=rep1&type=pdf

This smells related to some of readxl's difficulties when bringing in date times:

The key idea is this. Consider the problem of computing the value of some function g, rounded to the nearest 10000, given an oracle that delivers the value of g rounded to the nearest 10. Given an input x, the obvious approach is to ask the oracle for the value of g(x) to the nearest 10, and then to round that result to the nearest 10000. Unfortunately, this does not always work. I g(x) is 11,074,996 and g(y) is 11,075,004, for example, then the answer should be 11,070,000 and 11,080,000, respectively, but the oracle will deliver 11,075,000 for both x and y. This approach usually works, though, It fails only when the value delivered by the oracle ends in 5000.

It turns out you can't truly solve ^^ that problem without bignums .

Floating-point numbers in general

David Goldberg. 1991. What every computer scientist should know about floating-point arithmetic. ACM Comput. Surv. 23, 1 (March 1991), 5-48. DOI: https://doi.org/10.1145/103162.103163

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.22.6768

jennybc avatar Oct 29 '18 00:10 jennybc

This is related to #510, I changed cell format to text but still get the same result. 3B should show 132834.3 instead of 132834.29999999999.

https://www.dropbox.com/s/9jdmatvz797p889/testdata10880-text.xlsx?dl=0

tmp <- tempfile(fileext = ".xlsx")
dir.create(tempdir(), showWarnings = FALSE)
download.file("https://www.dropbox.com/s/9jdmatvz797p889/testdata10880-text.xlsx?dl=1", destfile = tmp, mode = "wb")
df <- readxl::read_excel(tmp, col_types = "text")
head(df[df$ID == "B",])
#> # A tibble: 4 x 2
#>   ID    Value             
#>   <chr> <chr>             
#> 1 B     48275.3           
#> 2 B     495824.2          
#> 3 B     132834.29999999999
#> 4 B     348842

Created on 2019-02-11 by the reprex package (v0.2.1)

I just wanted to double check if the root cause is as same as this #360

hidekoji avatar Feb 11 '19 18:02 hidekoji

@hidekoji Yeah, as I recently commented in #360, you can't really fix this by applying a text format post hoc. Here is what's on disk for that cell, after you change the format to text:

<c r="B11" s="1">
<v>132834.29999999999</v>
</c>

Once all that spurious precision is in there, there has to be some algorithm to remove it. Excel is doing this all the time -- the data on disk is not what is revealed in the UI. But their logic and what they show the user is not accessible to third party software like readxl.

jennybc avatar Feb 13 '19 05:02 jennybc