
`read.xlsx(detectDates = TRUE)` failing

Open jmbarbone opened this issue 3 years ago • 10 comments

Reporting again here as we still have this error

openxlsx::read.xlsx("C:/Users/jmbar/Downloads/bad.xlsx", detectDates = TRUE)
#> Error in read_workbook(cols_in = cell_cols, rows_in = cell_rows, v = v, : basic_string::substr: __pos (which is 8) > this->size() (which is 7)

Created on 2021-11-06 by the reprex package (v2.0.1)

Original

I can confirm the issue is still there. See the file attached, opening it with detectDates=TRUE raises an error:

Error reading date:
44489.4
row: 1
col: 1
Error in read_workbook(cols_in = cell_cols, rows_in = cell_rows, v = v,  : 
  basic_string::substr: __pos (which is 8) > this->size() (which is 7)

bad.xlsx

Originally posted by @aushev in https://github.com/awalker89/openxlsx/issues/249#issuecomment-962523430

jmbarbone avatar Nov 07 '21 02:11 jmbarbone

I have an additional file doing the exact same thing. Would you like me to upload a minimum example for experimental purposes?

ProfFancyPants avatar Feb 28 '22 19:02 ProfFancyPants

Hi @ProfFancyPants, I think we understand fairly well what is going on; the open question is why it happens. I have pushed a fix to the development branch. Please check whether it fixes your issue. I assume it is only partially right: it should fix the issue, but it is a hack that hides a problem which should be fixed somewhere else.

JanMarvin avatar Feb 28 '22 20:02 JanMarvin

I checked it with the development branch and the issue remains exactly as before. I was able to get the reader to do some additional interesting things: when I deleted selected cells in the date column, the file loaded, but values in other columns were removed. If the bulk of this issue isn't in Apache I could help you take a look. What is so baffling is that a complete copy and paste-as-values makes the error go away completely, even after restoring all the previous formatting. My assumption was that there is a hidden or exotic character that looks exactly like the normal character but gets coerced back when pasted as a value.

ProfFancyPants avatar Mar 01 '22 04:03 ProfFancyPants

If the bulk of this issue isn't in Apache I could help you take a look.

I don't understand what Apache has to do with this issue. When I looked into the issue I attempted to fix, we expected a string like "2022-03-01" but somehow still had a 7-character numeric like "11111.1". Therefore, when looking for the part "-01", we fail and the error is thrown: a substring beginning at position 8 is requested, but only 7 characters are available. My fix checks for the numeric and initiates a conversion from numeric to date, after which the string is long enough. Therefore, I assume this has already been fixed. What I don't know is why we ended up in this situation in the first place. The symptoms can be treated, but they are not the root of the problem.
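
For anyone following along, the failure mode and the kind of guard described above can be sketched in plain R. This is only an illustration under assumptions: `normalise_date_string()` is a hypothetical helper, not the actual C++ change in openxlsx, it handles a single value, and it assumes the 1900 date system, where serial numbers count days from 1899-12-30.

```r
# Hypothetical helper, not part of openxlsx: mimics the guard described above.
normalise_date_string <- function(x) {
  # A value such as "44489.4" is still an Excel serial number, not "YYYY-MM-DD".
  is_serial <- grepl("^[0-9]+(\\.[0-9]+)?$", x)
  if (is_serial) {
    # Assumption: 1900 date system, so serials count days from 1899-12-30.
    x <- format(as.Date(floor(as.numeric(x)), origin = "1899-12-30"))
  }
  x
}

raw <- "44489.4"
nchar(raw)            # 7: too short to contain a "-DD" part

fixed <- normalise_date_string(raw)
nchar(fixed)          # 10
substr(fixed, 8, 10)  # "-20"
```

Note that R's own `substr()` returns a shorter string instead of throwing when the start position is past the end, which is why the sketch inspects lengths with `nchar()`.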

If you want to look into this, you're of course welcome :)

JanMarvin avatar Mar 01 '22 07:03 JanMarvin

I think the issue might be deeper and related to how Excel saves the .xml files internally. My test file contains two end-user identical sheets, reduced to the simplest reproducible example. By "end-user identical" I mean identical as far as someone using Excel is concerned, after doing everything possible in the UI to make the two sheets identical. The .xml files diverge in the column and row styles. [image]

Sheets that are exactly end-user identical are being saved with different styles in the XML. Once "_openxlsx_loadworksheets" reads in styleObjects (styleObjectsSEXP) and xmlFiles (xmlFilesSEXP), any number of things could be happening, and that doesn't mean loadworksheets is making the wrong choices based on what the row's "s=X" and "<v>X</v>" tell it to do. Also, I haven't quite discerned where all the styleObjects come from, because some are no longer in the workbook.

The worst part is that formatting order seems to matter. If I format the date field to one thing and change it back, the .xml files aren't identical.

ProfFancyPants avatar Mar 01 '22 18:03 ProfFancyPants

Sorry, but I don't really get your point. The styles are from styles.xml: some are in the <xf.../> elements and some custom formats are in <numFmt ...>. openxlsx creates styleObjects when loading and saving, so our style IDs need not match those of Excel. The question to solve in this issue is: why is openxlsx identifying some cells as dates, and why aren't they prepared correctly? Somewhere in the loading process we skip a date creation step. In Excel they are numerics with styles; we convert them from numerics to date/POSIX strings and then use some substrings for another round of date creation. The entire process is a bit dubious, and I cannot really say why we're doing it in this specific way.
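
As an illustration of the numeric-to-date/POSIX step mentioned above: a serial such as 44489.4 carries a time of day in its fractional part. A minimal base R sketch, assuming the 1900 date system and UTC; `serial_to_posix()` is a hypothetical name, not an openxlsx function, and this is not how the package does it internally:

```r
# Hypothetical helper (not part of openxlsx): Excel serial -> POSIXct in UTC.
serial_to_posix <- function(serial) {
  # 25569 is the serial number of 1970-01-01 in the 1900 date system.
  seconds <- round((serial - 25569) * 86400)  # round away floating-point noise
  as.POSIXct(seconds, origin = "1970-01-01", tz = "UTC")
}

stamp <- serial_to_posix(44489.4)
format(stamp, "%Y-%m-%d %H:%M:%S", tz = "UTC")  # "2021-10-20 09:36:00"
```

The rounding matters: without it, the product can land a fraction of a second below the intended value and print as 09:35:59.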

However, behind the scenes we're working hard on a successor to openxlsx, and this problem does not exist in the new code. Therefore my time and interest in solving this one here are currently a bit limited. After all, it's not affecting many people.

JanMarvin avatar Mar 01 '22 19:03 JanMarvin

Just tried to read in the bad.xlsx file posted in the initial post with the potential fix in the development branch. Unfortunately, trying to open the file crashes/terminates R.

sessionInfo()
R version 4.2.1 (2022-06-23 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19044)

Matrix products: default

locale:
[1] LC_COLLATE=German_Germany.utf8  LC_CTYPE=German_Germany.utf8    LC_MONETARY=German_Germany.utf8 LC_NUMERIC=C                    LC_TIME=German_Germany.utf8    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] openxlsx_4.2.5.1 testthat_3.1.4  

loaded via a namespace (and not attached):
 [1] zip_2.2.0         Rcpp_1.0.8.3      compiler_4.2.1    pillar_1.7.0      prettyunits_1.1.1 remotes_2.4.2     tools_4.2.1       digest_0.6.29     pkgbuild_1.3.1    pkgload_1.3.0     memoise_2.0.1     lifecycle_1.0.1   tibble_3.1.7      pkgconfig_2.0.3   rlang_1.0.3       cli_3.3.0        
[17] rstudioapi_0.13   commonmark_1.8.0  xfun_0.31         fastmap_1.1.0     xml2_1.3.3        knitr_1.39        stringr_1.4.0     roxygen2_7.2.0    withr_2.5.0       desc_1.4.1        fs_1.5.2          vctrs_0.4.1       devtools_2.4.3    rprojroot_2.0.3   glue_1.6.2        R6_2.5.1         
[33] processx_3.6.1    fansi_1.0.3       sessioninfo_1.2.2 callr_3.7.0       purrr_0.3.4       magrittr_2.0.3    ps_1.7.1          codetools_0.2-18  ellipsis_0.3.2    usethis_2.1.6     utf8_1.2.2        stringi_1.7.6     cachem_1.0.6      crayon_1.5.1      brio_1.1.3    

deschen1 avatar Jul 06 '22 08:07 deschen1

I have the same issue, with a file that was simply written by openxlsx and then read back again (so Excel has never been near it).

write.xlsx(CVAD_list, "CVAD list.xlsx")
CVAD_list2 <- read.xlsx("CVAD list.xlsx")
CVAD_list3 <- read.xlsx("CVAD list.xlsx", detectDates = TRUE)

Error message from the third line above is as follows:

Error reading date:
44696.5
row: 59
col: 6
Error: basic_string::substr: __pos (which is 8) > this->size() (which is 7)

There's nothing obviously odd about cell F59.

I'm happy to create an example that I can post if it would be useful.

packageVersion("openxlsx")
[1] ‘4.2.5.2’
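
Until the root cause is found, one possible workaround (a sketch, not verified against the files in this thread) is to read without `detectDates` and convert the affected columns afterwards. The data frame below is made up and only stands in for what `read.xlsx("CVAD list.xlsx")` might return; `visit_date` is a hypothetical column name. openxlsx also exports `convertToDate()` for this purpose; the sketch uses base R so that it runs without the package.

```r
# Made-up stand-in for the result of read.xlsx(..., detectDates = FALSE):
# date cells arrive as Excel serial numbers.
CVAD_list2 <- data.frame(
  id = c(1, 2),
  visit_date = c(44696.5, 44489.4)
)

# Convert serials to Date, dropping the time-of-day fraction.
# Assumption: 1900 date system (origin 1899-12-30).
CVAD_list2$visit_date <- as.Date(floor(CVAD_list2$visit_date), origin = "1899-12-30")

format(CVAD_list2$visit_date)  # "2022-05-15" "2021-10-20"
```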

cha-petersumm avatar Mar 30 '23 21:03 cha-petersumm

This issue is stale because it has been open 365 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] avatar Mar 30 '24 01:03 github-actions[bot]

This is not stale, I got the same error just now. Is there a fix? thanks :D

CarolusKwok avatar Apr 05 '24 17:04 CarolusKwok