openxlsx
`read.xlsx(detectDates = TRUE)` failing
Reporting again here as we still have this error
openxlsx::read.xlsx("C:/Users/jmbar/Downloads/bad.xlsx", detectDates = TRUE)
#> Error in read_workbook(cols_in = cell_cols, rows_in = cell_rows, v = v, : basic_string::substr: __pos (which is 8) > this->size() (which is 7)
Created on 2021-11-06 by the reprex package (v2.0.1)
Original
I can confirm the issue is still there. See the file attached; opening it with `detectDates=TRUE` raises an error:
Error reading date: 44489.4 row: 1 col: 1
Error in read_workbook(cols_in = cell_cols, rows_in = cell_rows, v = v, : basic_string::substr: __pos (which is 8) > this->size() (which is 7)
Originally posted by @aushev in https://github.com/awalker89/openxlsx/issues/249#issuecomment-962523430
I have an additional file doing the exact same thing. Would you like me to upload a minimal example for experimental purposes?
Hi @ProfFancyPants, I assume that we understand fairly well what is going on; the question is rather why it happens. I have pushed a fix to the development branch. Please check whether it fixes your issue. I assume the fix is only partially right: it should make the error go away, but it is a hack that hides a problem which should be fixed somewhere else.
I checked it with the development branch and the issue still remains exactly as before. I was able to get the reader to do some additional interesting things: when I deleted selected cells in the date column, the file loaded, but values in other columns were removed. If the bulk of this issue isn't in Apache I could help you take a look. What is so baffling is that doing a complete copy and paste-as-values makes the error go away completely, even after restoring all the previous formatting. My assumption was that there is a hidden or exotic character that looks exactly like the normal character but gets coerced back when pasted as a value.
> If the bulk of this issue isn't in Apache I could help you take a look.
I don't understand what Apache has to do with this issue. When I looked into the issue I attempted to fix, we expected a string like "2022-03-01", but somehow still had a 7-character-wide numeric like "11111.1". Therefore, when looking for the part "-01", we fail and the error is thrown: a substring beginning at position 8 is requested, but only 7 characters are available. My fix checks for the numeric and initiates a conversion from numeric to date. After this the string is long enough. Therefore, I assumed this had already been fixed. What I don't know is why we end up in this situation in the first place. The symptoms can be treated, but they are not the root cause.
If you want to look into this, you're ofc welcome :)
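The failure mode described above can be illustrated in plain R. This is a minimal sketch, not the actual openxlsx code: `to_iso_date` is a hypothetical helper, and the 1899-12-30 origin assumes the workbook uses Excel's default 1900 date system. It shows why a 7-character serial such as "44489.4" is too short for a substring starting at position 8, and how converting the numeric first yields a string of the expected width.

```r
# Hypothetical helper for illustration only; not part of openxlsx.
# Assumes the workbook uses Excel's 1900 date system (origin 1899-12-30).
to_iso_date <- function(x) {
  # Already an ISO-like date string: keep the first 10 characters.
  if (grepl("^[0-9]{4}-[0-9]{2}-[0-9]{2}", x)) {
    return(substr(x, 1, 10))
  }
  # Otherwise treat it as an Excel serial number and convert it first.
  serial <- suppressWarnings(as.numeric(x))
  if (is.na(serial)) {
    return(NA_character_)
  }
  format(as.Date(floor(serial), origin = "1899-12-30"))
}

raw <- "44489.4"      # value from the error message in this thread
nchar(raw)            # 7, so there is no character at position 8
iso <- to_iso_date(raw)
nchar(iso)            # 10, long enough to hold the "-DD" part
substr(iso, 8, 10)    # the day suffix the reader was looking for
```

Note that R's own `substr()` silently returns an empty string when the start position is past the end, whereas C++ `std::string::substr` throws when the position exceeds the string size, which matches the `basic_string::substr` message reported here.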
I think the issue might be deeper and is related to how Excel saves the .XML files internally. My test file contains two end-user identical sheets, reduced to the simplest reproducible example. By "end-user identical" I mean identical as far as someone using Excel is concerned, after doing everything possible by hand to make the two sheets match. The .XML files diverge in the column and row styles.
Sheets that are exactly end-user identical are being saved with different styles in the XML. Once "_openxlsx_loadworksheets" reads in styleObjects(styleObjectsSEXP) and xmlFiles(xmlFilesSEXP), any manner of things could be happening, and it doesn't mean that loadworksheets is making the wrong choices based on what the row's "s=X" and "<v>X</v>" are telling it to do. Also, I haven't quite discerned where all the styleObjects are coming from, because some aren't in the workbook anymore.
The worst part is that formatting order seems to matter. If I format the date field as one thing and then change it back, the .XML files aren't identical.
Sorry, but I don't really get your point. The styles come from styles.xml: some are in the `<xf .../>` elements and some custom formats are in `<numFmt ...>`. `openxlsx` creates styleObjects when loading and saving, so our style IDs do not have to match those of Excel. The question to solve in this issue is: why is `openxlsx` currently identifying some cells as dates, and why aren't they prepared correctly? Somewhere in the loading process we skip a date creation step. In Excel they are numerics with styles; we convert them from numerics to date/POSIX strings and then try to use some substrings for another round of date creation. The entire process is a bit dubious and I cannot really say why we're doing it in this specific way.
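The lookup chain mentioned here (a cell's `s` attribute, then the matching `<xf>` entry, then its number format) can be sketched as follows. This is an illustration under stated assumptions, not how openxlsx implements it: the two vectors are hand-written stand-ins for the contents of a styles.xml, `is_date_style` is a hypothetical name, and the check on custom format codes is only a rough heuristic.

```r
# Hand-written stand-ins for a styles.xml; the values are made up.
# numFmtId of each <xf> inside <cellXfs>, in document order.
cell_xfs_numfmt <- c(0L, 14L, 164L)
# Custom <numFmt numFmtId="..." formatCode="..."/> entries.
custom_numfmts <- c("164" = "yyyy-mm-dd hh:mm")

# Hypothetical helper: does the cell style index `s` point to a date format?
is_date_style <- function(s) {
  id <- cell_xfs_numfmt[s + 1L]   # `s` is zero-based in the XML
  # Built-in ids 14-22 and 45-47 are date/time formats in SpreadsheetML.
  if (id %in% c(14:22, 45:47)) {
    return(TRUE)
  }
  code <- unname(custom_numfmts[as.character(id)])
  # Rough heuristic: a custom code containing d, m, y, h or s is date-like.
  !is.na(code) && grepl("[dmyhs]", code, ignore.case = TRUE)
}

# Style 0 is "General"; styles 1 and 2 are date-like in this made-up example.
vapply(0:2, is_date_style, logical(1))
```

Under these assumptions, a cell such as `<c s="1"><v>44489.4</v></c>` would be flagged as a date even though its stored value is a plain numeric, which is the state the reader then has to convert.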
However, behind the scenes we're working hard on a successor to `openxlsx`, and this problem does not exist in the new code. Therefore my time and interest to solve this one here are currently a bit limited. After all, it's not affecting many people.
Just tried to read in the bad.xlsx file posted in the initial post with the potential fix in the development branch. Unfortunately, trying to open the file crashes/terminates R.
sessionInfo()
R version 4.2.1 (2022-06-23 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19044)
Matrix products: default
locale:
[1] LC_COLLATE=German_Germany.utf8 LC_CTYPE=German_Germany.utf8 LC_MONETARY=German_Germany.utf8 LC_NUMERIC=C LC_TIME=German_Germany.utf8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] openxlsx_4.2.5.1 testthat_3.1.4
loaded via a namespace (and not attached):
[1] zip_2.2.0 Rcpp_1.0.8.3 compiler_4.2.1 pillar_1.7.0 prettyunits_1.1.1 remotes_2.4.2 tools_4.2.1 digest_0.6.29 pkgbuild_1.3.1 pkgload_1.3.0 memoise_2.0.1 lifecycle_1.0.1 tibble_3.1.7 pkgconfig_2.0.3 rlang_1.0.3 cli_3.3.0
[17] rstudioapi_0.13 commonmark_1.8.0 xfun_0.31 fastmap_1.1.0 xml2_1.3.3 knitr_1.39 stringr_1.4.0 roxygen2_7.2.0 withr_2.5.0 desc_1.4.1 fs_1.5.2 vctrs_0.4.1 devtools_2.4.3 rprojroot_2.0.3 glue_1.6.2 R6_2.5.1
[33] processx_3.6.1 fansi_1.0.3 sessioninfo_1.2.2 callr_3.7.0 purrr_0.3.4 magrittr_2.0.3 ps_1.7.1 codetools_0.2-18 ellipsis_0.3.2 usethis_2.1.6 utf8_1.2.2 stringi_1.7.6 cachem_1.0.6 crayon_1.5.1 brio_1.1.3
I have the same issue, with a file that was simply written by openxlsx and then read back again (so Excel has never been near it).
write.xlsx(CVAD_list, "CVAD list.xlsx")
CVAD_list2 <- read.xlsx("CVAD list.xlsx")
CVAD_list3 <- read.xlsx("CVAD list.xlsx", detectDates = T)
Error message from the third line above is as follows:
Error reading date: 44696.5 row: 59 col: 6
Error: basic_string::substr: __pos (which is 8) > this->size() (which is 7)
There's nothing obviously odd about cell F59.
I'm happy to create an example that I can post if it would be useful.
packageVersion("openxlsx")
[1] ‘4.2.5.2’
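One detail that may be relevant: both values reported in this thread (44489.4 and 44696.5) carry a fractional part, which in Excel's serial format encodes the time of day. Whether that is what triggers the error is an assumption and not confirmed anywhere in this thread. The sketch below uses base R only and assumes the 1900 date system (origin 1899-12-30) to show what 44696.5 represents.

```r
serial <- 44696.5                 # value from the error message above
secs <- serial * 86400            # convert days to seconds

# Full timestamp: the .5 corresponds to half a day, i.e. noon.
stamp <- as.POSIXct(secs, origin = "1899-12-30", tz = "UTC")
stamp_txt <- format(stamp, "%Y-%m-%d %H:%M:%S")

# Date only: drop the fractional part before converting.
date_txt <- format(as.Date(floor(serial), origin = "1899-12-30"))

stamp_txt
date_txt
```

A possible workaround, untested against the files in this thread, would be to read with `detectDates = FALSE` and convert the affected columns afterwards, for example with `openxlsx::convertToDate()` or `openxlsx::convertToDateTime()`.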
This issue is stale because it has been open 365 days with no activity. Remove stale label or comment or this will be closed in 7 days.
This is not stale, I got the same error just now. Is there a fix? thanks :D