docxtractr
docx_extract_all_cmnts(..., include_text = TRUE) failing on edge case
First off, thank you for this package, it's really useful.
I've run into an interesting scenario where the argument include_text = TRUE fails for a Word document.
Here are two nearly identical Word documents: "works.docx" and "does not work.docx".
Both contain just the text "Manuscript text" with the comment "comment text".
However, include_text = TRUE fails for "does not work.docx" due to the introduction of a tab character: word_src comes back empty.
```r
"does not work.docx" |>
  docxtractr::read_docx() |>
  docxtractr::docx_extract_all_cmnts(include_text = TRUE)
#> # A tibble: 1 x 6
#>   id    author          date                 initials comment_text word_src
#> * <chr> <chr>           <chr>                <chr>    <chr>        <chr>
#> 1 0     James Conigrave 2022-01-18T02:08:00Z ""       Comment text ""
```
```r
"works.docx" |>
  docxtractr::read_docx() |>
  docxtractr::docx_extract_all_cmnts(include_text = TRUE)
#> # A tibble: 1 x 6
#>   id    author          date                 initials comment_text word_src
#> * <chr> <chr>           <chr>                <chr>    <chr>        <chr>
#> 1 0     James Conigrave 2022-01-18T02:08:00Z ""       Comment text Manuscript t~
```
It appears that "does not work.docx" contains small changes to the XML which break the functionality. I'm not quite sure what caused them, but I would love a fix if you have time!
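In case it helps with debugging, here is a minimal sketch for comparing the XML of the two files. It assumes the xml2 package is installed. `docx_body_elements()` is a hypothetical helper written for this report, not part of docxtractr. It relies only on the fact that a .docx file is a zip archive whose main text lives in word/document.xml.

```r
library(xml2)

# Hypothetical helper (not a docxtractr function): return the qualified names
# of every element under <w:body> in word/document.xml, in document order.
docx_body_elements <- function(path) {
  exdir <- tempfile()
  # A .docx file is a zip archive; extract only the main document part.
  utils::unzip(path, files = "word/document.xml", exdir = exdir)
  doc <- read_xml(file.path(exdir, "word", "document.xml"))
  ns <- xml_ns(doc)
  nodes <- xml_find_all(doc, "//w:body//*", ns)
  xml_name(nodes, ns = ns)
}

# Usage (assumes both files are in the working directory):
# broken  <- docx_body_elements("does not work.docx")
# working <- docx_body_elements("works.docx")
# setdiff(broken, working)  # element types that only appear in the failing file
```

If the tab is the culprit, I would expect an element such as `w:tab` to show up only for the failing file, but I have not confirmed that this is the only difference.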