docx_extract_all_cmnts(..., include_text = TRUE) failing on edge case

Open conig opened this issue 3 years ago • 0 comments

First off, thank you for this package, it's really useful.

I've run into an interesting edge case where include_text = TRUE fails for a Word document.

Here are two nearly identical Word documents: "works.docx" and "does not work.docx".

Both just have the text: "Manuscript text" with the comment "comment text"

However, include_text = TRUE fails for "does not work.docx", apparently because of a tab character introduced into the document: word_src comes back empty.

"does not work.docx" |> 
  docxtractr::read_docx() |> 
  docxtractr::docx_extract_all_cmnts(include_text = TRUE)
#> # A tibble: 1 x 6
#>   id    author          date                 initials comment_text word_src
#> * <chr> <chr>           <chr>                <chr>    <chr>        <chr>   
#> 1 0     James Conigrave 2022-01-18T02:08:00Z ""       Comment text ""

"works.docx" |> 
  docxtractr::read_docx() |> 
  docxtractr::docx_extract_all_cmnts(include_text = TRUE)
#> # A tibble: 1 x 6
#>   id    author          date                 initials comment_text word_src     
#> * <chr> <chr>           <chr>                <chr>    <chr>        <chr>        
#> 1 0     James Conigrave 2022-01-18T02:08:00Z ""       Comment text Manuscript t~

It appears that "does not work.docx" contains small changes to the XML which break the functionality. I'm not quite sure what caused them, but I would love a fix if you have time!
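
In case it helps with debugging, here is a rough sketch of how the two files could be compared at the XML level. It uses xml2 and utils::unzip(). The helper names read_document_xml() and list_comment_markup() are made up for this example and are not part of docxtractr. I'm assuming the relevant markup lives in word/document.xml and uses the usual w: namespace prefix; I have not confirmed that this is where the difference actually is.

```r
library(xml2)

# Hypothetical helper (not part of docxtractr): pull word/document.xml
# out of a .docx file, which is a zip archive, and parse it.
read_document_xml <- function(path) {
  tmp <- tempfile()
  dir.create(tmp)
  utils::unzip(path, files = "word/document.xml", exdir = tmp)
  xml2::read_xml(file.path(tmp, "word", "document.xml"))
}

# Hypothetical helper: list comment range markers, text runs and tabs
# in document order, so two files can be compared side by side.
# Assumes the document declares the "w" namespace prefix.
list_comment_markup <- function(doc) {
  nodes <- xml2::xml_find_all(
    doc,
    "//w:commentRangeStart | //w:commentRangeEnd | //w:t | //w:tab"
  )
  data.frame(
    element = xml2::xml_name(nodes),
    text = xml2::xml_text(nodes),
    stringsAsFactors = FALSE
  )
}
```

Running list_comment_markup(read_document_xml("works.docx")) and then the same call on "does not work.docx" should show whether the sequence of elements between commentRangeStart and commentRangeEnd differs between the two files.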

conig, Jan 18 '22 05:01