Fatal R error when attempting to extract text from a PDF that includes a particular mathematical symbol
Description
Fatal R error when attempting to use extract_text on a PDF that includes $\bar{x}$. There's no error message, R just terminates.
Reproducible example
I have constructed a simple example PDF, attached xbar.pdf, that gives the error. (I made this using Microsoft Word, inserting the $x$ and $\bar{x}$ using the equation editor, then saving to PDF.)
As this crashes R I can't use the reprex package for this, as far as I know...
library(tabulapdf)
# First try getting the text up to but not including the x-bar
out1 <- extract_text("xbar.pdf", area = list(c(0,0,200,193)))
# This works
# Get the whole text
out2 <- extract_text("xbar.pdf")
# This gives a fatal error
# Get the text for just the x-bar area
out3 <- extract_text("xbar.pdf", area = list(c(0,193,200,210)))
# This gives a fatal error
Note that if I call the tabula.jar bundled with the R package directly from the command line like this
java -jar C:\Users\<username>\AppData\Local\R\win-library\4.4\tabulapdf\java\tabula.jar xbar.pdf
I get the following output (which is fine for my purposes - I am not particularly concerned about the $\bar{x}$ rendering properly, I just don't want the R session to crash):
Aug 06, 2024 10:03:59 AM org.apache.fontbox.ttf.CmapSubtable processSubtype14
WARNING: Format 14 cmap table is not supported and will be ignored
The mean of x is denoted ???
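Since the command-line call above succeeds, a possible stopgap is to launch the bundled jar from R in its own JVM with `system2()`. This is only a sketch: `tabula_cli_args` and `tabula_cli_text` are hypothetical helper names, and it assumes `java` is on the PATH and that the jar lives at `java/tabula.jar` inside the installed package, as the path above suggests. The result is whatever the CLI prints, which may differ from what `extract_text()` would return.

```r
# Hypothetical helper: build the arguments for `java -jar <jar> <pdf>`.
tabula_cli_args <- function(pdf, jar) {
  c("-jar", shQuote(jar), shQuote(pdf))
}

# Hypothetical helper: run the bundled jar in a separate JVM and
# return the lines it prints to stdout.
tabula_cli_text <- function(pdf,
                            jar = system.file("java", "tabula.jar",
                                              package = "tabulapdf")) {
  # system.file() returns "" when the package or file cannot be found
  stopifnot(nzchar(jar), file.exists(pdf))
  # stderr = FALSE discards the fontbox warnings shown above
  system2("java", args = tabula_cli_args(pdf, jar),
          stdout = TRUE, stderr = FALSE)
}

# Usage (assumes xbar.pdf is in the working directory):
# txt <- tabula_cli_text("xbar.pdf")
```

Because the JVM runs outside the R process, a failure there should surface as a non-zero exit status rather than taking the R session down with it.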
Expected result
No fatal error: I would expect any issues with reading/rendering the $\bar{x}$ to result in a fallback like putting in '??' or similar.
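Until the crash itself is fixed, one way to protect an interactive session is to run the risky call in a child R process, so that only the child dies. The sketch below uses base R only; `run_isolated` is a hypothetical helper name, and it assumes that a crash in the child shows up as a non-zero exit status.

```r
# Hypothetical helper: evaluate R code (given as a string) in a separate
# Rscript process and report whether that process exited cleanly.
run_isolated <- function(expr_text) {
  rscript <- file.path(R.home("bin"), "Rscript")
  # system2() warns on a non-zero exit status; the status is read
  # from the "status" attribute instead
  out <- suppressWarnings(
    system2(rscript, args = c("-e", shQuote(expr_text)),
            stdout = TRUE, stderr = TRUE)
  )
  status <- attr(out, "status")
  list(ok = is.null(status) || status == 0,
       output = as.character(out))
}

# Usage (assumes tabulapdf is installed and xbar.pdf is in the working directory):
# res <- run_isolated('cat(tabulapdf::extract_text("xbar.pdf"))')
# if (!res$ok) message("extract_text failed in the child process")
```

The trade-off is the start-up cost of a fresh R process and JVM for every call, so this is better suited to screening problem files than to bulk extraction.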
Session info
R version 4.4.0 (2024-04-24 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 10 x64 (build 19045)
Matrix products: default
locale:
[1] LC_COLLATE=English_United Kingdom.utf8 LC_CTYPE=English_United Kingdom.utf8
[3] LC_MONETARY=English_United Kingdom.utf8 LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.utf8
time zone: Europe/London
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] tabulapdf_1.0.5-3
loaded via a namespace (and not attached):
[1] utf8_1.2.4 R6_2.5.1 tzdb_0.4.0 magrittr_2.0.3 glue_1.7.0 tibble_3.2.1
[7] pkgconfig_2.0.3 png_0.1-8 rJava_1.0-11 lifecycle_1.0.4 readr_2.1.5 cli_3.6.2
[13] fansi_1.0.6 vctrs_0.6.5 compiler_4.4.0 rstudioapi_0.16.0 tools_4.4.0 hms_1.1.3
[19] pillar_1.9.0 rlang_1.1.3
@tomsutch thanks for reporting this. I can fix it next week.
@tomsutch it took me longer than expected, but I think I was able to solve it.
Hi @tomsutch, just following up: did the last commit solve the issue?
Hi, thanks for looking into this! I can't see a new commit here - please could you point me to it?
Sorry, I realize I never pushed the commit.
I have pushed it now, in dev/.
However, it fails on Ubuntu, although it worked on Windows when I set UTF-8.
Hi @jazzido,
@tomsutch found this very interesting case that I can't solve "universally".
Do you have any clues?
I added my test to reproduce the error here https://github.com/ropensci/tabulapdf/blob/main/dev/test-special_characters.R
and the file here https://github.com/ropensci/tabulapdf/blob/main/inst/examples/xbar.pdf
@tomsutch @jazzido
I proposed a fix here https://github.com/pachadotdev/tabula-java/commit/7bcb49cadfa1fa0edde4516539a25317e4147128
but when I build the jar locally, the resulting jar no longer works with R.
This:
load_doc <- function(file, password = NULL, copy = FALSE) {
  localfile <- localize_file(path = file, copy = copy)
  pdfDocument <- new(J("org.apache.pdfbox.pdmodel.PDDocument"))
  fileInputStream <- new(J("java.io.FileInputStream"), name <- localfile)
  if (is.null(password)) {
    message("HERE")
    doc <- pdfDocument$load(input = fileInputStream)
  } else {
    doc <- pdfDocument$load(input = fileInputStream, password = password)
  }
  pdfDocument$close()
  doc
}
fails with:
HERE
Error in pdfDocument$load :
no field, method or inner class called 'load'
Hi, is there any update on this one? I encountered another fatal error that aborts the R session when the PDF contains $\hat{\beta}$ (see example2.pdf):
tabulapdf::extract_text('example2.pdf', pages = 1, area = list(c(333.9459, 655.1610, 352.8368, 686.0823)))
I proposed a fix to the Java code, but the resulting jar is not working for me. I pinged @jazzido about the build process.
I updated to tabula 1.0.6, but because I do not know Java, I cannot fix the issue coming from there.
See https://github.com/ropensci/tabulapdf/tree/166
The solution is that Java returns "The mean of x is denoted ?" instead of "The mean of x is denoted ?̅?"