Add something like pdf_extract_raw()?
Can we add a function to get the raw object data of a PDF?
Purpose: Extract raw PDF object data with object streams unpacked, enabling access to PDF annotations, links, and other internal structures that are currently inaccessible through the existing qpdf R package functions.
I'd like the equivalent of this command:
qpdf --qdf --object-streams=disable input.pdf output.pdf
It writes a normalized PDF containing the raw object data.
Proposed Function Signature:
Returns: Raw PDF object data as character lines containing:
PDF header (%PDF-1.6, %QDF-1.0), etc.
I'm ultimately looking to extract URLs and URIs from PDFs and think this will help me get there.
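To illustrate what I would do with the raw lines, here is a minimal sketch. It assumes the QDF text puts each dictionary key on its own line and that the URI is written as a literal string, e.g. `/URI (https://example.com)`. The function name `extract_uris()` is hypothetical, and hex-encoded strings (`<...>`) are not handled.

```r
# Hypothetical helper (not part of any package): pull literal-string /URI
# values out of QDF text lines. Assumes at most one /URI entry per line.
extract_uris <- function(lines) {
  # "/URI (....)" where the string may contain backslash-escaped characters
  pattern <- "/URI\\s*\\(((?:[^()\\\\]|\\\\.)*)\\)"
  hits <- regmatches(lines, regexec(pattern, lines, perl = TRUE))
  # Keep the first capture group; lines without a match become NA
  uris <- vapply(hits,
                 function(m) if (length(m) >= 2) m[2] else NA_character_,
                 character(1))
  uris <- uris[!is.na(uris)]
  # Undo the PDF escapes for parentheses and backslashes
  uris <- gsub("\\\\([()\\\\])", "\\1", uris, perl = TRUE)
  unique(uris)
}

# Example input shaped like QDF output (made-up sample lines, not real output)
sample_lines <- c(
  "  /A <<",
  "    /S /URI",
  "    /URI (https://example.com/a)",
  "  >>",
  "    /URI (https://example.com/b\\(1\\))",
  "    /URI (https://example.com/a)"
)
print(extract_uris(sample_lines))
```

The `/S /URI` line is skipped because it has no parenthesized string after `/URI`, and duplicates are dropped by `unique()`.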
Maybe you can do this with the pdftools package? e.g. pdftools::pdf_data()
That could work, but it doesn't seem to produce the same kind of output. I suppose I can just keep using the system call:
# Simple equivalence test using a public PDF
library(pdftools)
# Use a publicly available PDF for testing
test_url <- "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"
temp_pdf <- tempfile(fileext = ".pdf")
download.file(test_url, temp_pdf, mode = "wb")
cat("=== RAW OUTPUT COMPARISON ===\n")
# Method 1: qpdf system call
cat("\n1. qpdf system call output:\n")
qpdf_result <- system2("qpdf", c("--qdf", "--object-streams=disable", temp_pdf, "-"),
                       stdout = TRUE, stderr = TRUE)
cat("qpdf lines:", length(qpdf_result), "\n")
cat("Sample qpdf output:\n")
print(head(qpdf_result, 10))
# Method 2: pdftools::pdf_data()
cat("\n2. pdftools::pdf_data() output:\n")
pdf_data_result <- pdf_data(temp_pdf)
cat("pdf_data structure:", class(pdf_data_result), "\n")
cat("pdf_data length:", length(pdf_data_result), "\n")
if (length(pdf_data_result) > 0) {
  cat("First page structure:\n")
  # str() prints on its own; wrapping it in print() would also print NULL
  str(pdf_data_result[[1]])
}
# Are they equivalent?
cat("\n=== EQUIVALENCE TEST ===\n")
cat("qpdf: character vector of", length(qpdf_result), "lines\n")
cat("pdf_data: list of", length(pdf_data_result), "data frames\n")
cat("EQUIVALENT:", identical(class(qpdf_result), class(pdf_data_result)), "\n")
unlink(temp_pdf)
I could not get pdftools::pdf_data() to produce the kind of data needed. But this does:
system2("qpdf", c("--qdf", "--object-streams=disable", temp_pdf, "-"), stdout = TRUE, stderr = TRUE)
The problem is that I'm trying to run this on shinyapps.io, which doesn't seem to have qpdf installed, so that's why I want this to be exposed in this library.
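In the meantime, a guard around the system call at least makes the failure explicit when the command-line tool is missing. This is only a sketch: `read_qdf_lines()` and its `qpdf_bin` argument are hypothetical names, not functions from the qpdf R package, and it does not solve the shinyapps.io case.

```r
# Hypothetical wrapper (not part of the qpdf R package): run the qpdf
# command-line tool if it is available, otherwise fail with a clear message.
read_qdf_lines <- function(pdf_path, qpdf_bin = unname(Sys.which("qpdf"))) {
  # Sys.which() returns "" when the executable is not on the PATH
  if (!nzchar(qpdf_bin)) {
    stop("qpdf command-line tool not found on this system", call. = FALSE)
  }
  # Capture stdout only, so messages on stderr do not mix into the PDF text
  system2(qpdf_bin,
          c("--qdf", "--object-streams=disable", shQuote(pdf_path), "-"),
          stdout = TRUE, stderr = FALSE)
}

# Usage (requires the qpdf CLI to be installed):
# lines <- read_qdf_lines("input.pdf")
```

Passing `qpdf_bin` explicitly also makes it possible to point at a binary in a non-standard location.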