Page orientation

Open anuraag94 opened this issue 7 years ago • 1 comments

First, I just want to say that this is a fantastic package and has been extremely helpful, thank you.

I'm writing a parser to extract data from unstructured pdfs, and sometimes the pages are rotated 90 degrees. I'm aware that the mediabox stores properties like page width and page height, and with a few exceptions, I can back out the page orientation using that.

My question is whether the mediabox can be accessed using the pdftools package, or whether you know of any other way to do this within my R program. Any solution will be much appreciated!

anuraag94 avatar Mar 14 '18 18:03 anuraag94

  • pdf_pagesize() returns a data frame of page size information, one row per page. You can use it to work out whether a page is in "portrait" or "landscape" mode, though I don't think it lets you distinguish a clockwise from a counterclockwise 90 degree rotation, or detect a page flipped upside down by 180 degrees.
pdf_orientation <- function(input) {
    df <- pdftools::pdf_pagesize(input)
    ifelse(df$height < df$width, "landscape", "portrait")
}
  • On my machine, with my PDF files, the page widths and heights appear to be in "big points" (72 big points = 1 inch). The grid::unit() name for this unit is "bigpts".
  • Caveat: on my Linux machine, pdf_pagesize() incorrectly swaps the height and width for some rotated pages in a subset of PDF files. I'm unsure whether this is a bug in the PDF files or in my system Poppler library (which is probably a few years old), but it seems to go away if I first pre-process the PDF file by running it through Ghostscript with something like the following helper function:
pdf_gs <- function(input, output = NULL, ..., args = character(0L)) {
    input <- normalizePath(input)
    if (!length(output)) 
        output <- sub("\\.pdf$", "_output.pdf", input)
    output <- normalizePath(output, mustWork = FALSE)
    args <- c("-dBATCH",
              "-dNOPAUSE",
              "-sDEVICE=pdfwrite",
              "-sAutoRotatePages=None",
              paste0("-sOutputFile=", shQuote(output)),
              args,
              shQuote(input))
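    cmd <- tools::find_gs_cmd()
    # find_gs_cmd() returns "" when no Ghostscript executable is found
    if (!nzchar(cmd))
        stop("Ghostscript executable not found")
    # Capture Ghostscript's console output so it is not printed
    stdout <- system2(cmd, args, stdout = TRUE)
    invisible(output)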
}
input |> pdf_gs() |> pdf_orientation()
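
As a rough sketch of the first two points above (orientation from width versus height, and sizes in big points at 72 per inch), the helper below works on a data frame shaped like the one pdf_pagesize() returns. The name `describe_pages()` and the sample sizes are hypothetical. It assumes the data frame has `width` and `height` columns in big points, which matches what I see on my machine but may not hold everywhere. Like `pdf_orientation()`, it cannot tell which way a page was rotated.

```r
# Hypothetical helper: annotate a page-size data frame (with `width` and
# `height` columns, assumed to be in big points) with inches and orientation.
describe_pages <- function(sizes) {
    stopifnot(is.data.frame(sizes),
              all(c("width", "height") %in% names(sizes)))
    data.frame(
        page = seq_len(nrow(sizes)),
        width_in = sizes$width / 72,    # 72 big points = 1 inch
        height_in = sizes$height / 72,
        # Square pages (width == height) are reported as "portrait" here
        orientation = ifelse(sizes$height < sizes$width,
                             "landscape", "portrait")
    )
}

# Made-up sizes standing in for pdftools::pdf_pagesize(input):
# a US Letter page upright, then the same page on its side
sizes <- data.frame(width = c(612, 792), height = c(792, 612))
pages <- describe_pages(sizes)
print(pages)
```

With a real file, `describe_pages(pdftools::pdf_pagesize(input))` should give the same kind of table, subject to the caveat above about swapped sizes on some rotated pages.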

trevorld avatar Nov 14 '24 19:11 trevorld