stri_match regex works interactively but not when sourcing script
I have a script that is extracting text from a PDF file using
text <- pdftools::pdf_text(...)
text <- paste0(text, collapse = '')
I then want to pull the relevant table out of this text string using a regular expression. Because the string contains \r\n, I am setting dotall = TRUE, as follows:
data <- stri_match(text, regex = paste0(startText, '(.*)', endText), dotall = TRUE)
table <- data[1,2]
where startText and endText are delimiting text elements marking the outside of the table text.
The issue is that, whilst this match line works when run interactively from within RStudio, it does not work when the script is sourced; it returns NA instead.
I was using version 1.1.7 of stringi under Microsoft Open R 3.4.4 (I am unable to update to a later version due to compatibility with another library used). I updated to 1.4.3 by updating the MRAN snapshot from the default 2018-04-01 to 2021-05-31. There was no difference in the result.
Unfortunately, I'm unable to reproduce the above, because you haven't provided me with enough details. Could you please prepare a concrete example that I could run on my machine (with a PDF file at http://somewhere/etc)?
One thing I could recommend is to always run text <- stri_replace_all_fixed(text, "\r\n", "\n") first in situations such as this.
Or operate on the output of stri_split_lines1 or stri_split_lines.
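A minimal sketch of both suggestions, assuming a made-up string in place of the real pdf_text() output (the sample text and the variable names are illustrative only):

```r
library(stringi)

# Hypothetical stand-in for the pdf_text() output, with Windows line endings
text <- "Header\r\nrow 1\r\nrow 2\r\nFooter"

# Option 1: normalise the line endings before matching
normalised <- stri_replace_all_fixed(text, "\r\n", "\n")

# Option 2: split into lines and work on the character vector instead
lines <- stri_split_lines1(text)

# Same kind of match as in the original script, run on the normalised text
m <- stri_match(normalised, regex = "Header(.*)Footer", dotall = TRUE)
section <- m[1, 2]
```

Option 2 avoids the need for dotall altogether, at the cost of locating the start and end delimiters by line index rather than with a single regex.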
Download the PDF (ASX_Energy_Margin_Parameters.pdf) and edit the code below to reflect its path. When the script is sourced, it prints NA. When the last two code statements are run interactively, the extracted text is displayed.
rm(list = ls())
suppressPackageStartupMessages(library(pdftools))
suppressPackageStartupMessages(library(stringi))
GetText <- function(filepath) {
  text <- pdf_text(filepath)
  text <- paste0(text, collapse = '')
  return(text)
}
GetSection <- function(text, startText, endText) {
  m <- stri_match(text, regex = paste0(startText, '(.*)', endText), dotall = TRUE)
  return(m[1,2])
}
file <- '<pdf path>'
text <- GetText(file)
initSpan <- GetSection(text, 'Australian Energy - Initial Margin Rates & Span Parameters', 'Australian Energy – Liquidity Margin Add-on Parrameters')
print(initSpan)
The following Python code, using PyMuPDF, does not show the same issue:
import fitz as pdf
import re
def extract_text_from_pdf(filename):
    doc = pdf.open(filename)
    pages = doc.pages()
    text = ''.join([page.getText() for page in pages])
    return text
text = extract_text_from_pdf('<pdf path>')
match = re.findall('Australian Energy - Initial Margin Rates & Span Parameters (.*)Australian Energy – Liquidity Margin Add-on Parrameters', text, re.DOTALL)[0]
Could there be something in the specifics of my R code that RStudio is covering up for?
Microsoft Open R 3.4.4, RStudio 1.2.5019, Windows 10 2004
Unfortunately I cannot reproduce the above problems, but I'm on Linux. I suspect this has nothing to do with stringi.
Have you tried replacing \r\n with \n manually?
Are you using the most recent version of pdftools?
Can you reproduce the above with the most recent version of R (on another computer)?
I'll see if I can find another machine to try it on, but I'm not convinced by the logic of your first two statements. The fact that it didn't happen on Linux doesn't imply that the library isn't at fault when the issue happens on Windows.
What I have said is that I cannot reproduce this error on Linux and hence it is difficult for me to help you. Still, I would like to find a solution to the above.
You have also not answered my question whether replacing \r\n with \n solves the issue.
One other option: could you please serialise the text objects in both settings (interactive and non-interactive) and post their dumps here? E.g., via dump("text", file="output_path...").
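A rough sketch of how such a dump could be written and read back for comparison. The sample string and the output path are placeholders; in the real script, text would be the value returned by GetText():

```r
# Hypothetical stand-in for the extracted text
text <- "row 1\r\nrow 2"

# Hypothetical output path; use a different file name in each setting
out <- file.path(tempdir(), "text_interactive.R")
dump("text", file = out)

# Read a dump back into its own environment so it does not overwrite `text`
env <- new.env()
source(out, local = env)

# Compare the restored value with the current one
same <- identical(env$text, text)

# Byte-level view, useful when two strings print the same but do not match
raw_bytes <- charToRaw(text)
```

Loading the interactive and the sourced dumps into two separate environments this way would let the two strings be compared with identical() and, if they differ, byte by byte.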
Hi @gagolews,
I've encountered the same behaviour as @totalgit74 - in my case with the function stri_replace_all_fixed.
However, I found the solution to my problem, which doesn't lie within stringi, but the source() command itself: https://stackoverflow.com/questions/5031630/how-to-source-r-file-saved-using-utf-8-encoding/
Basically, switching from source('filename.r') to eval(parse('filename.r', encoding = 'UTF-8')) solved my problem.
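Here is a small sketch of that approach, assuming a throwaway script written to a temporary file (the file, the variable endText and its value are made up for illustration). The script is saved as UTF-8 and contains an en dash (U+2013), the kind of character that can be misread when a file is sourced under a non-UTF-8 native encoding:

```r
# Write a tiny UTF-8 script to a temporary file; the en dash is written as
# its three UTF-8 bytes so this example does not depend on the current locale
script <- tempfile(fileext = ".R")
con <- file(script, open = "wb")
writeBin(c(charToRaw("endText <- 'Energy "),
           as.raw(c(0xe2, 0x80, 0x93)),
           charToRaw(" Liquidity'\n")), con)
close(con)

# Parse the file as UTF-8 and evaluate it in a separate environment
env <- new.env()
eval(parse(script, encoding = "UTF-8"), envir = env)

# The string literal keeps its original UTF-8 bytes
dash_bytes <- charToRaw(env$endText)[8:10]
```

The endText pattern in the original report also contains an en dash, so the same cause may apply there, although the thread does not confirm it. Writing the character as the escape "\u2013" in the pattern would be another way to make the script independent of the file's encoding.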
I know it's a bit off-topic and not related to stringi, but I thought it would maybe be useful to others searching the same problem.
Best, Martin
(closing due to inactivity)