stri_match regex works interactively but not when sourcing script

Open totalgit74 opened this issue 4 years ago • 6 comments

I have a script that extracts text from a PDF file using:

text <- pdftools::pdf_text(...)
text <- paste0(text, collapse = '')

I then want to extract a match from this text string using a regular expression, effectively pulling the relevant table out of the text. Because the string contains \r\n, I set dotall = TRUE as follows:

data <- stri_match(text, regex = paste0(startText, '(.*)', endText), dotall = TRUE)
table <- data[1,2]

where startText and endText are the delimiting text elements that mark the boundaries of the table text. The issue is that, while this match works when run interactively from within RStudio, it returns NA when the script is sourced.

I was using version 1.1.7 of stringi under Microsoft R Open 3.4.4 (I am unable to update to a later version because of compatibility with another library I use). I updated stringi to 1.4.3 by moving the MRAN snapshot from the default 2018-04-01 to 2021-05-31. There was no difference in the result.

totalgit74, May 31 '21 23:05

Unfortunately, I'm unable to reproduce the above, because you haven't provided me with enough details. Could you please prepare a concrete example that I could run on my machine (with a PDF file at http://somewhere/etc)?

One thing I could recommend is to always run text <- stri_replace_all_fixed(text, "\r\n", "\n") first in situations such as this. Or operate on the output of stri_split_lines1 or stri_split_lines.
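
For readers following the Python variant that appears later in this thread, the same normalization idea might look like the sketch below. It uses only the standard library; the sample string is made up for illustration and is not taken from the PDF, and `re.DOTALL` plays the role of dotall = TRUE.

```python
import re

# Made-up sample standing in for text extracted on Windows (CRLF line endings).
sample = "START\r\nrow 1\r\nrow 2\r\nEND"

# Normalize CRLF to LF before matching, as suggested above.
normalized = sample.replace("\r\n", "\n")

# With DOTALL, "." also matches newlines, so the capture can span lines.
match = re.search(r"START(.*)END", normalized, re.DOTALL)
captured = match.group(1)

# Alternatively, work line by line; splitlines() handles \r\n and \n alike.
lines = sample.splitlines()
```

Normalizing first means any later pattern only has to account for one kind of line break.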

gagolews, Jun 01 '21 23:06

Download the PDF (ASX_Energy_Margin_Parameters.pdf) and edit the code below to reflect its path. When sourced, the script prints NA. When the last two code statements are run interactively, the parsed/extracted text is displayed.

rm(list = ls())

suppressPackageStartupMessages(library(pdftools))
suppressPackageStartupMessages(library(stringi))

GetText <- function(filepath) {
  text <- pdf_text(filepath)
  text <- paste0(text, collapse = '')
  
  return(text)
}

GetSection <- function(text, startText, endText) {
  m <- stri_match(text, regex = paste0(startText, '(.*)', endText), dotall = TRUE)
  
  return(m[1,2])
}

file <- '<pdf path>'

text <- GetText(file)

initSpan <- GetSection(text, 'Australian Energy - Initial Margin Rates & Span Parameters', 'Australian Energy – Liquidity Margin Add-on Parrameters')

print(initSpan)

The following Python code, which uses PyMuPDF, does not show the same issue:

import fitz as pdf
import re

def extract_text_from_pdf(filename):
    doc = pdf.open(filename)
    pages = doc.pages()
    text = ''.join([page.getText() for page in pages])

    return (text)

text = extract_text_from_pdf('<pdf path>')
match = re.findall('Australian Energy - Initial Margin Rates & Span Parameters (.*)Australian Energy – Liquidity Margin Add-on Parrameters', text, re.DOTALL)[0]

Could there be something in the specifics of my R code that RStudio is covering up for?

Microsoft R Open 3.4.4, RStudio 1.2.5019, Windows 10 2004

totalgit74, Jun 02 '21 00:06

Unfortunately I cannot reproduce the above problems, but I'm on Linux. I suspect this has nothing to do with stringi.

Have you tried replacing \r\n with \n manually? Are you using the most recent version of pdftools?

Can you reproduce the above with the most recent version of R (on another computer)?

gagolews, Jun 02 '21 05:06

I'll see if I can find another machine to try it on, but I'm not convinced by the logic of your first two statements. The fact that it didn't happen on Linux doesn't imply that the library isn't at fault when the issue happens on Windows.

totalgit74, Jun 02 '21 22:06

What I have said is that I cannot reproduce this error on Linux and hence it is difficult for me to help you. Still, I would like to find a solution to the above.

You have also not answered my question about whether replacing \r\n with \n solves the issue.

One other option: could you please serialise the text objects in both settings (interactive and non-interactive) and post their dumps here? E.g., via dump("text", file="output_path...").
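
Once both dumps are available, a small comparison helper can show where they diverge. The sketch below is in Python to match the snippet earlier in the thread; `first_difference` and `non_ascii` are hypothetical helper names, and nothing here is part of stringi or pdftools.

```python
def first_difference(a, b):
    """Return (index, char_a, char_b) at the first mismatch, or None if equal."""
    for i, (ca, cb) in enumerate(zip(a, b)):
        if ca != cb:
            return i, ca, cb
    if len(a) != len(b):
        # One string is a prefix of the other; report where the shorter one ends.
        i = min(len(a), len(b))
        return i, a[i:i + 1], b[i:i + 1]
    return None


def non_ascii(s):
    """List (index, code point) for every non-ASCII character in s."""
    return [(i, f"U+{ord(c):04X}") for i, c in enumerate(s) if ord(c) > 127]
```

Running `non_ascii` on the pattern string as well as on the extracted text may be worthwhile, since a mismatch can come from either side of the match.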

gagolews, Jun 02 '21 23:06

Hi @gagolews,

I've encountered the same behaviour as @totalgit74 - in my case with the function stri_replace_all_fixed.

However, I found the solution to my problem, which lies not within stringi but in the source() command itself: https://stackoverflow.com/questions/5031630/how-to-source-r-file-saved-using-utf-8-encoding/

Basically, switching from source('filename.r') to eval(parse('filename.r', encoding = 'UTF-8')) solved my problem.
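
One plausible mechanism, offered as an assumption rather than a confirmed diagnosis of the original report: the end marker in the script earlier in this thread contains a non-ASCII en dash (–). If a UTF-8 script file is read using a Windows code page such as Windows-1252, that character arrives as different characters and the literal no longer matches the text. The Python sketch below simulates that misreading; the marker is shortened and the sample text is illustrative only.

```python
import re

# Shortened form of the end marker; it contains an en dash (U+2013).
end_marker = "Energy – Liquidity"

# Simulate a UTF-8 file being decoded as Windows-1252 (an assumed code page).
misread_marker = end_marker.encode("utf-8").decode("cp1252")

# Illustrative text containing the real en dash.
text = "table rows\nAustralian Energy – Liquidity Margin"

# The correctly decoded marker is found; the misread one is not.
ok = re.search(re.escape(end_marker), text) is not None
broken = re.search(re.escape(misread_marker), text) is not None
```

If this is the cause, declaring the file encoding when the script is read, as in the workaround above, would be expected to restore the match.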

I know it's a bit off-topic and not related to stringi, but I thought it might be useful to others searching for the same problem.

Best, Martin

MartinGuth, Sep 16 '22 15:09

(closing due to inactivity)

gagolews, Nov 07 '23 00:11