
Bug in lnt_parse_uni

Open JaySLee opened this issue 5 years ago • 3 comments

First of all, thank you Johannes for this very useful package!

I found what appears to be a bug in the lnt_parse_uni function where rle() is called. The conditional that follows will result in some data being omitted if three blank lines are found. Specifically, the conditional removes all the data (articles/lines) preceding the first instance of three blank lines. I'm curious if this is intentional. Thanks!
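To illustrate what I mean with a toy example (this is not the package's actual code, just a sketch of the behaviour I'm seeing): if everything before the first run of three or more blank lines is discarded, an article that precedes such a run is lost:

lines <- c("Article 1", "", "", "", "Article 2")
l <- rle(lines)
# index of the first run of 3+ consecutive blank lines
first_break <- which(l$lengths > 2 & l$values == "")[1]
# dropping everything up to (and including) that run loses Article 1
if (!is.na(first_break)) {
  l$lengths <- l$lengths[-seq_len(first_break)]
  l$values <- l$values[-seq_len(first_break)]
}
inverse.rle(l)
#> [1] "Article 2"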

JaySLee avatar Jun 29 '20 14:06 JaySLee

Thanks for taking an interest in the package. You're actually the first person who has ever been interested in that part, so I'm a bit excited to explain. Whether this is a bug or intended behaviour depends on whether you get the correct results yourself. For all my test cases, it behaves correctly.

Luckily I left a comment in the code, otherwise I probably wouldn't remember:

https://github.com/JBGruber/LexisNexisTools/blob/ac84e4a9359d75b8d9d6ac589d632c788b56fdee/R/LexisNexisTools.R#L608-L619

The rle() part basically splits a document into the individual articles. I'll use the sample from the package to show what (basically) happens when you run the command:

library(LexisNexisTools)

# internal command to read lines from the docx file
lines <- LexisNexisTools:::lnt_read_lines(lnt_sample("docx", copy = FALSE))$uni
# here comes the rle
l <- rle(lines)
# a run of three or more empty lines indicates a break between articles
l$article <- cumsum(l$lengths > 2 & l$values == "")
# use table to show how many lines are in each article
table(l$article)
#> 
#>   0   1   2   3   4   5   6   7   8   9  10  11 
#> 113  22  22  18  24  20  29  27  30  29  22   1

As you can see, article 0 contains 113 lines. If we inspect those lines, we find a lot of strange garbage. One way to peek at them (a quick sketch reusing lines and l from above):
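# map each original line back to its article number
article_of_line <- rep(l$article, l$lengths)
# peek at the first few lines assigned to article 0
head(lines[article_of_line == 0], 10)

Opening the actual file, though, you'll see what this article 0 contains: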

lnt_sample("docx", copy = FALSE)
#> [1] "/home/johannes/R/x86_64-pc-linux-gnu-library/4.0/LexisNexisTools/extdata/sample.DOCX"

Each document from LexisUni starts with a cover page outlining what's in the file. This information is useless for us, as the articles are brought into a table format anyway, so the first entry is removed. I don't remember why I didn't just do l <- l$values[l$article != 0], which would have done the same. But I think there was a reason for it.
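In case it helps to see it spelled out, here is a minimal sketch of that removal, assuming it just drops the runs assigned to article 0 and rebuilds the remaining lines (the actual code linked above differs in the details):

l2 <- l
keep <- l2$article != 0
# drop the runs that belong to the cover page (article 0)
l2$lengths <- l2$lengths[keep]
l2$values <- l2$values[keep]
l2$article <- l2$article[keep]
# rebuild the per-line vector without the cover page
body_lines <- inverse.rle(l2)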

If these lines remove more than just the cover page for you, I'll have to look into it again. But so far I'm unaware of cases that do not come with this useless information before the actual newspaper data.

JBGruber avatar Jun 29 '20 19:06 JBGruber

Thanks for the elaborate response! I'll investigate your test files further when I have a chance. In the meantime, attached is my test file, where lnt_read omits the first 5 articles and starts partway through the 6th (just after the 3 blank lines). One thing you mentioned was the 'cover page', which I 'uncheck' for my Nexis Uni downloads. I'll test further to see if that makes a difference. Files(16).DOCX

JaySLee avatar Jul 01 '20 13:07 JaySLee

Looking at your file, this might be a big coincidence or indeed a bug. Either way, LexisNexisTools doesn't do the right thing here. I have a number of test files without a cover page, but none of them have double blank lines (as after the cover page). As a quick and easy fix, I added a new argument to lnt_read, remove_cover, which disables this part of the code:

library(LexisNexisTools)
#> LexisNexisTools Version 0.3.1.9000
lntoutput1 <- lnt_read("Files.16.DOCX", convert_date = FALSE, verbose = FALSE)
nrow(lntoutput1)
#> Articles 
#>       11
lntoutput2 <- lnt_read("Files.16.DOCX", remove_cover = FALSE, verbose = FALSE)
nrow(lntoutput2)
#> Articles 
#>       16

When remove_cover = FALSE, your file is parsed just fine. Just install the development version (remotes::install_github("JBGruber/LexisNexisTools")) and it should work.
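If you want to verify that your installed version already has the new argument, a quick check with base R (nothing assumed beyond the argument name confirmed above):

library(LexisNexisTools)
# TRUE once the development version with remove_cover is installed
"remove_cover" %in% names(formals(lnt_read))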

JBGruber avatar Jul 01 '20 16:07 JBGruber