LexisNexisTools
Bug in lnt_parse_uni
First of all, thank you Johannes for this very useful package!
I found something that appears to be a bug in the lnt_parse_uni function where rle() is called. The conditional that follows will result in some data being omitted if three blank lines are found. Specifically, the conditional removes all the data (articles/lines) preceding the first instance of three blank lines. I'm curious if this is intentional. Thanks!
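To illustrate, here is a minimal sketch with made-up lines (not real Nexis data): everything before the first run of three blank lines ends up in group 0, which is the part that gets dropped.
# toy input: one article, three blank lines, then another article
lines <- c("Article A", "some text", "", "", "", "Article B", "more text")
l <- rle(lines)
l$article <- cumsum(l$lengths > 2 & l$values == "")
# expand the per-run group IDs back to one ID per original line
article_id <- rep(l$article, l$lengths)
split(lines, article_id)
#> $`0`
#> [1] "Article A" "some text"
#>
#> $`1`
#> [1] ""          ""          ""          "Article B" "more text"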
Thanks for taking an interest in the package. You're actually the first person who has ever been interested in that part, so I'm a bit excited to explain. Whether this is a bug or intended behaviour depends on whether you get the correct results for yourself. For all my test cases it does behave correctly.
Luckily I left a comment in the code, otherwise I probably wouldn't remember:
https://github.com/JBGruber/LexisNexisTools/blob/ac84e4a9359d75b8d9d6ac589d632c788b56fdee/R/LexisNexisTools.R#L608-L619
The rle() part basically splits a document into the individual articles. I'll use the sample from the package to show what (basically) happens when you run the command:
library(LexisNexisTools)
# internal command to read lines from the docx file
lines <- LexisNexisTools:::lnt_read_lines(lnt_sample("docx", copy = FALSE))$uni
# here comes the rle
l <- rle(lines)
# a run of more than two (i.e. three or more) empty lines indicates a break between articles
l$article <- cumsum(l$lengths > 2 & l$values == "")
# use table to show how many lines are in each article
table(l$article)
#>
#> 0 1 2 3 4 5 6 7 8 9 10 11
#> 113 22 22 18 24 20 29 27 30 29 22 1
As you can see from this, article 0 contains 113 lines. If we inspect those lines, we find a lot of strange garbage. Opening the actual file, though, you'll see what this article 0 contains:
lnt_sample("docx", copy = FALSE)
#> [1] "/home/johannes/R/x86_64-pc-linux-gnu-library/4.0/LexisNexisTools/extdata/sample.DOCX"
Each document from LexisUni starts with a cover page outlining what's in the file. This information is useless for us as the articles are brought into a table format anyway. So the first entry is removed. I don't remember why I didn't just do l <- l$values[l$article != 0], which would have done the same. But I think there was a reason for it.
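For what it's worth, here is a sketch of the equivalent run-based removal, assuming the rle object l from above. One possible reason for not subsetting l$values directly is that rle() stores each run only once, so taking the values alone would collapse repeated lines; inverse.rle() restores one entry per original line.
# keep only runs that do not belong to "article" 0 (the cover page)
keep <- l$article != 0
cleaned <- inverse.rle(list(lengths = l$lengths[keep], values = l$values[keep]))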
If these lines remove more than just the cover page for you, I'll have to look into it again. But so far I'm unaware of cases which do not come with this useless information before the actual newspaper data.
Thanks for the elaborate response! I'll investigate your test files further when I have a chance. In the meantime, attached is my test file, where lnt_read omits the first 5 articles and starts partway through the 6th (just after the 3 blank lines). One thing you mentioned was the 'cover page', which I uncheck for my Nexis Uni downloads. I'll test further to see if that makes a difference. Files(16).DOCX
Looking at your file, this might be a big coincidence or indeed a bug. Either way, LexisNexisTools doesn't do the right thing here. I have a number of test files without a cover page, but none of them have double blank lines (as appear after the cover page). As a quick and easy fix I added a new argument to lnt_read which disables this part of the code, remove_cover:
library(LexisNexisTools)
#> LexisNexisTools Version 0.3.1.9000
lntoutput1 <- lnt_read("Files.16.DOCX", convert_date = FALSE, verbose = FALSE)
nrow(lntoutput1)
#> Articles
#> 11
lntoutput2 <- lnt_read("Files.16.DOCX", remove_cover = FALSE, verbose = FALSE)
nrow(lntoutput2)
#> Articles
#> 16
When remove_cover = FALSE, your file is parsed just fine. Just install the development version (remotes::install_github("JBGruber/LexisNexisTools")) and it should work.