Kenneth Benoit
Added to the issue: a script contributed by Arthur Stenzel (thanks Arthur!).

```r
# TIKA Script
# Andreas Niekler
# Gregor Wiedemann
# ===========
# Define function to extract text...
```
File is here: [01_er_5.txt](https://github.com/quanteda/readtext/files/2599099/01_er_5.txt)
I have also fixed it so that the file is there now, but please use https://kenbenoit.net/files/01_er_5.txt instead.
It is working again, but you need to change `https` to `http` until I get another SSL certificate.
Hi @lmullen, just getting back to this now that I have time. We're also preparing a CRAN release. I'd love to gain 30x more performance on the most commonly read...
I experimented with this in a branch, and it's trickier than it looks. Yes, `readr::read_file()` is faster, but handling the encoding file by file reduces the speed gains considerably...
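For anyone following along, here is a minimal sketch of what reading file by file with a separate encoding could look like. It assumes each file's encoding is already known, and `read_with_encodings()` is a hypothetical helper written for illustration, not a function in **readtext**. Only `readr::read_file()` and `readr::locale()` are real calls.

```r
library(readr)

# Hypothetical helper (not part of readtext): read each file with its own
# declared encoding. A locale object is built per file, which adds overhead
# compared with reading everything under one shared encoding.
read_with_encodings <- function(paths, encodings) {
  stopifnot(length(paths) == length(encodings))
  vapply(seq_along(paths), function(i) {
    read_file(paths[i], locale = locale(encoding = encodings[i]))
  }, character(1))
}

# Two small temporary files holding the same text in different encodings
txt <- "caf\u00e9"
f_utf8 <- tempfile(fileext = ".txt")
f_latin1 <- tempfile(fileext = ".txt")
writeBin(charToRaw(enc2utf8(txt)), f_utf8)
writeBin(iconv(txt, from = "UTF-8", to = "latin1", toRaw = TRUE)[[1]], f_latin1)

# Both elements should come back as the same UTF-8 string
result <- read_with_encodings(c(f_utf8, f_latin1), c("UTF-8", "latin1"))
```

The encodings are passed in explicitly here; detecting them per file would be an extra step on top of this and is not shown.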
I'm putting this on the long list for the next release.
@louislegum I think the above discussion has identified the problem as some non-standard metadata, but I'd be happy to take a look nonetheless. Can you send me an...
Note: It's only called `encoding2` to prevent NAMESPACE conflicts with **quanteda**. Let's drop the "2" once we remove the original function from **quanteda**.
Parsing the entire 241-document SOTU corpus worked (in 2022!) on my M1 Max Mac (with 64GB RAM) after a few minutes. The resulting fully parsed object has > 2.2m...