Kenneth Benoit comments

Results: 264 comments of Kenneth Benoit

Add support for Apache Tika types

Added to the issue: a script contributed by Arthur Stenzel (thanks Arthur!).

```r
# TIKA Script
# Andreas Niekler
# Gregor Wiedemann
# ===========
# Define function to extract text...
```

URL to file for encoding() example invalid

File is here: [01_er_5.txt](https://github.com/quanteda/readtext/files/2599099/01_er_5.txt)

URL to file for encoding() example invalid

I also fixed it so that the file is there now, but use https://kenbenoit.net/files/01_er_5.txt instead.

URL to file for encoding() example invalid

Working again but need to change https to http until I get another SSL certificate.

Performance gains using readr::read_files()

Hi @lmullen, just getting back to this now that I have time. We're also preparing a CRAN release. I'd love to gain 30x more performance on the most commonly read...

Performance gains using readr::read_files()

I experimented with this in a branch, and it's trickier than it looks. Yes, `readr::read_file()` is faster, but doing it with file-by-file encoding reduces the speed gains considerably...
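To make the trade-off concrete, here is a minimal sketch of what file-by-file encoding with `readr::read_file()` could look like. It assumes the encoding of each file is already known. The helper name `read_files_with_encoding` and its arguments are hypothetical and are not taken from the branch mentioned above or from **readtext**.

```r
library(readr)

# Hypothetical helper (an assumption for illustration, not readtext's code):
# read each file with its own declared encoding and return UTF-8 text.
read_files_with_encoding <- function(paths, encodings) {
  stopifnot(length(paths) == length(encodings))
  texts <- mapply(
    # A separate locale object is built per file, so every file needs its
    # own read_file() call rather than one call for the whole batch.
    function(path, enc) read_file(path, locale = locale(encoding = enc)),
    paths, encodings,
    USE.NAMES = FALSE
  )
  names(texts) <- basename(paths)
  texts
}

# Example call with placeholder file names:
# txt <- read_files_with_encoding(c("a.txt", "b.txt"), c("latin1", "UTF-8"))
```

Because each file goes through its own call with its own locale, the per-file overhead is paid once per document, which is one plausible reason the gains shrink on corpora of many small files.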

Encoding handling not handled by stringi and possibly inconsistent

I'm putting this on the long list for the next release.

importing docvars from docx files

@louislegum I think the above discussion has identified the issue as being some non-standard metadata issue, but I'd be happy to take a look nonetheless. Can you send me an...

Add tests for encoding()

Note: It's only called `encoding2` to prevent NAMESPACE conflicts with **quanteda**. Let's drop the "2" once we remove the original function from **quanteda**.

Implement batch size and n_thread options

Parsing the entire 241-document SOTU corpus worked (in 2022!) on my M1 Max Mac (with 64GB RAM) after a few minutes. The resulting fully parsed object has > 2.2m...