Implement batch size and n_thread options
This is fine:

```r
> library("spacyr")
> sp <- spacy_parse(data_char_sentences)
spacy python option is already set, spacyr will use:
	condaenv = "spacy_condaenv"
successfully initialized (spaCy Version: 2.1.8, language model: en)
(python options: type = "condaenv", value = "spacy_condaenv")
```
But this crashes R:

```r
library("spacyr")
data(data_corpus_sotu, package = "quanteda.corpora")
sp <- spacy_parse(quanteda::texts(data_corpus_sotu))
```
Parsing the entire 241-document SOTU corpus did work (in 2022!) on my M1 Max Mac (64 GB RAM) after a few minutes; the resulting fully parsed object has more than 2.2 million tokens.
But it would be nice to implement the multi-threading arguments, including batch_size, that are documented at https://spacy.io/usage/processing-pipelines#multiprocessing.
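For reference, this is a minimal sketch of what spacyr would be wrapping on the Python side, assuming a spaCy release recent enough that nlp.pipe() accepts both batch_size and n_process (n_process arrived in spaCy 2.2.2; older 2.x releases exposed a since-removed n_threads argument instead, and the 2.1.8 shown above predates multiprocessing support):

```python
import spacy

# Load a small English pipeline (assumes en_core_web_sm is installed).
nlp = spacy.load("en_core_web_sm")

texts = ["The first document.", "The second document."] * 1000

# nlp.pipe() streams texts through the pipeline in batches:
# batch_size controls how many texts are buffered per batch, and
# n_process forks worker processes for parallel parsing.
for doc in nlp.pipe(texts, batch_size=50, n_process=2):
    tokens = [(t.text, t.pos_) for t in doc]
```

Presumably spacyr could accept these as extra arguments to spacy_parse() and forward them to the underlying pipe() call.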