
Implement batch size and n_thread options

Open · kbenoit opened this issue 4 years ago · 1 comment

This is fine:

> library("spacyr")
> sp <- spacy_parse(data_char_sentences)
spacy python option is already set, spacyr will use:
	condaenv = "spacy_condaenv"
successfully initialized (spaCy Version: 2.1.8, language model: en)
(python options: type = "condaenv", value = "spacy_condaenv")

But this crashes R:

library("spacyr")

data(data_corpus_sotu, package = "quanteda.corpora")
sp <- spacy_parse(quanteda::texts(data_corpus_sotu))
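A possible workaround sketch while no batching options exist (this is not part of spacyr, and the chunk size of 25 is arbitrary): split the texts into smaller groups, parse each group separately, and bind the resulting data frames.

library("spacyr")
data(data_corpus_sotu, package = "quanteda.corpora")

txts <- quanteda::texts(data_corpus_sotu)
# split into chunks of 25 documents so each call into spaCy stays small
chunks <- split(txts, ceiling(seq_along(txts) / 25))
sp_list <- lapply(chunks, spacy_parse)
sp <- do.call(rbind, sp_list)   # spacy_parse() returns a data.frame per chunk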

kbenoit commented on Feb 27, 2020

Parsing the entire 241-document SOTU corpus worked (in 2022!) on my M1 Max Mac (with 64 GB RAM) after a few minutes. The resulting fully parsed object has more than 2.2 million tokens.

But it would be nice to implement the batching and multiprocessing arguments, including batch_size, documented in https://spacy.io/usage/processing-pipelines#multiprocessing.
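As a hypothetical sketch of what the requested interface might look like (none of these arguments exist in spacyr yet), spacy_parse() could forward batch_size and a process count through to spaCy's nlp.pipe(), mirroring the options in the linked documentation:

sp <- spacy_parse(
  quanteda::texts(data_corpus_sotu),
  batch_size = 1000,  # hypothetical: number of texts per spaCy batch
  n_process = 4       # hypothetical: worker processes passed to nlp.pipe()
)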

kbenoit commented on Sep 1, 2022