spacyr
Disable parallelism for huggingface/tokenizers
When using the transformers-based model de_dep_news_trf, I get a huggingface/tokenizers warning message in the console:
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using `tokenizers` before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
It appears that the underlying parallelism issue in huggingface/transformers might get fixed in the foreseeable future, as a commit that seems to address it was made just this week.
However, until that fix is released in transformers and spaCy, is there a way to set the mentioned environment variable when using spaCy through spacyr? The warning is printed in the console repeatedly until the R session is restarted, which is a nuisance. Setting spacy_tokenize(x, multithread = FALSE) does not influence the warning.
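A minimal sketch of one way this could be attempted from the R side. It assumes that spacyr runs Python inside the R process (via reticulate), so that an environment variable set with Sys.setenv() before spacy_initialize() is visible to Python when it starts. Whether this actually silences the warning has not been confirmed here.

```r
# Assumption: the variable has to be set before Python is started,
# because Python reads the process environment at startup.
Sys.setenv(TOKENIZERS_PARALLELISM = "false")

library(spacyr)

# Start spaCy only after the variable is in place.
spacy_initialize(model = "de_dep_news_trf")
```

If Python has already been initialized in the session, the variable would presumably need to be set in a fresh R session, before spacy_initialize() is called.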
Details and instructions on how to reproduce the warning
The warning message appears when using spacy_tokenize(x, what = "sentence"), but does not show up when using what = "words". The message is printed as black text like console output, not as blue text like normal R warnings.
The message is printed repeatedly, though not very frequently, maybe once a minute. It keeps appearing even after I've called spacy_finalize(). Only restarting the R session stops it. Setting the multithread argument in spacy_tokenize() does not influence whether the warning appears.
I can consistently reproduce the warning by executing the following code and then saving the code file in RStudio (the warning only appears on saving).
library(spacyr)
text_taxi <- "Franz jagt im komplett verwahrlosten Taxi quer durch Bayern. Franz jagt im komplett verwahrlosten Taxi."
spacy_initialize(model = "de_dep_news_trf")
spacy_tokenize(text_taxi,
               what = "sentence",
               multithread = FALSE,
               output = "data.frame")[, 2]
spacy_finalize()
# Now save the code file