Disable parallelism for huggingface/tokenizers

Open fgeeri opened this issue 2 years ago • 0 comments

When using the transformers-based model de_dep_news_trf I get a huggingface/tokenizers warning message in the console:

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

It appears that the underlying parallelism issue in huggingface/transformers might get fixed in the foreseeable future, as a commit that seems to address it was made just this week.

However, until that fix is released in transformers and spaCy, is there a way to set the mentioned environment variable when using spaCy through spacyr? The warning is printed in the console repeatedly until the R session is restarted, which is a nuisance. Setting spacy_tokenize(x, multithread = FALSE) does not influence the warning.
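One possibility, not confirmed to silence the warning in this setup, would be to set the variable from R with Sys.setenv() before spaCy is initialized. The sketch below assumes that an environment variable set in the R session is inherited by the Python process that spacyr starts, and that "false" is the value the tokenizers library expects. The spacyr calls are left commented out so the snippet runs without the package or the model installed.

```r
# Assumption: variables set here, before Python is started, are visible to
# the Python session that spacyr launches. Whether this silences the
# tokenizers warning is untested.
Sys.setenv(TOKENIZERS_PARALLELISM = "false")

# Check the value on the R side
Sys.getenv("TOKENIZERS_PARALLELISM")

# Then initialize spaCy as usual (requires spacyr and the model):
# library(spacyr)
# spacy_initialize(model = "de_dep_news_trf")
```

If this works, the call has to come before spacy_initialize(), since the warning text itself refers to the state of the process before the fork.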

Details and instructions on how to reproduce the warning

The warning message appears when using spacy_tokenize(x, what = "sentence"), but does not show up when using what = "words". The message is printed as black text like console output, not as blue text like normal R warnings.

The message seems to be printed repeatedly, but not very frequently, maybe once a minute. It keeps appearing even after I've called spacy_finalize(). Only restarting the R session stops the warning. Setting the multithread argument in spacy_tokenize does not influence whether the warning appears.

I can consistently reproduce the warning by executing the following code and then saving the code file in RStudio (the warning only appears on saving).

library(spacyr)

text_taxi <- "Franz jagt im komplett verwahrlosten Taxi quer durch Bayern. Franz jagt im komplett verwahrlosten Taxi."

spacy_initialize(model = "de_dep_news_trf")

spacy_tokenize(text_taxi, 
               what = "sentence", 
               multithread = FALSE,
               output = "data.frame")[,2]

spacy_finalize()

# Now save the code file

fgeeri • Nov 18 '22 14:11