
Error: ThreadPoolBuildError

HarshkumarP opened this issue 3 years ago • 1 comment

Hi,

I am using sentence-transformers, which works well on my local machine, but on the cloud (Compute Canada) it throws ThreadPoolBuildError: "The global thread pool has not been initialized." How do I initialize the thread pool? Is it related to the tokenizer?

Environment: Compute Canada, Python 3.9.6, job memory 20000 MB


Package versions:

- huggingface-hub 0.4.0
- nltk 3.6.7
- numpy 1.21.2+computecanada
- pandas 1.3.0+computecanada
- scikit-learn 1.0.1+computecanada
- scipy 1.7.3+computecanada
- sentence-transformers 2.1.0+computecanada
- sentencepiece 0.1.96+computecanada
- tokenizers 0.10.3+computecanada
- torch 1.10.0+computecanada
- torchvision 0.11.1+computecanada
- transformers 4.16.1


Code:

```python
from sentence_transformers import SentenceTransformer, util
import torch

model = SentenceTransformer('all-MiniLM-L6-v2')

text = ['The cat sits outside']
text_embeddings = model.encode(text)
```


Error:

```
Ignored unknown kwarg option direction
thread '<unnamed>' panicked at 'The global thread pool has not been initialized.: ThreadPoolBuildError { kind: IOError(Os { code: 11, kind: WouldBlock, message: "Resource temporarily unavailable" }) }', /home/coulombc/.cargo/registry/src/github.com-1ecc6299db9ec823/rayon-core-1.9.1/src/registry.rs:170:10
note: run with RUST_BACKTRACE=1 environment variable to display a backtrace
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/harshx/.local/lib/python3.9/site-packages/sentence_transformers/SentenceTransformer.py", line 153, in encode
    features = self.tokenize(sentences_batch)
  File "/home/harshx/.local/lib/python3.9/site-packages/sentence_transformers/SentenceTransformer.py", line 311, in tokenize
    return self._first_module().tokenize(texts)
  File "/home/harshx/.local/lib/python3.9/site-packages/sentence_transformers/models/Transformer.py", line 99, in tokenize
    output.update(self.tokenizer(*to_tokenize, padding=True, truncation='longest_first', return_tensors="pt", max_length=self.max_seq_length))
  File "/home/harshx/.local/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 2443, in __call__
    return self.batch_encode_plus(
  File "/home/harshx/.local/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 2634, in batch_encode_plus
    return self._batch_encode_plus(
  File "/home/harshx/.local/lib/python3.9/site-packages/transformers/tokenization_utils_fast.py", line 424, in _batch_encode_plus
    encodings = self._tokenizer.encode_batch(
pyo3_runtime.PanicException: The global thread pool has not been initialized.: ThreadPoolBuildError { kind: IOError(Os { code: 11, kind: WouldBlock, message: "Resource temporarily unavailable" }) }
```
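(Editor's note, not part of the original report: the `Os { code: 11, ..., message: "Resource temporarily unavailable" }` part is `EAGAIN` from thread creation, which on shared clusters often means a per-user process/thread limit was hit. A quick diagnostic sketch to inspect that limit from Python on Linux:)

```python
import resource

# RLIMIT_NPROC caps how many processes/threads this user may create;
# exceeding it makes thread creation fail with EAGAIN
# ("Resource temporarily unavailable"), matching the panic above.
soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
print(f"RLIMIT_NPROC: soft={soft} hard={hard}")
```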

HarshkumarP avatar Feb 01 '22 17:02 HarshkumarP

Can you try running your script with TOKENIZERS_PARALLELISM=false set? This should deactivate parallelism within tokenizers and remove the error. If this works, you can write your code single-threaded first, and then move to multi-threading/multi-processing directly in Python, so it is easier for you to understand what goes on.
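A minimal sketch of that workaround: setting the variable from Python also works, as long as it happens before the tokenizer is first used (the commented-out model code is from the report above).

```python
import os

# Must be set before the fast tokenizer builds its thread pool,
# i.e. before the first encode() call.
os.environ["TOKENIZERS_PARALLELISM"] = "false"

# from sentence_transformers import SentenceTransformer
# model = SentenceTransformer('all-MiniLM-L6-v2')
# text_embeddings = model.encode(['The cat sits outside'])
```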

I don't really know the Compute Canada environment. Is there more information out there? Is it Linux-based, POSIX, that sort of thing? Maybe parallelism is actually disabled there?

Narsil avatar Feb 02 '22 09:02 Narsil

I got the same problem, but setting TOKENIZERS_PARALLELISM=false didn't solve it for me. Has anybody been able to solve it differently, please? x(

sarrahbbh avatar Jun 01 '23 07:06 sarrahbbh

@sarrahbbh Please re-open an issue with the appropriate details to reproduce it

Narsil avatar Jun 07 '23 07:06 Narsil

@Narsil actually, using a smaller number of GPUs than the number available in my VM solved the issue for me (I guess this is down to how the framework I used is implemented?), but thank you anyway!

sarrahbbh avatar Jun 07 '23 07:06 sarrahbbh
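(Editor's note: one common way to apply the GPU-count workaround above is to limit which devices the process sees via `CUDA_VISIBLE_DEVICES`; this is a hedged sketch, the device index is illustrative, and the variable must be set before CUDA is initialized.)

```python
import os

# Expose only GPU 0 to this process; the index "0" is illustrative.
# Must be set before torch (or any CUDA library) initializes CUDA.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# import torch
# torch.cuda.device_count()  # would now report at most 1
```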