scispacy
rxnorm linker doesn't work with multiprocessing?
Hi, I'm getting an error trying to run nlp.pipe with n_process > 1, I think because the pickling that multiprocessing does under the hood interacts poorly with nmslib.dist.FloatIndex, which the rxnorm entity linker requires and which does not seem to be picklable.
Minimal code:
import spacy
import scispacy
from scispacy.linking import EntityLinker

TEXTS = ["Hello! This is document 1.", "And here's doc 2."]

if __name__ == '__main__':
    nlp = spacy.load("en_core_sci_sm")
    nlp.add_pipe("scispacy_linker", config={"resolve_abbreviations": True,
                                            "linker_name": "rxnorm"})
    for doc in nlp.pipe(TEXTS, n_process=2):
        print(doc)
Running with Python 3.8.5 gives me:
Traceback (most recent call last):
File "./mwerror.py", line 13, in <module>
for doc in nlp.pipe(TEXTS, n_process=2):
File ".../python3.8/site-packages/spacy/language.py", line 1479, in pipe
for doc in docs:
File ".../python3.8/site-packages/spacy/language.py", line 1515, in _multiprocessing_pipe
proc.start()
File ".../python3.8/multiprocessing/process.py", line 121, in start
self._popen = self._Popen(self)
File ".../python3.8/multiprocessing/context.py", line 224, in _Popen
return _default_context.get_context().Process._Popen(process_obj)
File ".../python3.8/multiprocessing/context.py", line 284, in _Popen
return Popen(process_obj)
File ".../python3.8/multiprocessing/popen_spawn_posix.py", line 32, in __init__
super().__init__(process_obj)
File ".../python3.8/multiprocessing/popen_fork.py", line 19, in __init__
self._launch(process_obj)
File ".../python3.8/multiprocessing/popen_spawn_posix.py", line 47, in _launch
reduction.dump(process_obj, fp)
File ".../python3.8/multiprocessing/reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)
TypeError: cannot pickle 'nmslib.dist.FloatIndex' object
Note that I don't get an error with n_process=1, presumably because multiprocessing is not invoked.
I also do not get this error if I don't include the linker pipe (i.e. comment out the add_pipe() line above).
Thanks! This lib is great!
Hey, it seems to work as expected (i.e. doesn't crash) on Linux? The error above was from running on macOS 10.14.6.
(FYI, I suspect it might have something to do with multiprocessing using spawn rather than fork by default on macOS as of Python 3.8 [doc link], but IDK.)
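If that's the cause, forcing the old fork start method before any processes get started might sidestep the pickling entirely. Untested sketch, and fork on macOS has its own caveats with some native libraries:

import multiprocessing

if __name__ == '__main__':
    # Untested: restore the pre-3.8 default so child processes inherit the
    # parent's memory (loaded nmslib index included) instead of pickling it.
    # Must run before any pools/processes are created.
    multiprocessing.set_start_method("fork")
    # ... then load the pipeline and call nlp.pipe(TEXTS, n_process=2) as above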
Interesting, not sure off the top of my head. Leaving this open for now; let me know if you happen to resolve anything. At a minimum, you could do the parallelization yourself (rough sketch below), but ideally it would work with spacy's parallelization.
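Something like this might work, assuming each worker loads its own copy of the pipeline so the nmslib index never has to cross a process boundary. Untested sketch, with the model and linker names from your snippet:

import multiprocessing as mp

import spacy
from scispacy.linking import EntityLinker  # noqa: F401 (registers the factory)

_nlp = None  # one pipeline per worker process, loaded lazily

def process_text(text):
    global _nlp
    if _nlp is None:
        # First call in this worker: load locally instead of pickling the
        # parent's pipeline. Note this pays the (large) linker load cost
        # once per worker.
        _nlp = spacy.load("en_core_sci_sm")
        _nlp.add_pipe("scispacy_linker", config={"resolve_abbreviations": True,
                                                 "linker_name": "rxnorm"})
    return _nlp(text).text  # return something cheaply picklable, not the Doc

if __name__ == '__main__':
    texts = ["Hello! This is document 1.", "And here's doc 2."]
    with mp.Pool(processes=2) as pool:
        print(pool.map(process_text, texts))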
I actually initially tried doing the parallelization myself with joblib, calling nlp() inside the parallelized code, and it gave me the same error as the spacy nlp.pipe snippet I posted.
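Roughly this pattern, if it helps (with nlp and TEXTS as in my earlier snippet); joblib has to pickle nlp to ship it to the workers, so it hits the same nmslib pickling wall:

from joblib import Parallel, delayed

# Fails the same way: delayed(nlp) means joblib must pickle the whole
# pipeline (nmslib index included) to send it to each worker.
docs = Parallel(n_jobs=2)(delayed(nlp)(text) for text in TEXTS)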
Will let you know if I come across anything, but it seems to work fine on Linux FWIW.