gensim
CoherenceModel does not finish with computing
Problem description
When computing coherence scores, the computation never finishes on a somewhat larger dataset. Run the code below (with the provided dataset) to reproduce.
Steps/code/corpus to reproduce
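import pickle
import time
from datetime import datetime

from gensim.models import CoherenceModel

# Load the pickled topic model and tokenized texts attached to this issue.
with open("coherence-bug.pkl", "rb") as f:
    model, tokens = pickle.load(f)
print("coherence")
print(datetime.now())
t = time.time()
cm = CoherenceModel(model=model, texts=tokens, coherence="c_v")
coherence = cm.get_coherence()  # this call never returns
print(time.time() - t)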
Versions
The bug appears on Gensim version 4.2, but it does not happen on 4.1.2
macOS-10.16-x86_64-i386-64bit
Python 3.8.12 (default, Oct 12 2021, 06:23:56) [Clang 10.0.0 ]
Bits 64
NumPy 1.22.3
SciPy 1.8.1
gensim 4.2.1.dev0
FAST_VERSION 0
@silviatti could you check this one please? #3197 was the only change in CoherenceModel, although I don't see how it's related.
@PrimozGodec could you interrupt your stuck computation with ctrl-c and post the traceback? Thanks.
Thank you for your fast response. Here is the traceback.
Process AccumulatingWorker-1:
Process AccumulatingWorker-2:
Process AccumulatingWorker-3:
Traceback (most recent call last):
File "/Users/primoz/miniconda3/envs/orange3/lib/python3.8/multiprocessing/process.py", line 318, in _bootstrap
util._exit_function()
File "/Users/primoz/miniconda3/envs/orange3/lib/python3.8/multiprocessing/util.py", line 360, in _exit_function
_run_finalizers()
File "/Users/primoz/miniconda3/envs/orange3/lib/python3.8/multiprocessing/util.py", line 300, in _run_finalizers
finalizer()
File "/Users/primoz/miniconda3/envs/orange3/lib/python3.8/multiprocessing/util.py", line 224, in __call__
res = self._callback(*self._args, **self._kwargs)
File "/Users/primoz/miniconda3/envs/orange3/lib/python3.8/multiprocessing/queues.py", line 195, in _finalize_join
thread.join()
File "/Users/primoz/miniconda3/envs/orange3/lib/python3.8/threading.py", line 1011, in join
self._wait_for_tstate_lock()
File "/Users/primoz/miniconda3/envs/orange3/lib/python3.8/threading.py", line 1027, in _wait_for_tstate_lock
elif lock.acquire(block, timeout):
KeyboardInterrupt
Traceback (most recent call last):
File "/Users/primoz/miniconda3/envs/orange3/lib/python3.8/multiprocessing/process.py", line 318, in _bootstrap
util._exit_function()
File "/Users/primoz/miniconda3/envs/orange3/lib/python3.8/multiprocessing/util.py", line 360, in _exit_function
_run_finalizers()
File "/Users/primoz/miniconda3/envs/orange3/lib/python3.8/multiprocessing/util.py", line 300, in _run_finalizers
finalizer()
File "/Users/primoz/miniconda3/envs/orange3/lib/python3.8/multiprocessing/util.py", line 224, in __call__
res = self._callback(*self._args, **self._kwargs)
File "/Users/primoz/miniconda3/envs/orange3/lib/python3.8/multiprocessing/queues.py", line 195, in _finalize_join
thread.join()
File "/Users/primoz/miniconda3/envs/orange3/lib/python3.8/threading.py", line 1011, in join
self._wait_for_tstate_lock()
File "/Users/primoz/miniconda3/envs/orange3/lib/python3.8/threading.py", line 1027, in _wait_for_tstate_lock
elif lock.acquire(block, timeout):
KeyboardInterrupt
Traceback (most recent call last):
File "/Users/primoz/miniconda3/envs/orange3/lib/python3.8/multiprocessing/process.py", line 318, in _bootstrap
util._exit_function()
File "/Users/primoz/miniconda3/envs/orange3/lib/python3.8/multiprocessing/util.py", line 360, in _exit_function
_run_finalizers()
File "/Users/primoz/miniconda3/envs/orange3/lib/python3.8/multiprocessing/util.py", line 300, in _run_finalizers
finalizer()
File "/Users/primoz/miniconda3/envs/orange3/lib/python3.8/multiprocessing/util.py", line 224, in __call__
res = self._callback(*self._args, **self._kwargs)
File "/Users/primoz/miniconda3/envs/orange3/lib/python3.8/multiprocessing/queues.py", line 195, in _finalize_join
thread.join()
File "/Users/primoz/miniconda3/envs/orange3/lib/python3.8/threading.py", line 1011, in join
self._wait_for_tstate_lock()
File "/Users/primoz/miniconda3/envs/orange3/lib/python3.8/threading.py", line 1027, in _wait_for_tstate_lock
elif lock.acquire(block, timeout):
KeyboardInterrupt
Thanks for this. I've literally learned Orange for topic modelling, and now it is unusable for me.
Thanks. I've literally learned Orange and all of this for topic modeling, and all my projects have now been on standby for two months already.
I also had the issue that the cm.get_coherence() call would not terminate for larger lists of texts. Here's how I "fixed" it:
The underlying issue was not actually caused by Gensim, but by a problem on my end. I presume that, because of multiprocessing, Gensim did not raise the error properly and simply never terminated. You can find out whether you have the same problem using the following steps:
1. Disable multiprocessing:
cm = CoherenceModel(model=model, texts=tokens, coherence="c_v", processes=1)
2. Rerun the code and see if there is an error.
For me it was an IndexError: there was a bug in my upstream code, and some empty texts, i.e. empty lists, snuck into the final tokens list. If you also get an error, you might be able to fix it like this:
3. Identify the underlying upstream bug and fix it.
4. Rerun the code. cm.get_coherence() should now return a coherence score!
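To check whether empty texts are the culprit before rerunning, something like the following might help. It is only a minimal sketch: the helper names `find_empty_texts` and `drop_empty_texts` are made up for illustration and are not part of Gensim, and the small `tokens` list stands in for the real data. Filtering out the empty documents is a workaround, not a replacement for fixing the upstream bug.

```python
from typing import List, Sequence


def find_empty_texts(texts: Sequence[Sequence[str]]) -> List[int]:
    """Return the indices of documents that contain no tokens."""
    return [i for i, doc in enumerate(texts) if len(doc) == 0]


def drop_empty_texts(texts: Sequence[Sequence[str]]) -> List[List[str]]:
    """Return a copy of `texts` with the empty documents removed."""
    return [list(doc) for doc in texts if len(doc) > 0]


# Hypothetical stand-in for the real `tokens` list (a list of token lists).
tokens = [["topic", "model"], [], ["coherence", "score"], []]

# Report where the empty documents are, so the upstream bug can be traced.
empty_indices = find_empty_texts(tokens)
print("empty documents at indices:", empty_indices)

# Workaround: keep only the non-empty documents.
clean_tokens = drop_empty_texts(tokens)
print("documents kept:", len(clean_tokens))
```

If the list of indices is not empty, the cleaned list can be passed as texts=clean_tokens together with processes=1 to see whether cm.get_coherence() then returns.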
I hope this helps! Though I'm not sure if @PrimozGodec and @nadiaelen are facing the same issue, the root cause might still lie somewhere with multiprocessing.
@felixrech thank you for the suggestion. When switching to processes=1 I found the error.