thinc
thinc copied to clipboard
Segfault possibly caused by changes to `MaxOut` in v8.1.1
Problem
We have recently (since Sep 9th) been experiencing intermittent seg faults in our (Alibi's) Windows CI. These occur in some of our tests that use the en_core_web_md pipeline.
We believe we have narrowed down the cause to the Tok2Vec
component. Since our errors started on the same day v8.1.1 was released we are wondering if the changes to MaxOut
are the cause (https://github.com/explosion/thinc/pull/702)? Or perhaps the move to blis
v0.9 (https://github.com/explosion/thinc/pull/736)?
We've struggled to come up with a MWE, but for a comparison, we've repeated our CI 30x with v8.1.0 and 30x with v8.1.1. With v8.1.0 we experience no failures, whilst with v8.1.1 we experience seg faults >20% of the time.
Error traceback
From this CI workflow:
Windows fatal exception: access violation
Thread 0x00001a38 (most recent call first):
File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\threading.py", line 316 in wait
File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\threading.py", line 581 in wait
File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\tqdm\_monitor.py", line 60 in run
File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\threading.py", line 980 in _bootstrap_inner
File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\threading.py", line 937 in _bootstrap
Current thread 0x00000760 (most recent call first):
File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\spacy\ml\staticvectors.py", line 56 in forward
File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\thinc\model.py", line 291 in __call__
File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\thinc\layers\concatenate.py", line 44 in <listcomp>
File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\thinc\layers\concatenate.py", line 44 in forward
File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\thinc\model.py", line 291 in __call__
File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\thinc\layers\chain.py", line 55 in forward
File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\thinc\model.py", line 291 in __call__
File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\thinc\layers\chain.py", line 55 in forward
File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\thinc\model.py", line 315 in predict
File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\spacy\pipeline\tok2vec.py", line 125 in predict
File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\spacy\language.py", line 1020 in __call__
File "D:\a\alibi\alibi\alibi\explainers\tests\test_anchor_text.py", line 326 in test_lm_stopwords_punctuation
File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\_pytest\python.py", line 192 in pytest_pyfunc_call
File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\pluggy\_callers.py", line 39 in _multicall
File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\pluggy\_manager.py", line 80 in _hookexec
File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\pluggy\_hooks.py", line 265 in __call__
File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\_pytest\python.py", line [176](https://github.com/SeldonIO/alibi/actions/runs/3025334987/jobs/4867687688#step:7:177)1 in runtest
File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\_pytest\runner.py", line 166 in pytest_runtest_call
File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\pluggy\_callers.py", line 39 in _multicall
File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\pluggy\_manager.py", line 80 in _hookexec
File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\pluggy\_hooks.py", line 265 in __call__
File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\_pytest\runner.py", line 259 in <lambda>
File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\_pytest\runner.py", line 338 in from_call
File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\_pytest\runner.py", line 258 in call_runtest_hook
File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\_pytest\runner.py", line 219 in call_and_report
File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\_pytest\runner.py", line 130 in runtestprotocol
File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\_pytest\runner.py", line 111 in pytest_runtest_protocol
File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\pluggy\_callers.py", line 39 in _multicall
File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\pluggy\_manager.py", line 80 in _hookexec
File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\pluggy\_hooks.py", line 265 in __call__
File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\_pytest\main.py", line 347 in pytest_runtestloop
File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\pluggy\_callers.py", line 39 in _multicall
File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\pluggy\_manager.py", line 80 in _hookexec
File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\pluggy\_hooks.py", line 265 in __call__
File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\_pytest\main.py", line 322 in _main
File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\_pytest\main.py", line 268 in wrap_session
File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\_pytest\main.py", line 315 in pytest_cmdline_main
File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\pluggy\_callers.py", line 39 in _multicall
File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\pluggy\_manager.py", line 80 in _hookexec
File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\pluggy\_hooks.py", line 265 in __call__
File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\_pytest\config\__init__.py", line 164 in main
File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\_pytest\config\__init__.py", line [187](https://github.com/SeldonIO/alibi/actions/runs/3025334987/jobs/4867687688#step:7:188) in console_main
File "C:\hostedtoolcache\windows\Python\3.9.13\x64\Scripts\pytest.exe\__main__.py", line 7 in <module>
File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\runpy.py", line 87 in _run_code
File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\runpy.py", line 197 in _run_module_as_main
D:\a\_temp\511c6489-14db-4df4-820f-7b5[188](https://github.com/SeldonIO/alibi/actions/runs/3025334987/jobs/4867687688#step:7:189)c57acc.sh: line 2: 1[224](https://github.com/SeldonIO/alibi/actions/runs/3025334987/jobs/4867687688#step:7:225) Segmentation fault pytest -m "not tf1" alibi
..........................................XXX......ssss.......
Error: Process completed with exit code 139.
Platform, versions etc
- Platform: Github runner image
windows-2022
, version20220905.1
- Python versions tested:
3.9.12
,3.9.13
- pip env: see here
As further info, looking at the traceback again:
File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\spacy\ml\staticvectors.py", line 56 in forward
@RobertSamoilescu has noticed the failure appears to originate in staticvectors.py
here. We're therefore wondering if the recent change to gemm
(https://github.com/explosion/cython-blis/pull/72) is the cause? Please let us know if it would be better to open an issue over on cython-blis
, or if we can provide more info.
@RobertSamoilescu has noticed the failure appears to originate in
staticvectors.py
here. We're therefore wondering if the recent change togemm
(explosion/cython-blis#72) is the cause?
Unlikely, using uninitialized memory erroneously wouldn't cause segfaults, only garbage output. We found a bunch of issues in BLIS 0.9 where gemm
would read out-of-bound. These were fixed upstream and the fixes were added to cython-blis
. Since we thought the issues in BLIS were fixed, we changed the upper pin to include cython-blis
0.9.x in Thinc 8.1.1. But it seems like there are more memory issues 😢.
Thanks for reporting this!
Can you check whether you still see the same issue in an environment that has thinc 0.8.1 + blis 0.7.8? That would let us confirm for sure whether the issue is coming from thinc or blis.
Hi @danieldk and @njsmith, thanks for the quick responses. That's a nice idea re blis 0.7.8, we shall do that now and come back to you.
I've run some more CI runs here. The results are as follows:
- CI #430 to #460 is with
thinc 8.1.1
andblis 0.7.8
. There were no failures in 30x runs. - CI #462 to #491 is with
thinc 8.1.1
andblis 0.9.1
. There were 5x failures in 30x runs.
The error is rather intermittent so not 100% conclusive but this does suggest an issue with more recent blis
versions.
Hi! We've released thinc 8.1.2 today which restricts blis to <0.8.0
again instead of <0.10.0
. Thanks again for your report!
Thanks for the update @svlandeg!
@ascillitoe Could you reproduce the crash with the following env variable and post the stdout
output of the run: BLIS_ARCH_DEBUG=1
? This will help us narrow down which BLIS kernel is faulty - you should something along the lines of libblis: selecting sub-configuration 'haswell'.
in the log.
@ascillitoe Could you reproduce the crash with the following env variable and post the
stdout
output of the run:BLIS_ARCH_DEBUG=1
? This will help us narrow down which BLIS kernel is faulty - you should something along the lines oflibblis: selecting sub-configuration 'haswell'.
in the log.
Sure thing, I'll have a go!
Hi again. I've set some more CI runs going here (508 to 518).
Interesting, it looks like all the runs that fail (like this one) have:
libblis: selecting sub-configuration 'haswell'.
whereas the runs that pass (like this one) have:
libblis: Hardware has 2 FMA units; using 'skx' sub-config.
libblis: selecting sub-configuration 'skx'.
So it looks like it might be the haswell
kernels?
Thanks for the quick response! Yeah, the OOB access is in one of the haswell
kernels, which confirms our suspicion.