
Segfault possibly caused by changes to `MaxOut` in v8.1.1

Open ascillitoe opened this issue 2 years ago • 11 comments

Problem

We have recently (since Sep 9th) been experiencing intermittent seg faults in our (Alibi's) Windows CI. These occur in some of our tests that use the en_core_web_md pipeline.

We believe we have narrowed down the cause to the Tok2Vec component. Since our errors started on the same day v8.1.1 was released we are wondering if the changes to MaxOut are the cause (https://github.com/explosion/thinc/pull/702)? Or perhaps the move to blis v0.9 (https://github.com/explosion/thinc/pull/736)?

We've struggled to come up with an MWE, but as a comparison we've repeated our CI 30 times with v8.1.0 and 30 times with v8.1.1. With v8.1.0 we saw no failures, whilst with v8.1.1 we hit segfaults in more than 20% of runs.
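
For reference, a rough sketch of the kind of reproduction loop we've been trying is below (the text and iteration count are arbitrary; in practice the crash only shows up intermittently on the Windows runners):

    # Hypothetical reproduction sketch: repeatedly run the en_core_web_md
    # pipeline, which exercises the Tok2Vec component on every call.
    import spacy

    nlp = spacy.load("en_core_web_md")

    for _ in range(1000):
        # Any text works; when the crash happens it is inside the Tok2Vec
        # forward pass (see the traceback below).
        nlp("This is a short sentence used to exercise the pipeline.")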

Error traceback

From this CI workflow:

Windows fatal exception: access violation

Thread 0x00001a38 (most recent call first):
  File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\threading.py", line 316 in wait
  File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\threading.py", line 581 in wait
  File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\tqdm\_monitor.py", line 60 in run
  File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\threading.py", line 980 in _bootstrap_inner
  File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\threading.py", line 937 in _bootstrap

Current thread 0x00000760 (most recent call first):
  File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\spacy\ml\staticvectors.py", line 56 in forward
  File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\thinc\model.py", line 291 in __call__
  File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\thinc\layers\concatenate.py", line 44 in <listcomp>
  File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\thinc\layers\concatenate.py", line 44 in forward
  File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\thinc\model.py", line 291 in __call__
  File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\thinc\layers\chain.py", line 55 in forward
  File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\thinc\model.py", line 291 in __call__
  File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\thinc\layers\chain.py", line 55 in forward
  File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\thinc\model.py", line 315 in predict
  File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\spacy\pipeline\tok2vec.py", line 125 in predict
  File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\spacy\language.py", line 1020 in __call__
  File "D:\a\alibi\alibi\alibi\explainers\tests\test_anchor_text.py", line 326 in test_lm_stopwords_punctuation
  File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\_pytest\python.py", line 192 in pytest_pyfunc_call
  File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\pluggy\_callers.py", line 39 in _multicall
  File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\pluggy\_manager.py", line 80 in _hookexec
  File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\pluggy\_hooks.py", line 265 in __call__
  File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\_pytest\python.py", line [176](https://github.com/SeldonIO/alibi/actions/runs/3025334987/jobs/4867687688#step:7:177)1 in runtest
  File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\_pytest\runner.py", line 166 in pytest_runtest_call
  File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\pluggy\_callers.py", line 39 in _multicall
  File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\pluggy\_manager.py", line 80 in _hookexec
  File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\pluggy\_hooks.py", line 265 in __call__
  File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\_pytest\runner.py", line 259 in <lambda>
  File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\_pytest\runner.py", line 338 in from_call
  File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\_pytest\runner.py", line 258 in call_runtest_hook
  File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\_pytest\runner.py", line 219 in call_and_report
  File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\_pytest\runner.py", line 130 in runtestprotocol
  File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\_pytest\runner.py", line 111 in pytest_runtest_protocol
  File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\pluggy\_callers.py", line 39 in _multicall
  File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\pluggy\_manager.py", line 80 in _hookexec
  File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\pluggy\_hooks.py", line 265 in __call__
  File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\_pytest\main.py", line 347 in pytest_runtestloop
  File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\pluggy\_callers.py", line 39 in _multicall
  File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\pluggy\_manager.py", line 80 in _hookexec
  File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\pluggy\_hooks.py", line 265 in __call__
  File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\_pytest\main.py", line 322 in _main
  File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\_pytest\main.py", line 268 in wrap_session
  File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\_pytest\main.py", line 315 in pytest_cmdline_main
  File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\pluggy\_callers.py", line 39 in _multicall
  File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\pluggy\_manager.py", line 80 in _hookexec
  File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\pluggy\_hooks.py", line 265 in __call__
  File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\_pytest\config\__init__.py", line 164 in main
  File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\_pytest\config\__init__.py", line [187](https://github.com/SeldonIO/alibi/actions/runs/3025334987/jobs/4867687688#step:7:188) in console_main
  File "C:\hostedtoolcache\windows\Python\3.9.13\x64\Scripts\pytest.exe\__main__.py", line 7 in <module>
  File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\runpy.py", line 87 in _run_code
  File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\runpy.py", line 197 in _run_module_as_main
D:\a\_temp\511c6489-14db-4df4-820f-7b5188c57acc.sh: line 2:  1224 Segmentation fault      pytest -m "not tf1" alibi
..........................................XXX......ssss.......
Error: Process completed with exit code 139.

Platform, versions etc

  • Platform: Github runner image windows-2022, version 20220905.1
  • Python versions tested: 3.9.12, 3.9.13
  • pip env: see here

ascillitoe avatar Sep 22 '22 09:09 ascillitoe

As further info, looking at the traceback again:

File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\spacy\ml\staticvectors.py", line 56 in forward

@RobertSamoilescu has noticed the failure appears to originate in staticvectors.py here. We're therefore wondering if the recent change to gemm (https://github.com/explosion/cython-blis/pull/72) is the cause? Please let us know if it would be better to open an issue over on cython-blis, or if we can provide more info.
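
In case it helps with narrowing things down, here is a rough sketch of how one might exercise the same gemm path directly through thinc, without going through spaCy (the shapes are made up; on the default NumpyOps this should route through cython-blis):

    import numpy as np
    from thinc.api import get_current_ops

    ops = get_current_ops()

    # Made-up shapes; StaticVectors.forward multiplies word vectors by an
    # output weight matrix in roughly this fashion.
    X = np.random.rand(512, 300).astype("float32")
    W = np.random.rand(96, 300).astype("float32")

    for _ in range(1000):
        Y = ops.gemm(X, W, trans2=True)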

ascillitoe avatar Sep 22 '22 09:09 ascillitoe

@RobertSamoilescu has noticed the failure appears to originate in staticvectors.py here. We're therefore wondering if the recent change to gemm (explosion/cython-blis#72) is the cause?

Unlikely: erroneously using uninitialized memory wouldn't cause segfaults, only garbage output. We found a bunch of issues in BLIS 0.9 where gemm would read out of bounds. These were fixed upstream and the fixes were included in cython-blis. Since we thought the issues in BLIS were fixed, we relaxed the upper pin to allow cython-blis 0.9.x in Thinc 8.1.1. But it seems like there are more memory issues 😢.

Thanks for reporting this!

danieldk avatar Sep 23 '22 05:09 danieldk

Can you check whether you still see the same issue in an environment that has thinc 8.1.1 + blis 0.7.8? That would let us confirm for sure whether the issue is coming from thinc or from blis.
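
To make sure the comparison isn't muddied by a stray transitive upgrade, it's also worth printing the versions that actually got installed in each CI job, for example with a small standard-library snippet like:

    # Print the installed thinc and blis versions, e.g. as a CI step
    # before running pytest.
    from importlib.metadata import version

    print("thinc:", version("thinc"))
    print("blis:", version("blis"))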

njsmith avatar Sep 23 '22 14:09 njsmith

Hi @danieldk and @njsmith, thanks for the quick responses. That's a nice idea re blis 0.7.8, we shall do that now and come back to you.

ascillitoe avatar Sep 26 '22 09:09 ascillitoe

I've run some more CI runs here. The results are as follows:

  • CI #430 to #460 is with thinc 8.1.1 and blis 0.7.8. There were no failures in 30 runs.
  • CI #462 to #491 is with thinc 8.1.1 and blis 0.9.1. There were 5 failures in 30 runs.

The error is rather intermittent, so this isn't 100% conclusive, but it does suggest an issue with more recent blis versions.

ascillitoe avatar Sep 26 '22 13:09 ascillitoe

Hi! We've released thinc 8.1.2 today which restricts blis to <0.8.0 again instead of <0.10.0. Thanks again for your report!

svlandeg avatar Sep 27 '22 14:09 svlandeg

Thanks for the update @svlandeg!

ascillitoe avatar Sep 27 '22 14:09 ascillitoe

@ascillitoe Could you reproduce the crash with the env variable BLIS_ARCH_DEBUG=1 set, and post the stdout output of the run? This will help us narrow down which BLIS kernel is faulty; you should see something along the lines of libblis: selecting sub-configuration 'haswell'. in the log.
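
If it's easier than editing the CI shell scripts, the variable can also be set from Python, as long as it happens before thinc/BLIS is first imported. A sketch, assuming nothing imports thinc earlier in the process:

    import os

    # BLIS reads this when the library is initialised, so it has to be set
    # before the first import that pulls in thinc/blis.
    os.environ["BLIS_ARCH_DEBUG"] = "1"

    import spacy

    nlp = spacy.load("en_core_web_md")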

shadeMe avatar Sep 30 '22 12:09 shadeMe

@ascillitoe Could you reproduce the crash with the env variable BLIS_ARCH_DEBUG=1 set, and post the stdout output of the run? This will help us narrow down which BLIS kernel is faulty; you should see something along the lines of libblis: selecting sub-configuration 'haswell'. in the log.

Sure thing, I'll have a go!

ascillitoe avatar Sep 30 '22 13:09 ascillitoe

Hi again. I've set some more CI runs going here (508 to 518).

Interestingly, it looks like all the runs that fail (like this one) have:

libblis: selecting sub-configuration 'haswell'.

whereas the runs that pass (like this one) have:

libblis: Hardware has 2 FMA units; using 'skx' sub-config.
libblis: selecting sub-configuration 'skx'.

So it looks like it might be the haswell kernels?
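
If it would help, we could also try hammering the BLIS binding directly on one of the failing (haswell-selecting) runners, to take spaCy and thinc out of the picture entirely. A rough sketch, assuming cython-blis's blis.py.gemm wrapper is the right entry point:

    import numpy as np
    from blis.py import gemm

    # Arbitrary shapes; single precision to match what the pipeline uses.
    A = np.random.rand(512, 300).astype("float32")
    B = np.random.rand(300, 96).astype("float32")

    for _ in range(10000):
        C = gemm(A, B)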

ascillitoe avatar Sep 30 '22 14:09 ascillitoe

Thanks for the quick response! Yeah, the OOB access is in one of the haswell kernels, which confirms our suspicion.

shadeMe avatar Sep 30 '22 14:09 shadeMe