Segmentation fault on RHEL8
Hi,
I am experiencing segmentation faults when running a large number of regression model building/prediction cycles in a loop. They appear seemingly at random: running the loop over the same data breaks at different iterations. Using PYTHONFAULTHANDLER=1 I was able to obtain a trace (see below) which suggests that the segmentation fault is caused by the call to x = torch.nn.functional.gelu(x). I was able to confirm that at runtime, but executing this line in isolation with a "bad" tensor does not fail (roughly as in the sketch below). There is a (closed) report of gelu causing a segmentation fault here: https://github.com/pytorch/pytorch/issues/78152 but it does not seem relevant. Any ideas/recommendations about the possible cause and a potential solution would be appreciated, thanks!
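For reference, the isolated check looked roughly like this (a sketch only; the dump file name is a placeholder, assuming the offending input was saved with torch.save from inside the MLP forward):

```python
import torch

# Hypothetical dump of the input that crashed inside mlp.py::_compute
x = torch.load("bad_tensor.pt", map_location="cuda")

with torch.no_grad():
    out = torch.nn.functional.gelu(x)  # does not crash when run on its own

print(out.shape, bool(torch.isnan(out).any()))
```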
Segmentation fault traceback
Fatal Python error: Segmentation fault
Thread 0x00007f33c752e700 (most recent call first):
File "/envs/tabpfn311/lib/python3.11/threading.py", line 331 in wait
File "/envs/tabpfn311/lib/python3.11/threading.py", line 629 in wait
File "/envs/tabpfn311/lib/python3.11/site-packages/tqdm/_monitor.py", line 60 in run
File "/envs/tabpfn311/lib/python3.11/threading.py", line 1045 in _bootstrap_inner
File "/envs/tabpfn311/lib/python3.11/threading.py", line 1002 in _bootstrap
Thread 0x00007f33c802f700 (most recent call first):
File "/envs/tabpfn311/lib/python3.11/concurrent/futures/thread.py", line 81 in _worker
File "/envs/tabpfn311/lib/python3.11/threading.py", line 982 in run
File "/envs/tabpfn311/lib/python3.11/threading.py", line 1045 in _bootstrap_inner
File "/envs/tabpfn311/lib/python3.11/threading.py", line 1002 in _bootstrap
Current thread 0x00007f35ec87e400 (most recent call first):
File "/dev/TabPFN/src/tabpfn/model/mlp.py", line 97 in _compute
File "/dev/TabPFN/src/tabpfn/model/memory.py", line 100 in method_
File "/dev/TabPFN/src/tabpfn/model/mlp.py", line 132 in forward
File "/envs/tabpfn311/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750 in _call_impl
File "/envs/tabpfn311/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739 in _wrapped_call_impl
File "/dev/TabPFN/src/tabpfn/model/layer.py", line 449 in forward
File "/envs/tabpfn311/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750 in _call_impl
File "/envs/tabpfn311/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739 in _wrapped_call_impl
File "/dev/TabPFN/src/tabpfn/model/transformer.py", line 89 in forward
File "/envs/tabpfn311/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750 in _call_impl
File "/envs/tabpfn311/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739 in _wrapped_call_impl
File "/dev/TabPFN/src/tabpfn/model/transformer.py", line 628 in _forward
File "/dev/TabPFN/src/tabpfn/model/transformer.py", line 416 in forward
File "/envs/tabpfn311/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750 in _call_impl
File "/envs/tabpfn311/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739 in _wrapped_call_impl
File "/dev/TabPFN/src/tabpfn/inference.py", line 327 in iter_outputs
File "/dev/TabPFN/src/tabpfn/regressor.py", line 624 in predict
File "/dev/TabPFN/pfn_loop.py", line 255 in main
File "/dev/TabPFN/pfn_loop.py", line 300 in <module>
OS: Red Hat Enterprise Linux 8.10 (Ootpa)
GPU: NVIDIA A40
Driver Version: 545.23.08
CUDA Version: 12.3
Python: 3.10 or 3.11 (fails with both)
torch==2.6.0
nvidia-cuda-runtime-cu12==12.4.127
TabPFN.git@634efcda5d545bff6740ece97fb66e74ee6df08c
@dhristozov Can you provide the script to reproduce this?
I have the same error: a segmentation fault when I run it on a Mac; on Windows, it crashes. It happens when I run inference using the same code as in the documentation:
predictions = clf.predict(x_test)
print("Accuracy", accuracy_score(y_test, predictions))
My data has 189 features and 3 target classes. The test set size is 400.
Thanks @BaosenZ for the report! Would you be able to provide a minimal reproducible example? It would help us a lot in reproducing and fixing this issue. If you could also run import tabpfn; tabpfn.display_debug_info() and share the output, that would be very helpful!
@LeoGrin For these problems, I think one improvement for reproducibility would be to suggest the uv script feature, where a single file includes the code to run along with all of its dependencies (including the Python version) in one place, see here. A rough sketch is shown below.
Curious to hear your thoughts on how this could be applied to this repo.
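For illustration, a self-contained repro script could look roughly like this (a sketch only; the pinned versions and the synthetic data are placeholders, not taken from this report), runnable with uv run repro.py:

```python
# /// script
# requires-python = ">=3.11"
# dependencies = [
#   "numpy",
#   "torch==2.6.0",
#   "tabpfn",
# ]
# ///
"""Hypothetical self-contained reproduction script (PEP 723 inline metadata)."""
import numpy as np
from tabpfn import TabPFNRegressor

# Synthetic placeholder data; a real repro would use the data that triggers the crash.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X.sum(axis=1) + 0.1 * rng.normal(size=200)

reg = TabPFNRegressor()
reg.fit(X[:150], y[:150])
print(reg.predict(X[150:])[:5])
```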
@mert-kurttutan thanks for the response, I'll try to create an example I can share; the random nature of the segfaults makes it hard, as the same input will happily run on repeat.
@dhristozov To get proper stack trace info from the C-extension side of PyTorch, it is best to use gdb to debug it.
Here is a pointer to get you started: https://developers.redhat.com/articles/2021/09/08/debugging-python-c-extensions-gdb#debugging_with_gbd_in_python_3_9
In particular, look at where in the C extension the crash happens right before the segfault; it might be in the cuDNN binaries. A rough example of the workflow is shown below.
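For example, the basic gdb workflow from the article looks roughly like this (a sketch; py-bt needs the Python debuginfo/gdb helpers described in the article):

```
gdb --args python pfn_loop.py
(gdb) run
# ... wait for the segfault, then:
(gdb) bt      # native (C/C++) backtrace
(gdb) py-bt   # corresponding Python-level frames
```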
Regarding the randomness, you can follow the points here: https://www.geeksforgeeks.org/reproducibility-in-pytorch/ (a sketch is included below).
Also, as a last resort and if you have the time, you can try making CUDA run synchronously, e.g. with torch.cuda.synchronize(). But this is tricky to do correctly and probably won't be that important for debugging.
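For reference, the usual reproducibility settings look roughly like this (a sketch; the seed value is arbitrary):

```python
import os
import random

import numpy as np
import torch

def seed_everything(seed: int = 0) -> None:
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)  # seeds CPU and all CUDA devices
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    # Needed by some deterministic cuBLAS kernels:
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
    torch.use_deterministic_algorithms(True)

seed_everything(0)
```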
Closing due to unresponsiveness