
Segmentation fault on RHEL8

Open dhristozov opened this issue 10 months ago • 7 comments

Hi,

I am experiencing segmentation faults when running a large number of regression model building/prediction cycles in a loop. They appear seemingly at random: running the loop over the same data will break at different iterations. Using PYTHONFAULTHANDLER=1 I was able to obtain a trace (see below) which suggests that the segmentation fault is caused by the call to x = torch.nn.functional.gelu(x). I was able to confirm that at runtime, but executing this line in isolation with a "bad" tensor does not fail. There is a (closed) report of gelu causing a segmentation fault here https://github.com/pytorch/pytorch/issues/78152, but it does not seem relevant. Any ideas/recommendations about the possible cause and potential solution would be appreciated, thanks!
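
For context, the failing loop has roughly the shape sketched below; the real pfn_loop.py is not shared here, so the data shapes and the TabPFNRegressor constructor arguments are illustrative assumptions only:

import faulthandler
faulthandler.enable()  # same effect as PYTHONFAULTHANDLER=1

import numpy as np
from tabpfn import TabPFNRegressor

rng = np.random.default_rng(0)
for i in range(1000):
    # Fresh synthetic regression task each iteration (placeholder for the real data)
    X = rng.normal(size=(500, 30))
    y = X @ rng.normal(size=30) + rng.normal(scale=0.1, size=500)
    reg = TabPFNRegressor(device="cuda")
    reg.fit(X[:400], y[:400])
    preds = reg.predict(X[400:])
    print(i, float(np.mean(preds)))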

Segmentation fault traceback

Fatal Python error: Segmentation fault

Thread 0x00007f33c752e700 (most recent call first):
  File "/envs/tabpfn311/lib/python3.11/threading.py", line 331 in wait
  File "/envs/tabpfn311/lib/python3.11/threading.py", line 629 in wait
  File "/envs/tabpfn311/lib/python3.11/site-packages/tqdm/_monitor.py", line 60 in run
  File "/envs/tabpfn311/lib/python3.11/threading.py", line 1045 in _bootstrap_inner
  File "/envs/tabpfn311/lib/python3.11/threading.py", line 1002 in _bootstrap

Thread 0x00007f33c802f700 (most recent call first):
  File "/envs/tabpfn311/lib/python3.11/concurrent/futures/thread.py", line 81 in _worker
  File "/envs/tabpfn311/lib/python3.11/threading.py", line 982 in run
  File "/envs/tabpfn311/lib/python3.11/threading.py", line 1045 in _bootstrap_inner
  File "/envs/tabpfn311/lib/python3.11/threading.py", line 1002 in _bootstrap

Current thread 0x00007f35ec87e400 (most recent call first):
  File "/dev/TabPFN/src/tabpfn/model/mlp.py", line 97 in _compute
  File "/dev/TabPFN/src/tabpfn/model/memory.py", line 100 in method_
  File "/dev/TabPFN/src/tabpfn/model/mlp.py", line 132 in forward
  File "/envs/tabpfn311/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750 in _call_impl
  File "/envs/tabpfn311/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739 in _wrapped_call_impl
  File "/dev/TabPFN/src/tabpfn/model/layer.py", line 449 in forward
  File "/envs/tabpfn311/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750 in _call_impl
  File "/envs/tabpfn311/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739 in _wrapped_call_impl
  File "/dev/TabPFN/src/tabpfn/model/transformer.py", line 89 in forward
  File "/envs/tabpfn311/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750 in _call_impl
  File "/envs/tabpfn311/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739 in _wrapped_call_impl
  File "/dev/TabPFN/src/tabpfn/model/transformer.py", line 628 in _forward
  File "/dev/TabPFN/src/tabpfn/model/transformer.py", line 416 in forward
  File "/envs/tabpfn311/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750 in _call_impl
  File "/envs/tabpfn311/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739 in _wrapped_call_impl
  File "/dev/TabPFN/src/tabpfn/inference.py", line 327 in iter_outputs
  File "/dev/TabPFN/src/tabpfn/regressor.py", line 624 in predict
  File "/dev/TabPFN/pfn_loop.py", line 255 in main
  File "/dev/TabPFN/pfn_loop.py", line 300 in <module>

OS: Red Hat Enterprise Linux 8.10 (Ootpa)
GPU: NVIDIA A40
Driver Version: 545.23.08 
CUDA Version: 12.3
Python: 3.10 or 3.11 (fails with both)
torch==2.6.0
nvidia-cuda-runtime-cu12==12.4.127
TabPFN.git@634efcda5d545bff6740ece97fb66e74ee6df08c

dhristozov avatar Feb 18 '25 18:02 dhristozov

@dhristozov Can you provide the script to reproduce this?

mert-kurttutan avatar Feb 26 '25 14:02 mert-kurttutan

I have the same error: a segmentation fault when I run it on Mac; on Windows, it crashes. It happens when I run inference using the same code as in the documentation:

from sklearn.metrics import accuracy_score

predictions = clf.predict(x_test)
print("Accuracy", accuracy_score(y_test, predictions))

I have 189 features and I am predicting 3 classes. The test set size is 400.
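
For reference, the documentation-style setup being described is roughly the following minimal sketch; the random data, shapes, and the bare TabPFNClassifier() call are placeholders standing in for the actual code and dataset:

import numpy as np
from sklearn.metrics import accuracy_score
from tabpfn import TabPFNClassifier

rng = np.random.default_rng(0)
x_train, y_train = rng.normal(size=(1000, 189)), rng.integers(0, 3, size=1000)
x_test, y_test = rng.normal(size=(400, 189)), rng.integers(0, 3, size=400)

clf = TabPFNClassifier()
clf.fit(x_train, y_train)
predictions = clf.predict(x_test)
print("Accuracy", accuracy_score(y_test, predictions))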

BaosenZ avatar Mar 01 '25 18:03 BaosenZ

Thanks @BaosenZ for the report! Would you be able to provide a minimal reproducible example? It would help us a lot with reproducing and fixing this issue. If you could also run import tabpfn; tabpfn.display_debug_info() and share the output, that would be very helpful!

LeoGrin avatar Mar 03 '25 16:03 LeoGrin

@LeoGrin For these problems, I think one improvement for reproducibility would be to suggest the uv script feature, where one file includes the code to run along with all of the dependencies (including the Python version) in just one place, see here.

Curious to hear your thoughts on how this could be applied to this repo.
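
For illustration, such a self-contained script could open with a PEP 723 inline metadata block, which uv reads when running the file; the pins below are placeholders rather than a recommended set:

# /// script
# requires-python = ">=3.11"
# dependencies = [
#     "tabpfn",
#     "torch==2.6.0",
#     "scikit-learn",
#     "numpy",
# ]
# ///
# ... reproduction code goes below this header ...

A reporter would then run it with something like uv run repro.py (repro.py being a hypothetical filename), and uv resolves the pinned environment before executing the reproduction code.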

mert-kurttutan avatar Mar 03 '25 20:03 mert-kurttutan

@mert-kurttutan thanks for the response. I'll try to create an example I can share; the random nature of the segfaults makes it hard, as the same input will happily run on repeat.

dhristozov avatar Mar 03 '25 20:03 dhristozov

@dhristozov To get proper stack trace info from the C extension side of PyTorch, you are better off using gdb to debug it.

Here is a pointer to get you started: https://developers.redhat.com/articles/2021/09/08/debugging-python-c-extensions-gdb#debugging_with_gbd_in_python_3_9

In particular, you should look at where the C extension code ends up right before the segfault; it might be in the cuDNN binaries.
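
A typical session would look roughly like this, assuming the interpreter's debug symbols and the python-gdb extensions described in that article are installed (py-bt is unavailable without them):

gdb --args python pfn_loop.py
(gdb) run       # run until the SIGSEGV is hit
(gdb) bt        # C-level backtrace at the crash site
(gdb) py-bt     # matching Python-level backtrace (requires the python-gdb extensions)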

mert-kurttutan avatar Mar 03 '25 20:03 mert-kurttutan

Regarding the randomness, you can follow the points here: https://www.geeksforgeeks.org/reproducibility-in-pytorch/
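
Concretely, the usual knobs from that kind of guide look roughly like this; this is a generic PyTorch sketch, and whether TabPFN adds its own seeding on top is not covered here:

import random
import numpy as np
import torch

SEED = 0
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)
# Trade speed for determinism in cuDNN/CUDA kernels
torch.backends.cudnn.benchmark = False
torch.use_deterministic_algorithms(True)  # may also require CUBLAS_WORKSPACE_CONFIG=:4096:8 in the environment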

Also, if you have the time, as a last resort you can try making CUDA run synchronously, e.g. with torch.cuda.synchronize(). But this is tricky to do correctly and probably won't be as important for debugging.
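
For what it's worth, the idea would be to flush the CUDA queue right after each suspect call so a failure surfaces near its real origin; reg and X_test below are placeholders from a hypothetical loop body:

preds = reg.predict(X_test)
if torch.cuda.is_available():
    torch.cuda.synchronize()  # block until all queued CUDA work has finished, so errors surface here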

mert-kurttutan avatar Mar 03 '25 21:03 mert-kurttutan

Closing due to unresponsiveness

noahho avatar Jun 25 '25 20:06 noahho