nanoGPT

Signal: Segmentation fault

Open tombenj opened this issue 2 years ago • 2 comments

Getting this segmentation fault when running train.py:

[129-213-18-253:70544] *** Process received signal ***
[129-213-18-253:70544] Signal: Segmentation fault (11)
[129-213-18-253:70544] Signal code: Address not mapped (1)
[129-213-18-253:70544] Failing at address: 0xb120f18
Traceback (most recent call last):
  File "train.py", line 276, in <module>
    loss.backward()
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/_tensor.py", line 488, in backward
    torch.autograd.backward(
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/autograd/function.py", line 275, in apply
    return user_fn(self, *args)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/_functorch/aot_autograd.py", line 1905, in backward
    out = call_compiled_backward()
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/_functorch/aot_autograd.py", line 1876, in call_compiled_backward
    CompiledFunction.compiled_bw = aot_config.bw_compiler(
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/_dynamo/optimizations/training.py", line 64, in _wrapped_bw_compiler
    return eval_frame.disable(eval_frame.disable(bw_compiler)(*args, **kwargs))
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/_dynamo/eval_frame.py", line 211, in _fn
    return fn(*args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/_dynamo/utils.py", line 160, in time_wrapper
    r = func(*args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/_inductor/compile_fx.py", line 399, in bw_compiler
    return inner_compile(
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/_dynamo/debug_utils.py", line 586, in debug_wrapper
    compiled_fn = compiler_fn(gm, example_inputs, **kwargs)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/_inductor/debug.py", line 239, in inner
    return fn(*args, **kwargs)
  File "/usr/lib/python3.8/contextlib.py", line 75, in inner
    return func(*args, **kwds)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/_inductor/compile_fx.py", line 151, in compile_fx_inner
    compiled_fn = graph.compile_to_fn()
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/_inductor/graph.py", line 560, in compile_to_fn
    return self.compile_to_module().call
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/_dynamo/utils.py", line 160, in time_wrapper
    r = func(*args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/_inductor/graph.py", line 549, in compile_to_module
    mod = PyCodeCache.load(code)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/_inductor/codecache.py", line 504, in load
    exec(code, mod.__dict__, mod.__dict__)
  File "/tmp/torchinductor_ubuntu/of/cofcojqn7s4ev5onzwbpxe6r77y2zy2yz7jxxqdpokdb3cojn2yb.py", line 1417, in <module>
    async_compile.wait(globals())
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/_inductor/codecache.py", line 691, in wait
    scope[key] = result.result()
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/_inductor/codecache.py", line 549, in result
    self.future.result()
  File "/usr/lib/python3.8/concurrent/futures/_base.py", line 444, in result

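The crash happens inside torch._inductor's compiled backward pass, so a common first check (not confirmed as the fix for this report) is to run train.py with compilation disabled via nanoGPT's compile flag, i.e. python train.py --compile=False. A minimal self-contained sketch of that switch, using a toy module in place of nanoGPT's GPT model:

import torch
import torch.nn as nn

# Toy stand-in for nanoGPT's GPT model; the point is only to show where
# the compile switch sits (in the repo it is the `compile` config value,
# overridable on the command line as --compile=False).
use_compile = False

model = nn.Linear(8, 8)
if use_compile and hasattr(torch, "compile"):
    # torch.compile routes the backward pass through torch._inductor,
    # which is the code path that segfaults in the traceback above.
    model = torch.compile(model)

x = torch.randn(4, 8)
loss = model(x).sum()
loss.backward()  # runs eagerly when compilation is skipped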
There are also conflicts with numpy versions: I had to force-install numpy 1.21 instead of 1.24 after getting: AttributeError: module 'numpy' has no attribute 'typeDict'
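For reference, numpy removed the deprecated typeDict alias in 1.24, which is why code that still references it fails on newer numpy; pinning to an older release (for example numpy<1.24, as done above) works around it. A quick check of which side of the removal an environment is on:

import numpy as np

# np.typeDict (a deprecated alias of np.sctypeDict) was removed in
# numpy 1.24, so anything still referencing it raises AttributeError there.
print(np.__version__)
print(hasattr(np, "typeDict"))  # True on 1.21, False on 1.24+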

I don't know if this is because I prepared the data using PyTorch 1.4 and then switched to PyTorch 2 only for the training stage.

tombenj · Jan 30 '23 09:01

Well, it's not because I prepared it using a different PyTorch version. It just fails. Any thoughts?

tombenj · Feb 03 '23 10:02

I distilled the issue down to the fact that I was running the training on 1 GPU. It works with more GPUs for some reason.

tombenj · Feb 04 '23 18:02