nanoGPT
Signal: Segmentation fault
Getting this segmentation fault when running train.py:
[129-213-18-253:70544] *** Process received signal ***
[129-213-18-253:70544] Signal: Segmentation fault (11)
[129-213-18-253:70544] Signal code: Address not mapped (1)
[129-213-18-253:70544] Failing at address: 0xb120f18
Traceback (most recent call last):
  File "train.py", line 276, in <module>
    loss.backward()
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/_tensor.py", line 488, in backward
    torch.autograd.backward(
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/autograd/function.py", line 275, in apply
    return user_fn(self, *args)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/_functorch/aot_autograd.py", line 1905, in backward
    out = call_compiled_backward()
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/_functorch/aot_autograd.py", line 1876, in call_compiled_backward
    CompiledFunction.compiled_bw = aot_config.bw_compiler(
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/_dynamo/optimizations/training.py", line 64, in _wrapped_bw_compiler
    return eval_frame.disable(eval_frame.disable(bw_compiler)(*args, **kwargs))
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/_dynamo/eval_frame.py", line 211, in _fn
    return fn(*args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/_dynamo/utils.py", line 160, in time_wrapper
    r = func(*args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/_inductor/compile_fx.py", line 399, in bw_compiler
    return inner_compile(
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/_dynamo/debug_utils.py", line 586, in debug_wrapper
    compiled_fn = compiler_fn(gm, example_inputs, **kwargs)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/_inductor/debug.py", line 239, in inner
    return fn(*args, **kwargs)
  File "/usr/lib/python3.8/contextlib.py", line 75, in inner
    return func(*args, **kwds)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/_inductor/compile_fx.py", line 151, in compile_fx_inner
    compiled_fn = graph.compile_to_fn()
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/_inductor/graph.py", line 560, in compile_to_fn
    return self.compile_to_module().call
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/_dynamo/utils.py", line 160, in time_wrapper
    r = func(*args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/_inductor/graph.py", line 549, in compile_to_module
    mod = PyCodeCache.load(code)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/_inductor/codecache.py", line 504, in load
    exec(code, mod.__dict__, mod.__dict__)
  File "/tmp/torchinductor_ubuntu/of/cofcojqn7s4ev5onzwbpxe6r77y2zy2yz7jxxqdpokdb3cojn2yb.py", line 1417, in <module>
    async_compile.wait(globals())
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/_inductor/codecache.py", line 691, in wait
    scope[key] = result.result()
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/_inductor/codecache.py", line 549, in result
    self.future.result()
  File "/usr/lib/python3.8/concurrent/futures/_base.py", line 444, in result
There are also conflicts with NumPy versions: I had to force-install numpy 1.21 instead of 1.24 after getting:
AttributeError: module 'numpy' has no attribute 'typeDict'
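For context on that error: np.typeDict was an old alias of np.sctypeDict that was deprecated in NumPy 1.21 and removed in 1.24, so any dependency that still references it breaks on 1.24. Pinning numpy below 1.24 (as done here) is the clean fix; as a stopgap, a small shim can restore the alias before the offending import runs. This is only a sketch of a workaround, not an endorsed fix:

```python
import numpy as np

# np.typeDict was a deprecated alias of np.sctypeDict, removed in
# NumPy 1.24; restoring it lets older libraries that still reference
# it import cleanly. The hasattr guards make this a no-op on versions
# where either name is already present or permanently gone.
if not hasattr(np, "typeDict") and hasattr(np, "sctypeDict"):
    np.typeDict = np.sctypeDict
```

The shim must run before importing whatever library triggers the AttributeError.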
I don't know if this is because I prepared the data using PyTorch 1.4 and then switched to PyTorch 2 only for the training stage.
Well, it's not because I prepared the data with a different PyTorch version; it fails either way. Any thoughts?
I narrowed the issue down to running the training on a single GPU. It works with multiple GPUs for some reason.
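Since the traceback dies inside Inductor's backward compile (async_compile.wait in the generated /tmp/torchinductor_ubuntu module), one thing worth trying on the single-GPU run is disabling torch.compile entirely. nanoGPT's train.py accepts config overrides on the command line, so (assuming a recent checkout) something like this runs the same training eagerly:

```shell
# Skip torch.compile so the backward pass never goes through Inductor.
# Slower per step, but it isolates whether the segfault is compile-related.
python train.py --compile=False
```

If eager mode runs cleanly on one GPU, that points at the Inductor compile step rather than the training code itself.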