nanoGPT
Error when I try nanoGPT on GPUs from Kaggle
Ref: https://www.kaggle.com/code/bkowshik/karpathy-nanogpt/
I'm trying to get nanoGPT up and running on the GPUs provided by Kaggle, and I get the following errors:
GPU - P100
$ python train.py config/train_shakespeare_char.py
RuntimeError: Found Tesla P100-PCIE-16GB which is too old to be supported by the triton GPU compiler, which is used as the backend. Triton only supports devices of CUDA Capability >= 7.0, but your device is of CUDA capability 6.0
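This error is expected on Pascal-class cards: the P100 has CUDA compute capability 6.0, and Triton (the backend torch.compile() uses via Inductor) requires 7.0 or newer. A minimal sketch to confirm what the active GPU reports, assuming a CUDA-enabled PyTorch build:

>>> import torch
>>> # Triton needs capability >= (7, 0); a P100 reports (6, 0).
>>> torch.cuda.get_device_capability()
(6, 0)

On such a device the practical option is to skip compilation entirely (nanoGPT exposes a compile flag for this; see below) rather than expect Triton to work.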
GPU - T4 x 2
Dependencies
# Version of PyTorch.
>>> import torch
>>> print(torch.__version__)
2.0.0
# Version of CUDA.
!nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
Error
[2023-07-06 05:02:42,006] torch._inductor.graph: [INFO] Using FallbackKernel: torch.ops.aten._scaled_dot_product_flash_attention.default
/usr/bin/ld: cannot find -lcuda: No such file or directory
/usr/bin/ld: cannot find -lcuda: No such file or directory
collect2: error: ld returned 1 exit status
subprocess.CalledProcessError: Command '['/usr/bin/gcc', '/tmp/tmprh0gq_rd/main.c', '-O3', '-I/usr/local/cuda/include', '-I/opt/conda/include/python3.10', '-I/tmp/tmprh0gq_rd', '-shared', '-fPIC', '-lcuda', '-o', '/tmp/tmprh0gq_rd/triton_.cpython-310-x86_64-linux-gnu.so']' returned non-zero exit status 1.
The above exception was the direct cause of the following exception:
/usr/bin/ld: cannot find -lcuda: No such file or directory
collect2: error: ld returned 1 exit status
The full error message is here:
[2023-07-06 08:50:19,371] torch._inductor.graph: [INFO] Using FallbackKernel: torch.ops.aten._scaled_dot_product_flash_attention.default
/usr/bin/ld: cannot find -lcuda: No such file or directory
/usr/bin/ld: cannot find -lcuda: No such file or directory
collect2: error: ld returned 1 exit status
collect2: error: ld returned 1 exit status
concurrent.futures.process._RemoteTraceback:
"""
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/concurrent/futures/process.py", line 246, in _process_worker
r = call_item.fn(*call_item.args, **call_item.kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/_inductor/codecache.py", line 549, in _worker_compile
kernel.precompile(warm_cache_only_with_cc=cc)
File "/opt/conda/lib/python3.10/site-packages/torch/_inductor/triton_ops/autotune.py", line 69, in precompile
self.launchers = [
File "/opt/conda/lib/python3.10/site-packages/torch/_inductor/triton_ops/autotune.py", line 70, in <listcomp>
self._precompile_config(c, warm_cache_only_with_cc)
File "/opt/conda/lib/python3.10/site-packages/torch/_inductor/triton_ops/autotune.py", line 83, in _precompile_config
triton.compile(
File "/opt/conda/lib/python3.10/site-packages/triton/compiler.py", line 1588, in compile
so_path = make_stub(name, signature, constants)
File "/opt/conda/lib/python3.10/site-packages/triton/compiler.py", line 1477, in make_stub
so = _build(name, src_path, tmpdir)
File "/opt/conda/lib/python3.10/site-packages/triton/compiler.py", line 1392, in _build
ret = subprocess.check_call(cc_cmd)
File "/opt/conda/lib/python3.10/subprocess.py", line 369, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/usr/bin/gcc', '/tmp/tmpma68p_sy/main.c', '-O3', '-I/usr/local/cuda/include', '-I/opt/conda/include/python3.10', '-I/tmp/tmpma68p_sy', '-shared', '-fPIC', '-lcuda', '-o', '/tmp/tmpma68p_sy/triton_.cpython-310-x86_64-linux-gnu.so']' returned non-zero exit status 1.
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/output_graph.py", line 670, in call_user_compiler
compiled_fn = compiler_fn(gm, self.fake_example_inputs())
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/debug_utils.py", line 1055, in debug_wrapper
compiled_gm = compiler_fn(gm, example_inputs)
File "/opt/conda/lib/python3.10/site-packages/torch/__init__.py", line 1390, in __call__
return compile_fx(model_, inputs_, config_patches=self.config)
File "/opt/conda/lib/python3.10/site-packages/torch/_inductor/compile_fx.py", line 455, in compile_fx
return aot_autograd(
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/backends/common.py", line 48, in compiler_fn
cg = aot_module_simplified(gm, example_inputs, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/_functorch/aot_autograd.py", line 2805, in aot_module_simplified
compiled_fn = create_aot_dispatcher_function(
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/utils.py", line 163, in time_wrapper
r = func(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/_functorch/aot_autograd.py", line 2498, in create_aot_dispatcher_function
compiled_fn = compiler_fn(flat_fn, fake_flat_args, aot_config)
File "/opt/conda/lib/python3.10/site-packages/torch/_functorch/aot_autograd.py", line 1713, in aot_wrapper_dedupe
return compiler_fn(flat_fn, leaf_flat_args, aot_config)
File "/opt/conda/lib/python3.10/site-packages/torch/_functorch/aot_autograd.py", line 1326, in aot_dispatch_base
compiled_fw = aot_config.fw_compiler(fw_module, flat_args_with_views_handled)
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/utils.py", line 163, in time_wrapper
r = func(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/_inductor/compile_fx.py", line 430, in fw_compiler
return inner_compile(
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/debug_utils.py", line 595, in debug_wrapper
compiled_fn = compiler_fn(gm, example_inputs)
File "/opt/conda/lib/python3.10/site-packages/torch/_inductor/debug.py", line 239, in inner
return fn(*args, **kwargs)
File "/opt/conda/lib/python3.10/contextlib.py", line 79, in inner
return func(*args, **kwds)
File "/opt/conda/lib/python3.10/site-packages/torch/_inductor/compile_fx.py", line 177, in compile_fx_inner
compiled_fn = graph.compile_to_fn()
File "/opt/conda/lib/python3.10/site-packages/torch/_inductor/graph.py", line 586, in compile_to_fn
return self.compile_to_module().call
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/utils.py", line 163, in time_wrapper
r = func(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/_inductor/graph.py", line 575, in compile_to_module
mod = PyCodeCache.load(code)
File "/opt/conda/lib/python3.10/site-packages/torch/_inductor/codecache.py", line 528, in load
exec(code, mod.__dict__, mod.__dict__)
File "/tmp/torchinductor_root/ut/cutluoytkpipzjefaid23t4u36owtuzeuzqqktx6wnp6je4gjbjx.py", line 827, in <module>
async_compile.wait(globals())
File "/opt/conda/lib/python3.10/site-packages/torch/_inductor/codecache.py", line 715, in wait
scope[key] = result.result()
File "/opt/conda/lib/python3.10/site-packages/torch/_inductor/codecache.py", line 573, in result
self.future.result()
File "/opt/conda/lib/python3.10/concurrent/futures/_base.py", line 458, in result
return self.__get_result()
File "/opt/conda/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
raise self._exception
subprocess.CalledProcessError: Command '['/usr/bin/gcc', '/tmp/tmpma68p_sy/main.c', '-O3', '-I/usr/local/cuda/include', '-I/opt/conda/include/python3.10', '-I/tmp/tmpma68p_sy', '-shared', '-fPIC', '-lcuda', '-o', '/tmp/tmpma68p_sy/triton_.cpython-310-x86_64-linux-gnu.so']' returned non-zero exit status 1.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/kaggle/working/nanoGPT/train.py", line 261, in <module>
losses = estimate_loss()
File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/kaggle/working/nanoGPT/train.py", line 221, in estimate_loss
logits, loss = model(X, Y)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 82, in forward
return self.dynamo_ctx(self._orig_mod.forward)(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 209, in _fn
return fn(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 337, in catch_errors
return callback(frame, cache_size, hooks)
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 404, in _convert_frame
result = inner_convert(frame, cache_size, hooks)
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 104, in _fn
return fn(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 262, in _convert_frame_assert
return _compile(
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/utils.py", line 163, in time_wrapper
r = func(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 324, in _compile
out_code = transform_code_object(code, transform)
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/bytecode_transformation.py", line 445, in transform_code_object
transformations(instructions, code_options)
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 311, in transform
tracer.run()
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 1726, in run
super().run()
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 576, in run
and self.step()
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 540, in step
getattr(self, inst.opname)(inst)
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 1792, in RETURN_VALUE
self.output.compile_subgraph(
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/output_graph.py", line 541, in compile_subgraph
self.compile_and_call_fx_graph(tx, pass2.graph_output_vars(), root)
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/output_graph.py", line 588, in compile_and_call_fx_graph
compiled_fn = self.call_user_compiler(gm)
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/utils.py", line 163, in time_wrapper
r = func(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/output_graph.py", line 675, in call_user_compiler
raise BackendCompilerFailed(self.compiler_fn, e) from e
torch._dynamo.exc.BackendCompilerFailed: debug_wrapper raised CalledProcessError: Command '['/usr/bin/gcc', '/tmp/tmpma68p_sy/main.c', '-O3', '-I/usr/local/cuda/include', '-I/opt/conda/include/python3.10', '-I/tmp/tmpma68p_sy', '-shared', '-fPIC', '-lcuda', '-o', '/tmp/tmpma68p_sy/triton_.cpython-310-x86_64-linux-gnu.so']' returned non-zero exit status 1.
Set torch._dynamo.config.verbose=True for more information
You can suppress this exception and fall back to eager by setting:
torch._dynamo.config.suppress_errors = True
/usr/bin/ld: cannot find -lcuda: No such file or directory
collect2: error: ld returned 1 exit status
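The root cause here is that the gcc command Triton spawns passes -lcuda, but the linker cannot find a libcuda.so in its search path (driver images often ship only libcuda.so.1, without the dev symlink). One possible workaround, a sketch assuming the Kaggle image puts the CUDA toolkit's driver stub at the path below, is to point gcc's LIBRARY_PATH at the stubs directory before any compilation is triggered (e.g. at the top of the notebook or train.py):

import os

# Hypothetical workaround, untested on Kaggle: gcc honours the LIBRARY_PATH
# environment variable when resolving -l flags, so prepend the CUDA driver
# stub directory. The stubs path is an assumption about this image; check
# that it exists before relying on it.
stubs = "/usr/local/cuda/lib64/stubs"
if os.path.isdir(stubs):
    os.environ["LIBRARY_PATH"] = stubs + ":" + os.environ.get("LIBRARY_PATH", "")

If the stubs directory isn't present, disabling torch.compile() (as suggested in the reply below) sidesteps the Triton build step altogether.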
Did you fix it? If not, I think it's an error with torch.compile(). Try setting compile = False and try again. I have a simplified, working implementation of nanoGPT on Kaggle, based on the original lecture.
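For reference, nanoGPT's configurator also accepts the flag on the command line, so the run from the top of this issue would become:

$ python train.py config/train_shakespeare_char.py --compile=False

This keeps training in eager mode and avoids both the P100 capability check and the Triton/gcc link step on the T4s, at some cost in speed.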