Using SUPPORT_BF16=0 with llama3.py leads to a segmentation fault
Example:
SUPPORT_BF16=0 CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6 python3 examples/llama3.py --download_model --shard 7 --size 8B
seed = 1730782018
0%| | 0/292 [00:00<?, ?it/s]
Segmentation fault
The SUPPORT_BF16=0 flag was an attempt to work around a separate issue: without the flag, the same command fails with an NVRTC compile error instead:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6 python3 examples/llama3.py --download_model --shard 7 --size 8B
seed = 1730778505
0%| | 0/292 [00:00<?, ?it/s]
loaded weights in 315.08 ms, 0.03 GB loaded at 0.11 GB/s
Traceback (most recent call last):
  File "/home/user/tinygrad/examples/llama3.py", line 263, in <module>
    model = build_transformer(args.model, model_size=args.size, quantize=args.quantize, device=device)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/tinygrad/examples/llama3.py", line 186, in build_transformer
    load_state_dict(model, weights, strict=False, consume=True)
  File "/home/user/tinygrad/tinygrad/nn/state.py", line 128, in load_state_dict
    else: v.replace(state_dict[k].shard(mlb.device, mlb.axis)).realize()
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/tinygrad/tinygrad/tensor.py", line 3602, in _wrapper
    ret = fn(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^
  File "/home/user/tinygrad/tinygrad/tensor.py", line 223, in realize
    run_schedule(*self.schedule_with_vars(*lst), do_update_stats=do_update_stats)
  File "/home/user/tinygrad/tinygrad/engine/realize.py", line 214, in run_schedule
    for ei in lower_schedule(schedule):
  File "/home/user/tinygrad/tinygrad/engine/realize.py", line 207, in lower_schedule
    raise e
  File "/home/user/tinygrad/tinygrad/engine/realize.py", line 201, in lower_schedule
    try: yield lower_schedule_item(si)
               ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/tinygrad/tinygrad/engine/realize.py", line 186, in lower_schedule_item
    runner = get_runner(si.outputs[0].device, si.ast)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/tinygrad/tinygrad/engine/realize.py", line 154, in get_runner
    method_cache[ckey] = method_cache[bkey] = ret = CompiledRunner(replace(prg, dname=dname))
                                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/tinygrad/tinygrad/engine/realize.py", line 81, in __init__
    self.lib:bytes = precompiled if precompiled is not None else Device[p.dname].compiler.compile_cached(p.src)
                                                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/tinygrad/tinygrad/device.py", line 187, in compile_cached
    lib = self.compile(src)
          ^^^^^^^^^^^^^^^^^
  File "/home/user/tinygrad/tinygrad/runtime/support/compiler_cuda.py", line 60, in compile
    def compile(self, src:str) -> bytes: return self._compile_program(src, nvrtc.nvrtcGetPTX, nvrtc.nvrtcGetPTXSize)
                                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/tinygrad/tinygrad/runtime/support/compiler_cuda.py", line 56, in _compile_program
    nvrtc_check(nvrtc.nvrtcCompileProgram(prog, len(self.compile_options), to_char_p_p([o.encode() for o in self.compile_options])), prog)
  File "/home/user/tinygrad/tinygrad/runtime/support/compiler_cuda.py", line 16, in nvrtc_check
    raise CompileError(f"Nvrtc Error {status}, {ctypes.string_at(nvrtc.nvrtcGetErrorString(status)).decode()}\n{err_log}")
tinygrad.device.CompileError: Nvrtc Error 6, NVRTC_ERROR_COMPILATION
<null>(12): error: more than one user-defined conversion from "nv_bfloat16" to "half" applies:
            function "__half::__half(float)"
/usr/include/cuda_fp16.hpp(201): here
            function "__half::__half(short)"
/usr/include/cuda_fp16.hpp(214): here
            function "__half::__half(unsigned short)"
/usr/include/cuda_fp16.hpp(215): here
            function "__half::__half(int)"
/usr/include/cuda_fp16.hpp(216): here
            function "__half::__half(unsigned int)"
/usr/include/cuda_fp16.hpp(217): here
            function "__half::__half(long long)"
/usr/include/cuda_fp16.hpp(218): here
            function "__half::__half(unsigned long long)"
/usr/include/cuda_fp16.hpp(219): here
1 error detected in the compilation of "<null>".
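
For context on the error: the candidate list above suggests the generated kernel performs an implicit nv_bfloat16-to-half conversion. __nv_bfloat16 exposes user-defined conversions to float and to several integer types, and __half has a constructor for each of those, so overload resolution finds no single best path. Below is a minimal standalone sketch of the same ambiguity; the file name, kernel name, and build line are hypothetical (the actual failing kernel is generated by tinygrad's CUDA backend):

```cuda
// ambiguous_cast.cu -- illustrative sketch only, not tinygrad's generated code.
// Build (sketch): nvcc -arch=sm_80 -c ambiguous_cast.cu
#include <cuda_fp16.h>
#include <cuda_bf16.h>

__global__ void bf16_to_half(const __nv_bfloat16 *in, __half *out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    // out[i] = in[i];  // error: __nv_bfloat16 converts to float, short, int, ...
    //                  // and __half has a constructor for each, so the implicit
    //                  // conversion is ambiguous (the NVRTC error quoted above).
    out[i] = __float2half(__bfloat162float(in[i]));  // explicit route through float
  }
}
```

If this reading is right, forcing the conversion through an explicit float intermediate in the emitted kernel would resolve the NVRTC failure, and SUPPORT_BF16=0 merely sidesteps that compile error before running into the separate segmentation fault reported above.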