Using SUPPORT_BF16=0 with llama3.py leads to a segmentation fault
Example:
SUPPORT_BF16=0 CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6 python3 examples/llama3.py --download_model --shard 7 --size 8B
seed = 1730782018
0%| | 0/292 [00:00<?, ?it/s]
Segmentation fault
The SUPPORT_BF16=0 flag was an attempt to work around a separate issue: without the flag, the same command fails with an NVRTC compile error instead:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6 python3 examples/llama3.py --download_model --shard 7 --size 8B
seed = 1730778505
0%| | 0/292 [00:00<?, ?it/s]
loaded weights in 315.08 ms, 0.03 GB loaded at 0.11 GB/s
Traceback (most recent call last):
  File "/home/user/tinygrad/examples/llama3.py", line 263, in <module>
    model = build_transformer(args.model, model_size=args.size, quantize=args.quantize, device=device)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/tinygrad/examples/llama3.py", line 186, in build_transformer
    load_state_dict(model, weights, strict=False, consume=True)
  File "/home/user/tinygrad/tinygrad/nn/state.py", line 128, in load_state_dict
    else: v.replace(state_dict[k].shard(mlb.device, mlb.axis)).realize()
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/tinygrad/tinygrad/tensor.py", line 3602, in _wrapper
    ret = fn(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^
  File "/home/user/tinygrad/tinygrad/tensor.py", line 223, in realize
    run_schedule(*self.schedule_with_vars(*lst), do_update_stats=do_update_stats)
  File "/home/user/tinygrad/tinygrad/engine/realize.py", line 214, in run_schedule
    for ei in lower_schedule(schedule):
  File "/home/user/tinygrad/tinygrad/engine/realize.py", line 207, in lower_schedule
    raise e
  File "/home/user/tinygrad/tinygrad/engine/realize.py", line 201, in lower_schedule
    try: yield lower_schedule_item(si)
               ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/tinygrad/tinygrad/engine/realize.py", line 186, in lower_schedule_item
    runner = get_runner(si.outputs[0].device, si.ast)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/tinygrad/tinygrad/engine/realize.py", line 154, in get_runner
    method_cache[ckey] = method_cache[bkey] = ret = CompiledRunner(replace(prg, dname=dname))
                                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/tinygrad/tinygrad/engine/realize.py", line 81, in __init__
    self.lib:bytes = precompiled if precompiled is not None else Device[p.dname].compiler.compile_cached(p.src)
                                                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/tinygrad/tinygrad/device.py", line 187, in compile_cached
    lib = self.compile(src)
          ^^^^^^^^^^^^^^^^^
  File "/home/user/tinygrad/tinygrad/runtime/support/compiler_cuda.py", line 60, in compile
    def compile(self, src:str) -> bytes: return self._compile_program(src, nvrtc.nvrtcGetPTX, nvrtc.nvrtcGetPTXSize)
                                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/tinygrad/tinygrad/runtime/support/compiler_cuda.py", line 56, in _compile_program
    nvrtc_check(nvrtc.nvrtcCompileProgram(prog, len(self.compile_options), to_char_p_p([o.encode() for o in self.compile_options])), prog)
  File "/home/user/tinygrad/tinygrad/runtime/support/compiler_cuda.py", line 16, in nvrtc_check
    raise CompileError(f"Nvrtc Error {status}, {ctypes.string_at(nvrtc.nvrtcGetErrorString(status)).decode()}\n{err_log}")
tinygrad.device.CompileError: Nvrtc Error 6, NVRTC_ERROR_COMPILATION
<null>(12): error: more than one user-defined conversion from "nv_bfloat16" to "half" applies:
            function "__half::__half(float)"
/usr/include/cuda_fp16.hpp(201): here
            function "__half::__half(short)"
/usr/include/cuda_fp16.hpp(214): here
            function "__half::__half(unsigned short)"
/usr/include/cuda_fp16.hpp(215): here
            function "__half::__half(int)"
/usr/include/cuda_fp16.hpp(216): here
            function "__half::__half(unsigned int)"
/usr/include/cuda_fp16.hpp(217): here
            function "__half::__half(long long)"
/usr/include/cuda_fp16.hpp(218): here
            function "__half::__half(unsigned long long)"
/usr/include/cuda_fp16.hpp(219): here
1 error detected in the compilation of "<null>".
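
For context on the error: the candidate list above suggests the generated kernel performs an implicit nv_bfloat16-to-half conversion. __nv_bfloat16 exposes user-defined conversions to float and to several integer types, and __half has a constructor for each of those, so overload resolution finds no single best path. Below is a minimal standalone sketch of the same ambiguity; the file name, kernel name, and build line are hypothetical (the actual failing kernel is generated by tinygrad's CUDA backend):

```cuda
// ambiguous_cast.cu -- illustrative sketch only, not tinygrad's generated code.
// Build (sketch): nvcc -arch=sm_80 -c ambiguous_cast.cu
#include <cuda_fp16.h>
#include <cuda_bf16.h>

__global__ void bf16_to_half(const __nv_bfloat16 *in, __half *out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    // out[i] = in[i];  // error: __nv_bfloat16 converts to float, short, int, ...
    //                  // and __half has a constructor for each, so the implicit
    //                  // conversion is ambiguous (the NVRTC error quoted above).
    out[i] = __float2half(__bfloat162float(in[i]));  // explicit route through float
  }
}
```

If this reading is right, forcing the conversion through an explicit float intermediate in the emitted kernel would resolve the NVRTC failure, and SUPPORT_BF16=0 merely sidesteps that compile error before running into the separate segmentation fault reported above.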