CUDA Error 2, out of memory

Open fangxuezheng opened this issue 1 year ago • 14 comments

What is this out-of-memory error related to? My graphics card has 8 GB of VRAM, yet loading the model gets to about 8% and then terminates. Can you help analyze the cause? The graphics card information and error messages are in the attached screenshots. Thank you.

fangxuezheng avatar Jul 30 '24 08:07 fangxuezheng

Hey, thanks for the detailed issue.

can you run with DEBUG=2 and send the logs here?

AlexCheema avatar Jul 30 '24 08:07 AlexCheema

Can you make sure you have the latest NVIDIA drivers / CUDA toolkit installed too?

AlexCheema avatar Jul 30 '24 08:07 AlexCheema

Now the model can be loaded again, but it only seems to work some of the time. Also, after the model loads and I chat with it on tinychat, the responses are extremely slow and only a few words long. And why is the computed TFLOPS value always 0? I am using Ubuntu 20.04 under WSL2 on Windows; see the attached screenshots.

fangxuezheng avatar Jul 30 '24 08:07 fangxuezheng

Can you make sure you have the latest NVIDIA drivers / CUDA toolkit installed too?

Aren't these the commands for checking the NVIDIA driver / CUDA toolkit versions: nvidia-smi and nvcc -V?

fangxuezheng avatar Jul 30 '24 08:07 fangxuezheng

Good to know it at least works in WSL. OPENCL is always going to be quite slow. You’ll want to configure your NVIDIA drivers so that your GPU is detected properly and uses the CUDA backend.
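
For reference, here is a quick way to check which accelerator backend tinygrad actually picks inside the exo virtualenv (a minimal sketch, assuming a recent tinygrad that exports Device from the top-level package):

  # run inside the same virtualenv that exo uses
  from tinygrad import Device

  print(Device.DEFAULT)    # expect "CUDA"; "GPU" means the slow OpenCL backend
  try:
      Device["CUDA"]       # force CUDA initialization to surface driver problems early
      print("CUDA backend initialized OK")
  except Exception as e:
      print("CUDA backend failed:", e)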

AlexCheema avatar Jul 30 '24 08:07 AlexCheema

Good to know it at least works in WSL. OPENCL is always going to be quite slow. You’ll want to configure your NVIDIA drivers so that your GPU is detected properly and uses the CUDA backend.

In my previous screenshots, both the NVIDIA driver and the CUDA toolkit in my WSL Ubuntu are 12.5. Are these problems all related to the NVIDIA driver / CUDA toolkit?

fangxuezheng avatar Jul 30 '24 08:07 fangxuezheng

I just bumped up the tinygrad version https://github.com/exo-explore/exo/commit/142682645f2c8b480e1105c1d8c2dc0a9b767815, since it was quite old. Can you try with the latest version?

AlexCheema avatar Jul 30 '24 09:07 AlexCheema

I just bumped up the tinygrad version 1426826, since it was quite old. Can you try with the latest version?

These issues still persist, thank you.

fangxuezheng avatar Jul 30 '24 09:07 fangxuezheng

I'm running into this issue on my WSL2 instance in Windows 10 also. I think it has to do with a limitation in WSL around using pinned system memory: https://docs.nvidia.com/cuda/wsl-user-guide/index.html#known-limitations-for-linux-cuda-apps

I'm assuming that Tinygrad would need to implement a way to control if pinned memory is used or not. Looks like the llama.cpp folks implemented something like that as a workaround: https://github.com/ggerganov/llama.cpp/issues/1230

I think this is the issue in WSL (which is marked as closed but doesn't seem to be fixed): https://github.com/microsoft/WSL/issues/8447
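
For anyone who wants to test that hypothesis directly, below is a small standalone probe (a sketch, not exo/tinygrad code) that asks the CUDA driver API for a large page-locked ("pinned") host allocation via ctypes; the 256 MiB size is arbitrary:

  import ctypes

  try:
      cuda = ctypes.CDLL("libcuda.so")
  except OSError:
      cuda = ctypes.CDLL("libcuda.so.1")

  def check(status, what):
      # 0 = CUDA_SUCCESS, 2 = CUDA_ERROR_OUT_OF_MEMORY
      print(f"{what}: {status}")

  check(cuda.cuInit(0), "cuInit")
  dev, ctx = ctypes.c_int(), ctypes.c_void_p()
  check(cuda.cuDeviceGet(ctypes.byref(dev), 0), "cuDeviceGet")
  check(cuda.cuCtxCreate_v2(ctypes.byref(ctx), 0, dev), "cuCtxCreate")

  # the pinned host allocation the WSL known-limitations doc talks about
  ptr = ctypes.c_void_p()
  check(cuda.cuMemHostAlloc(ctypes.byref(ptr), ctypes.c_size_t(256 * 1024 * 1024), 0), "cuMemHostAlloc(256 MiB)")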

pickettd avatar Oct 04 '24 16:10 pickettd

Is there a solution to this problem? Has anyone tried QEMU/KVM with GPU passthrough? It is a powerful way to run virtual machines with direct access to the GPU.

Shivp1413 avatar Oct 14 '24 13:10 Shivp1413

Is there a solution to this problem? Has anyone tried QEMU/KVM with GPU passthrough? It is a powerful way to run virtual machines with direct access to the GPU.

That is a good idea (trying a different VM/Hypervisor than the WSL approach since the issue seems to be in WSL). My plan for a workaround is to dual-boot to Ubuntu but I haven't gotten around to it yet.

I think it could be a reasonable idea to ask the Tinygrad folks if there is a config flag to not use pinned memory (since I think that is the way llama.cpp got around the limitation) - but I don't think they have a GitHub issue related to this yet

pickettd avatar Oct 14 '24 15:10 pickettd

Wanted to post some updates here just in case other people are in the same situation.

  • I tried the route of an Ubuntu 20.04 VM through Hyper-V with GPU partitioning on my primary Win10-Pro host using the code and advice here https://github.com/seflerZ/oneclick-gpu-pv and here https://gist.github.com/krzys-h/e2def49966aa42bbd3316dfb794f4d6a but I could not get it to work (couldn't get the Nvidia drivers recognized and stable inside of the VM even though it looked like the GPU was shared correctly from the host) -- though supposedly the support is better on a Win11 host (note that InstancePath param is not supported from Win10 hosts when partitioning GPUs to VMs)
  • I also got exo/tinygrad working fine in a separate machine where I installed Ubuntu 20.04 as dual-boot
  • I upgraded my primary machine to Win11 and was immediately able to run exo/tinygrad successfully inside my existing 20.04 WSL without any other changes (like one of the commenters mentioned in this thread https://github.com/microsoft/WSL/issues/8447 ). However, exo inside WSL was then not automatically seeing other nodes connected to the same router (a WSL networking issue I've seen in other contexts too); I was able to resolve this by using Tailscale on all the machines (including inside of WSL2)
  • But then the next time I had to restart WSL in the Win11-Pro host, the CUDA Error 2, out of memory error was back
  • I did manage to get the GPU partitioning working from a Win11-Pro host to an Ubuntu 20.04 guest (tweaking the instructions from https://gist.github.com/krzys-h/e2def49966aa42bbd3316dfb794f4d6a for kernel 5.15, as mentioned by one of the commenters on that gist, and using the i3 window manager instead of Gnome). However, I still had the CUDA error 2 out of memory issue, which I'm guessing is because this partitioning approach in Hyper-V uses the same WSL drivers. One additional thing to note: the Hyper-V VM needs to have enough RAM allocated to fit the whole model. I got a CUDA error 2 out of memory when an 8 GB RAM VM was trying to load a 14 GB model onto a 24 GB video card, and that issue went away when I gave the VM 20 GB of virtualized system RAM (a quick way to sanity-check this is sketched right after this list)
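
Related to that last point, here is a quick sanity check (a sketch, not exo code; the weights directory below is a placeholder, point it at wherever your model files actually live) that compares the downloaded weight size against the system RAM the VM can see:

  import pathlib

  model_dir = pathlib.Path.home() / ".cache" / "huggingface"   # placeholder path
  weight_bytes = sum(f.stat().st_size for f in model_dir.rglob("*.safetensors"))

  with open("/proc/meminfo") as f:
      mem_total_kib = int(next(line for line in f if line.startswith("MemTotal")).split()[1])

  print(f"weights: {weight_bytes / 2**30:.1f} GiB, system RAM: {mem_total_kib / 2**20:.1f} GiB")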

pickettd avatar Oct 17 '24 20:10 pickettd

Hello, team. Has anybody found a solution to avoid CUDA Error 2, out of memory?

loaded weights in 4041.00 ms, 8.03 GB loaded at 1.99 GB/s
Error processing tensor for shard Shard(model_id='mlabonne/Meta-Llama-3.1-8B-Instruct-abliterated', start_layer=0, end_layer=15, n_layers=32): CUDA Error 2, out of memory
Traceback (most recent call last):
  File "/home/ffamax/exo/.venv/lib/python3.10/site-packages/tinygrad/device.py", line 152, in alloc
    try: return super().alloc(size, options)
  File "/home/ffamax/exo/.venv/lib/python3.10/site-packages/tinygrad/device.py", line 136, in alloc
    return self._alloc(size, options if options is not None else BufferOptions())
  File "/home/ffamax/exo/.venv/lib/python3.10/site-packages/tinygrad/runtime/ops_cuda.py", line 68, in _alloc
    return init_c_var(cuda.CUdeviceptr(), lambda x: check(cuda.cuMemAlloc_v2(ctypes.byref(x), size)))
  File "/home/ffamax/exo/.venv/lib/python3.10/site-packages/tinygrad/helpers.py", line 325, in init_c_var
    def init_c_var(ctypes_var, creat_cb): return (creat_cb(ctypes_var), ctypes_var)[1]
  File "/home/ffamax/exo/.venv/lib/python3.10/site-packages/tinygrad/runtime/ops_cuda.py", line 68, in <lambda>
    return init_c_var(cuda.CUdeviceptr(), lambda x: check(cuda.cuMemAlloc_v2(ctypes.byref(x), size)))
  File "/home/ffamax/exo/.venv/lib/python3.10/site-packages/tinygrad/runtime/ops_cuda.py", line 13, in check
    if status != 0: raise RuntimeError(f"CUDA Error {status}, {ctypes.string_at(init_c_var(ctypes.POINTER(ctypes.c_char)(), lambda x: cuda.cuGetErrorString(status, ctypes.byref(x)))).decode()}")  # noqa: E501
RuntimeError: CUDA Error 2, out of memory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ffamax/exo/exo/orchestration/standard_node.py", line 239, in _process_tensor
    result, inference_state, is_finished = await self.inference_engine.infer_tensor(request_id, shard, tensor, inference_state=inference_state)
  File "/home/ffamax/exo/exo/inference/tinygrad/inference.py", line 76, in infer_tensor
    await self.ensure_shard(shard)
  File "/home/ffamax/exo/exo/inference/tinygrad/inference.py", line 97, in ensure_shard
    self.model = await asyncio.get_event_loop().run_in_executor(self.executor, build_transformer, model_path, shard, "8B" if "8b" in shard.model_id.lower() else "70B")
  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/ffamax/exo/exo/inference/tinygrad/inference.py", line 48, in build_transformer
    load_state_dict(model, weights, strict=False, consume=False)  # consume=True
  File "/home/ffamax/exo/.venv/lib/python3.10/site-packages/tinygrad/nn/state.py", line 129, in load_state_dict
    else: v.replace(state_dict[k].to(v.device)).realize()
  File "/home/ffamax/exo/.venv/lib/python3.10/site-packages/tinygrad/tensor.py", line 3500, in _wrapper
    ret = fn(*args, **kwargs)
  File "/home/ffamax/exo/.venv/lib/python3.10/site-packages/tinygrad/tensor.py", line 213, in realize
    run_schedule(*self.schedule_with_vars(*lst), do_update_stats=do_update_stats)
  File "/home/ffamax/exo/.venv/lib/python3.10/site-packages/tinygrad/engine/realize.py", line 224, in run_schedule
    ei.run(var_vals, do_update_stats=do_update_stats)
  File "/home/ffamax/exo/.venv/lib/python3.10/site-packages/tinygrad/engine/realize.py", line 173, in run
    bufs = [cast(Buffer, x) for x in self.bufs] if jit else [cast(Buffer, x).ensure_allocated() for x in self.bufs]
  File "/home/ffamax/exo/.venv/lib/python3.10/site-packages/tinygrad/engine/realize.py", line 173, in <listcomp>
    bufs = [cast(Buffer, x) for x in self.bufs] if jit else [cast(Buffer, x).ensure_allocated() for x in self.bufs]
  File "/home/ffamax/exo/.venv/lib/python3.10/site-packages/tinygrad/device.py", line 77, in ensure_allocated
    def ensure_allocated(self) -> Buffer: return self.allocate() if not hasattr(self, '_buf') else self
  File "/home/ffamax/exo/.venv/lib/python3.10/site-packages/tinygrad/device.py", line 86, in allocate
    self._buf = opaque if opaque is not None else self.allocator.alloc(self.nbytes, self.options)
  File "/home/ffamax/exo/.venv/lib/python3.10/site-packages/tinygrad/device.py", line 155, in alloc
    return super().alloc(size, options)
  File "/home/ffamax/exo/.venv/lib/python3.10/site-packages/tinygrad/device.py", line 136, in alloc
    return self._alloc(size, options if options is not None else BufferOptions())
  File "/home/ffamax/exo/.venv/lib/python3.10/site-packages/tinygrad/runtime/ops_cuda.py", line 68, in _alloc
    return init_c_var(cuda.CUdeviceptr(), lambda x: check(cuda.cuMemAlloc_v2(ctypes.byref(x), size)))
  File "/home/ffamax/exo/.venv/lib/python3.10/site-packages/tinygrad/helpers.py", line 325, in init_c_var
    def init_c_var(ctypes_var, creat_cb): return (creat_cb(ctypes_var), ctypes_var)[1]
  File "/home/ffamax/exo/.venv/lib/python3.10/site-packages/tinygrad/runtime/ops_cuda.py", line 68, in <lambda>
    return init_c_var(cuda.CUdeviceptr(), lambda x: check(cuda.cuMemAlloc_v2(ctypes.byref(x), size)))
  File "/home/ffamax/exo/.venv/lib/python3.10/site-packages/tinygrad/runtime/ops_cuda.py", line 13, in check
    if status != 0: raise RuntimeError(f"CUDA Error {status}, {ctypes.string_at(init_c_var(ctypes.POINTER(ctypes.c_char)(), lambda x: cuda.cuGetErrorString(status, ctypes.byref(x)))).decode()}")  # noqa: E501
RuntimeError: CUDA Error 2, out of memory
SendTensor tensor shard=Shard(model_id='mlabonne/Meta-Llama-3.1-8B-Instruct-abliterated', start_layer=13, end_layer=21, n_layers=32) tensor=array([[[ 0.1719  ,  0.2925  , -0.5254  , ...,  0.508   ,  0.413   ,
         -0.2148  ],
        [ 0.1719  ,  0.2925  , -0.5254  , ...,  0.508   ,  0.413   ,
         -0.2147  ],
        [ 0.0528  ,  0.006165,  0.02719 , ...,  0.10626 ,  0.01511 ,
          0.00949 ],
        ...,
        [-0.004456,  0.09314 ,  0.00821 , ..., -0.04398 , -0.02438 ,
         -0.0692  ],
        [-0.02142 ,  0.0279  , -0.0904  , ..., -0.005966, -0.03247 ,
         -0.0575  ],
        [-0.0843  , -0.0978  , -0.00925 , ..., -0.01285 , -0.05417 ,
         -0.0532  ]]], dtype=float16) request_id='cda2e3d0-2409-4e39-938d-029d198e67de' result: None

FFAMax avatar Oct 28 '24 03:10 FFAMax

In my case the FLOPS for my GPUs were not defined, so exo could not proceed properly. Once the FLOPS were defined, it was able to split the model according to the available VRAM across all GPUs. Example: https://github.com/exo-explore/exo/pull/393/files
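
If I'm reading that PR right, the pattern is adding an entry for your GPU to the CHIP_FLOPS table in exo/topology/device_capabilities.py, so the node advertises non-zero TFLOPS and the partitioner can weight it against the other nodes. A sketch of such an entry (the model name and numbers are placeholders, not datasheet values):

  # fragment of exo/topology/device_capabilities.py
  CHIP_FLOPS = {
    # ... existing entries ...
    # key: the GPU model string as exo detects it; values: peak throughput in TFLOPS
    "YOUR GPU MODEL": DeviceFlops(fp32=10.0 * TFLOPS, fp16=20.0 * TFLOPS, int8=40.0 * TFLOPS),
  }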

FFAMax avatar Oct 28 '24 04:10 FFAMax