NVIDIA NVML breaks startup in Docker without an NVIDIA GPU

Open 2jiangjiang opened this issue 1 year ago • 5 comments

I have no NVIDIA GPU and run exo in Docker.

  1. docker run ubuntu
  2. git clone exo
  3. apt install build-essential python3 python3-venv python3-pip libgl1-mesa-dev libglib2.0-0
  4. source install.sh
  5. The following error is reported:

None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
Selected inference engine: None


Detected system: Linux
Inference engine name after selection: tinygrad
Using inference engine: TinygradDynamicShardInferenceEngine with shard downloader: HFShardDownloader
[58906]
Chat interface started:
  • http://127.0.0.1:52415
  • http://172.17.0.2:52415
ChatGPT API endpoint served at:
  • http://127.0.0.1:52415/v1/chat/completions
  • http://172.17.0.2:52415/v1/chat/completions

Traceback (most recent call last):
  File "/exo/.venv/lib/python3.12/site-packages/pynvml.py", line 2248, in _LoadNvmlLibrary
    nvmlLib = CDLL("libnvidia-ml.so.1")
              ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/ctypes/__init__.py", line 379, in __init__
    self._handle = _dlopen(self._name, mode)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^
OSError: libnvidia-ml.so.1: cannot open shared object file: No such file or directory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/exo/.venv/bin/exo", line 5, in <module>
    from exo.main import run
  File "/exo/exo/main.py", line 131, in <module>
    node = Node(
           ^^^^^
  File "/exo/exo/orchestration/node.py", line 40, in __init__
    self.device_capabilities = device_capabilities()
                               ^^^^^^^^^^^^^^^^^^^^^
  File "/exo/exo/topology/device_capabilities.py", line 151, in device_capabilities
    return linux_device_capabilities()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/exo/exo/topology/device_capabilities.py", line 189, in linux_device_capabilities
    pynvml.nvmlInit()
  File "/exo/.venv/lib/python3.12/site-packages/pynvml.py", line 2220, in nvmlInit
    nvmlInitWithFlags(0)
  File "/exo/.venv/lib/python3.12/site-packages/pynvml.py", line 2203, in nvmlInitWithFlags
    _LoadNvmlLibrary()
  File "/exo/.venv/lib/python3.12/site-packages/pynvml.py", line 2250, in _LoadNvmlLibrary
    _nvmlCheckReturn(NVML_ERROR_LIBRARY_NOT_FOUND)
  File "/exo/.venv/lib/python3.12/site-packages/pynvml.py", line 979, in _nvmlCheckReturn
    raise NVMLError(ret)
pynvml.NVMLError_LibraryNotFound: NVML Shared Library Not Found

2jiangjiang avatar Dec 14 '24 06:12 2jiangjiang
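
The traceback above shows pynvml.nvmlInit() raising NVMLError_LibraryNotFound because libnvidia-ml.so.1 is not present in the container. As a rough illustration only, one way to guard such a call is to probe NVML first and fall back when it cannot be initialised. The helper name nvml_available is hypothetical and is not part of exo; the sketch assumes only the public pynvml calls nvmlInit, nvmlShutdown and the NVMLError base class.

```python
def nvml_available() -> bool:
    """Return True only if NVML can be loaded and initialised.

    Hypothetical helper, not exo's actual code.
    """
    try:
        import pynvml  # third-party package; may not be installed at all
    except ImportError:
        return False

    try:
        # Raises a subclass of pynvml.NVMLError (for example
        # NVMLError_LibraryNotFound) when libnvidia-ml.so.1 is missing.
        pynvml.nvmlInit()
    except pynvml.NVMLError:
        return False

    # Initialisation succeeded; release the handle, since this was only a probe.
    pynvml.nvmlShutdown()
    return True


if __name__ == "__main__":
    print("NVML usable:", nvml_available())
```

A caller could then skip the NVIDIA-specific capability query whenever this returns False, instead of letting the exception abort startup.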

I checked, and tinygrad.Device.DEFAULT returns "GPU". When I deleted the condition Device.DEFAULT == "GPU" from the NVIDIA case, exo worked. I don't know whether it works properly with oneAPI (Intel GPU).

2jiangjiang avatar Dec 14 '24 07:12 2jiangjiang

I checked, and tinygrad.Device.DEFAULT returns "GPU". When I deleted the condition Device.DEFAULT == "NV" from the NVIDIA case, exo worked. I don't know whether it works properly with oneAPI (Intel GPU).

You'll need to install the prerequisites listed in the README:

For Linux with NVIDIA GPU support (Linux-only, skip if not using Linux or NVIDIA):

AlexCheema avatar Dec 14 '24 20:12 AlexCheema

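I checked, and tinygrad.Device.DEFAULT returns "GPU". When I deleted the condition Device.DEFAULT == "NV" from the NVIDIA case, exo worked. I don't know whether it works properly with oneAPI (Intel GPU).

You'll need to install the prerequisites listed in the README:

For Linux with NVIDIA GPU support (Linux-only, skip if not using Linux or NVIDIA):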

I am not using an NVIDIA GPU; I am using an Intel GPU. The code still enters the NVIDIA case, which is incorrect, so this is a bug and needs a patch.

2jiangjiang avatar Dec 16 '24 03:12 2jiangjiang

I checked, and tinygrad.Device.DEFAULT returns "GPU". When I deleted the condition Device.DEFAULT == "NV" from the NVIDIA case, exo worked. I don't know whether it works properly with oneAPI (Intel GPU).

My mistake: I deleted Device.DEFAULT == "GPU", not Device.DEFAULT == "NV".

2jiangjiang avatar Dec 16 '24 03:12 2jiangjiang
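
To make the reported behaviour concrete: tinygrad reports "GPU" as the default device on this Intel machine, and the NVIDIA case is entered for that name. Below is a minimal sketch of a selection rule that would avoid the crash. Everything here is an assumption for illustration: choose_probe and its return labels are hypothetical, the names "GPU" and "NV" come from this thread, and "CUDA" is assumed to be a third name that maps to NVIDIA hardware.

```python
# Backend names that MAY indicate NVIDIA hardware (assumption, see above).
# "GPU" is ambiguous: in this thread it is reported on an Intel GPU.
MAYBE_NVIDIA_BACKENDS = {"CUDA", "NV", "GPU"}


def choose_probe(default_device: str, nvml_ok: bool) -> str:
    """Pick which capability probe to run.

    default_device: the string reported by tinygrad's Device.DEFAULT.
    nvml_ok: whether NVML could actually be initialised on this host.
    Returns "nvidia" only when both the backend name and NVML agree;
    otherwise returns "generic" so startup does not depend on NVML.
    """
    if default_device in MAYBE_NVIDIA_BACKENDS and nvml_ok:
        return "nvidia"
    return "generic"


if __name__ == "__main__":
    # Intel GPU in Docker: backend is "GPU" but NVML is missing.
    print(choose_probe("GPU", nvml_ok=False))
    # Host with the NVIDIA driver stack installed.
    print(choose_probe("NV", nvml_ok=True))
```

The design choice in this sketch is to treat the backend name as a hint rather than proof, so an ambiguous name such as "GPU" never forces an NVML call on a machine without the NVIDIA driver.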

Ni Hao @2jiangjiang, if you want to go ahead and craft a line with the device name specs for your card, I can add it to the CHIP_FLOPS list for my Intel Arc Support PR...

Should look something like this: https://github.com/exo-explore/exo/pull/791/files#diff-cf2f88e490e7f1b3c6256e98545897497902d040113f29dafc5fc6054b6b2151R144

deftdawg avatar Mar 21 '25 05:03 deftdawg