Loading models locally
When I switch models for testing, I get the following error:

```
"object": "error",
"message": "Only openbuddy-zephyr-7b allowed now, your model llama-3-8b",
"code": 40301
```

It says only the openbuddy-zephyr-7b model is allowed. Is there a simpler way to use a local model?
You can hardcode the model path by modifying three places:

exo/exo/api/chatgpt_api.py: resolve_tinygrad_tokenizer function

```python
def resolve_tinygrad_tokenizer(model_id: str):
  if model_id == "llama3-8b-sfr":
    # here i modified
    # return AutoTokenizer.from_pretrained("TriAiExperiments/SFR-Iterative-DPO-LLaMA-3-8B-R")
    return AutoTokenizer.from_pretrained("/nasroot/models/Meta-Llama-3-8B")
  elif model_id == "llama3-70b-sfr":
    return AutoTokenizer.from_pretrained("TriAiExperiments/SFR-Iterative-DPO-LLaMA-3-8B-R")
  else:
    raise ValueError(f"tinygrad doesnt currently support arbitrary model downloading. unsupported model: {model_id}")
```

exo/exo/api/chatgpt_api.py: resolve_tokenizer function
```python
async def resolve_tokenizer(model_id: str):
  try:
    # if DEBUG >= 2: print(f"Trying AutoTokenizer for {model_id}")
    # here i modified.
    # return AutoTokenizer.from_pretrained(model_id)
    if DEBUG >= 2: print(f"Trying AutoTokenizer for /nasroot/models/Meta-Llama-3-8B")
    return AutoTokenizer.from_pretrained("/nasroot/models/Meta-Llama-3-8B")
  except Exception as e:
    if DEBUG >= 2: print(f"Failed to load tokenizer for {model_id}. Falling back to tinygrad tokenizer. Error: {e}")
    import traceback
    if DEBUG >= 2: print(traceback.format_exc())
  try:
    if DEBUG >= 2: print(f"Trying tinygrad tokenizer for {model_id}")
    return resolve_tinygrad_tokenizer(model_id)
  except Exception as e:
    if DEBUG >= 2: print(f"Failed again to load tokenizer for {model_id}. Falling back to mlx tokenizer. Error: {e}")
    import traceback
    if DEBUG >= 2: print(traceback.format_exc())
  if DEBUG >= 2: print(f"Trying mlx tokenizer for {model_id}")
  from exo.inference.mlx.sharded_utils import get_model_path, load_tokenizer
  return load_tokenizer(await get_model_path(model_id))
```

exo/inference/tinygrad/inference.py: ensure_shard function
```python
async def ensure_shard(self, shard: Shard):
  if self.shard == shard:
    return

  model_path = Path(shard.model_id)
  models_dir = Path(_cache_dir) / "tinygrad" / "downloads"
  model_path = models_dir / shard.model_id
  size = "8B"
  # here i modified.
  model_path = Path("/nasroot/models/Meta-Llama-3-8B")
  if Path(model_path / "model.safetensors.index.json").exists():
    model = model_path
  else:
    if DEBUG >= 2: print(f"Downloading tinygrad model {shard.model_id}...")
    if shard.model_id.lower().find("llama3-8b-sfr") != -1:
      await fetch_async(
        "https://huggingface.co/bofenghuang/Meta-Llama-3-8B/resolve/main/original/tokenizer.model",
        "tokenizer.model",
        subdir=shard.model_id,
      )
```
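If editing three separate call sites gets unwieldy, the hardcoded string could instead be read once from an environment variable, so each node only needs the variable set. A minimal sketch, assuming a variable name `EXO_LOCAL_MODEL_PATH` and helper function that are not part of exo itself:

```python
import os
from pathlib import Path

def resolve_local_model_path(default: str = "/nasroot/models/Meta-Llama-3-8B") -> Path:
    """Read the local model directory from EXO_LOCAL_MODEL_PATH,
    falling back to the hardcoded default used in the patches above."""
    return Path(os.environ.get("EXO_LOCAL_MODEL_PATH", default))
```

The three modified call sites would then call `resolve_local_model_path()` instead of repeating the literal path.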
Hope this is a useful reference.
@artistlu Thanks for the suggestion, but I still haven't succeeded. I can't connect to huggingface, so I was unable to get the project running.
I can't connect to huggingface either. I downloaded the model from https://www.modelscope.cn/models/LLM-Research/Meta-Llama-3-8B/files, placed it in the same fixed directory on every node, and after hardcoding the three places above it loads fine.
Hope this helps.
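After downloading, a quick way to confirm the directory is complete before hardcoding its path. The required-file list below is an assumption; extend it to whatever your checkout of exo actually reads:

```python
from pathlib import Path

# Files the loaders above look for; treat this list as an assumption
# and adjust it to match your version of exo.
REQUIRED_FILES = ["tokenizer.model", "model.safetensors.index.json"]

def missing_model_files(model_dir: str) -> list[str]:
    """Return the names of required files missing from model_dir."""
    root = Path(model_dir)
    return [name for name in REQUIRED_FILES if not (root / name).exists()]
```

An empty result means the directory is ready to be used as the hardcoded path on that node.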
@artistlu Thanks for your help, my experiment succeeded, but I've run into a new problem I'd like to ask about. I've found that the model gets loaded on both nodes, but my other node only has 11 GB of memory, so it runs out of memory during loading. Doesn't that mean more VRAM is used overall? Also, even when I specify that all cards should load the model, only one card is ever used. The error output is below:
```
ram used: 4.84 GB, layers.11.attention.wv.weight : 35%|█████████████████████████████████▌ | 102/292 [00:04<00:08, 21.75it/s]
ram used: 4.85 GB, layers.11.attention.wo.weight : 35%|█████████████████████████████████▊ | 103/292 [00:04<00:08, 21.79it/s]
ram used: 4.88 GB, layers.11.feed_forward.w1.weight : 36%|██████████████████████████████████▏ | 104/292 [00:04<00:08, 21.54it/s]
ram used: 5.00 GB, layers.11.feed_forward.w2.weight : 36%|██████████████████████████████████▌ | 105/292 [00:04<00:08, 21.32it/s]
ram used: 5.12 GB, layers.11.feed_forward.w3.weight : 36%|██████████████████████████████████▊ | 106/292 [00:05<00:08, 21.13it/s]
ram used: 5.23 GB, layers.11.attention_norm.weight : 37%|███████████████████████████████████▏ | 107/292 [00:05<00:08, 21.29it/s]
ram used: 5.23 GB, layers.11.ffn_norm.weight : 37%|███████████████████████████████████▌ | 108/292 [00:05<00:08, 21.47it/s]
ram used: 5.23 GB, layers.12.attention.wq.weight : 37%|███████████████████████████████████▊ | 109/292 [00:05<00:08, 21.53it/s]
ram used: 5.27 GB, layers.12.attention.wk.weight : 38%|████████████████████████████████████▏ | 110/292 [00:05<00:08, 21.67it/s]
ram used: 5.28 GB, layers.12.attention.wv.weight : 38%|████████████████████████████████████▍ | 111/292 [00:05<00:08, 21.81it/s]
ram used: 5.29 GB, layers.12.attention.wo.weight : 38%|████████████████████████████████████▊ | 112/292 [00:05<00:08, 21.87it/s]
ram used: 5.32 GB, layers.12.feed_forward.w1.weight : 39%|█████████████████████████████████████▏ | 113/292 [00:05<00:08, 21.69it/s]
loaded weights in 5218.06 ms, 5.44 GB loaded at 1.04 GB/s
Error processing tensor for shard Shard(model_id='llama3-8b-sfr', start_layer=0, end_layer=31, n_layers=32): CUDA Error 2, out of memory
Traceback (most recent call last):
  File "/root/anaconda3/envs/jky_exo/lib/python3.12/site-packages/tinygrad/device.py", line 146, in alloc
    try: return super().alloc(size, options)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/jky_exo/lib/python3.12/site-packages/tinygrad/device.py", line 134, in alloc
    return self._alloc(size, options if options is not None else BufferOptions())
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/jky_exo/lib/python3.12/site-packages/tinygrad/runtime/ops_cuda.py", line 114, in _alloc
    return init_c_var(cuda.CUdeviceptr(), lambda x: check(cuda.cuMemAlloc_v2(ctypes.byref(x), size)))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/jky_exo/lib/python3.12/site-packages/tinygrad/helpers.py", line 291, in init_c_var
    def init_c_var(ctypes_var, creat_cb): return (creat_cb(ctypes_var), ctypes_var)[1]
                                                  ^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/jky_exo/lib/python3.12/site-packages/tinygrad/runtime/ops_cuda.py", line 114, in <lambda>
    return init_c_var(cuda.CUdeviceptr(), lambda x: check(cuda.cuMemAlloc_v2(ctypes.byref(x), size)))
                                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/jky_exo/lib/python3.12/site-packages/tinygrad/runtime/ops_cuda.py", line 26, in check
    if status != 0: raise RuntimeError(f"CUDA Error {status}, {ctypes.string_at(init_c_var(ctypes.POINTER(ctypes.c_char)(), lambda x: cuda.cuGetErrorString(status, ctypes.byref(x)))).decode()}") # noqa: E501
RuntimeError: CUDA Error 2, out of memory
```
The other card shows loading completed. My understanding was that the nodes should share their VRAM, no?
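A note on that expectation: exo partitions VRAM rather than pooling it, assigning each node a contiguous slice of layers (its default strategy weights the split by node memory). The Shard in the traceback (start_layer=0, end_layer=31) shows this node was handed all 32 layers. An illustrative sketch of memory-weighted partitioning, to convey the idea rather than exo's actual code:

```python
# Illustrative sketch, not exo's implementation: split n_layers across
# nodes in proportion to each node's memory. Each node loads only its
# own slice; VRAM is partitioned across nodes, never pooled.
def partition_layers(n_layers: int, node_mem_gb: list[float]) -> list[tuple[int, int]]:
    total = sum(node_mem_gb)
    ranges, start = [], 0
    for i, mem in enumerate(node_mem_gb):
        # The last node takes whatever remains so every layer is covered.
        end = n_layers if i == len(node_mem_gb) - 1 else start + round(n_layers * mem / total)
        ranges.append((start, end - 1))
        start = end
    return ranges

# A 24 GB node and an 11 GB node split a 32-layer model roughly 22/10:
# partition_layers(32, [24.0, 11.0]) -> [(0, 21), (22, 31)]
```

If a shard spanning all layers lands on the 11 GB node, the out-of-memory error above is the expected outcome; the fix is making sure the partitioning actually splits the layer range across both nodes.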