Mac AirLLM inference with tigerbot-70b-chat-v2
from sys import platform
from airllm import AutoModel
import mlx.core as mx
assert platform == "darwin", "this example is supposed to be run on mac os"
# model = AutoModel.from_pretrained("01-ai/Yi-34B")#"garage-bAInd/Platypus2-7B")
model = AutoModel.from_pretrained("/Users/ageorgios/Models/tigerbot-70b-chat-v2")
input_text = [
    'Tell me the purpose of life',
]
MAX_LENGTH = 128
input_tokens = model.tokenizer(input_text,
    return_tensors="np",
    return_attention_mask=False,
    truncation=True,
    max_length=MAX_LENGTH,
    padding=False)
input_tokens
generation_output = model.generate(
    mx.array(input_tokens['input_ids']),
    max_new_tokens=3,
    use_cache=True,
    return_dict_in_generate=True)
print(generation_output)
This is my code, and I think the output is not correct.
(.venv) ageorgios@mac airllm % python main.py
/Users/ageorgios/Models/airllm/.venv/lib/python3.11/site-packages/bitsandbytes/cextension.py:34: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
warn("The installed version of bitsandbytes was compiled without GPU support. "
'NoneType' object has no attribute 'cadam32bit_grad_fp32'
saved layers already found in /Users/ageorgios/Models/tigerbot-70b-chat-v2/splitted_model
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using `tokenizers` before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
running layers: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 80/80 [00:56<00:00, 1.40it/s]
running layers: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 80/80 [00:55<00:00, 1.44it/s]
running layers: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 80/80 [00:56<00:00, 1.41it/s]
.</s>
(.venv) ageorgios@mac airllm %
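The tokenizers warning in the log names an environment variable that silences it. A minimal sketch, assuming the lines go at the very top of main.py, before airllm is imported and before any tokenizer is used (the placement is my assumption, not something the log states):

```python
import os

# Set the variable named in the warning before any tokenizer is created,
# so that processes forked later inherit the setting.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```

This only removes the warning. It does not change the generated text.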