Meta just released their LLaMA model family (https://github.com/facebookresearch/llama). Can we get support for it? They claim that the 13B model is better than the 175B GPT-3 model.

This is how to use the model in the web UI:

LLaMA TUTORIAL

-- oobabooga, March 9th, 2023

Feb 27 '23 18:02 ye7iaserag

The models are not public yet, unfortunately. You have to request access.

Feb 27 '23 18:02 oobabooga

Psst. Somebody leaked them https://twitter.com/Teknium1/status/1631322496388722689

Mar 02 '23 18:03 MetaIX

Of course, the weights themselves are closed. But the code from the repository should be enough to add support. And where the end users will download the weights from is their problem.

Mar 02 '23 19:03 Sumanai

https://github.com/facebookresearch/llama/pull/73

Mar 03 '23 15:03 catboxanon

Done https://github.com/oobabooga/text-generation-webui/commit/ea5c5eb3daa5d3f319f4a6dbc6d02b7f993d1881

Install LLaMa as in their README:

conda activate textgen
git clone https://github.com/facebookresearch/llama
cd llama
pip install -r requirements.txt
pip install -e .

Put the model that you downloaded using your academic credentials in models/LLaMA-7B (the folder name must start with llama).
Put a copy of tokenizer.model and tokenizer_checklist.chk inside that folder too.
Start the web UI. I have tested with:

python server.py --no-stream --model LLaMA-7B
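The folder layout described above can be sanity-checked before launching. This is only a sketch: the function name check_llama_folder is made up for illustration, and treating the "starts with llama" rule as case-insensitive is an assumption based on the LLaMA-7B example.

```python
from pathlib import Path

# Tokenizer files that the steps above say must sit inside the model folder.
REQUIRED_TOKENIZER_FILES = ("tokenizer.model", "tokenizer_checklist.chk")


def check_llama_folder(model_dir: Path) -> list[str]:
    """Return a list of problems found with the model folder (empty if none)."""
    problems = []
    # Assumption: the prefix check is case-insensitive, since "LLaMA-7B" is accepted.
    if not model_dir.name.lower().startswith("llama"):
        problems.append(f"folder name {model_dir.name!r} does not start with 'llama'")
    for name in REQUIRED_TOKENIZER_FILES:
        if not (model_dir / name).is_file():
            problems.append(f"missing {name}")
    return problems


if __name__ == "__main__":
    for problem in check_llama_folder(Path("models/LLaMA-7B")):
        print(problem)
```

An empty result only means the layout matches the steps above; it says nothing about whether the weights themselves are intact.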

Mar 03 '23 17:03 oobabooga

Getting a CUDA out-of-memory error. I assume lowmem support isn't included yet?

Mar 03 '23 21:03 moorehousew

This isn't part of Hugging Face yet, so it doesn't have access to 8-bit loading and CPU offloading.

The 7B model uses 14963MiB VRAM on my machine. Reducing the max_seq_len parameter from 2048 to 512 makes this go down to 13843MiB.
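A rough back-of-the-envelope check of those numbers, as a sketch only. The reference LLaMA code preallocates a key cache and a value cache per layer, sized by max_seq_len, so shrinking it frees memory. The 7B dimensions (32 layers, width 4096) are from the LLaMA paper; a batch size of 1 and fp16 storage (2 bytes per value) are assumptions about this setup.

```python
def kv_cache_mib(max_seq_len: int, n_layers: int = 32, dim: int = 4096,
                 max_batch_size: int = 1, bytes_per_value: int = 2) -> float:
    """Estimate the preallocated key/value cache size in MiB."""
    # One key tensor and one value tensor per layer, each batch x seq x dim.
    n_values = 2 * n_layers * max_batch_size * max_seq_len * dim
    return n_values * bytes_per_value / 2**20


if __name__ == "__main__":
    estimated_saving = kv_cache_mib(2048) - kv_cache_mib(512)
    measured_saving = 14963 - 13843  # MiB, from the figures reported above
    print(f"estimated saving: {estimated_saving:.0f} MiB")
    print(f"measured saving:  {measured_saving} MiB")
```

Under these assumptions the estimate is 768 MiB against 1120 MiB measured, so the cache would explain most of the difference but not all of it.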

Mar 03 '23 23:03 oobabooga

I get a bunch of dependency errors when launching despite setting up LLaMa beforehand (definitely my own fault and probably because of a messed up conda environment)

ModuleNotFoundError: No module named 'fire'
ModuleNotFoundError: No module named 'fairscale'

etc. Any chance you could include these in the default webui requirements assuming they aren't too heavy?
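For anyone hitting the same errors, a quick way to see which of these packages the active environment is missing, without starting the web UI. This is a sketch using only the standard library; the list of names comes from the errors above and is probably not complete.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Module names taken from the ModuleNotFoundError messages above.
    print(missing_modules(["fire", "fairscale"]))
```

Run it with the same Python interpreter (the same conda environment) that launches server.py, otherwise the result is not meaningful.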

Mar 04 '23 00:03 musicurgy

@musicurgy did you try pip install -r requirements.txt as in https://github.com/oobabooga/text-generation-webui/issues/147#issuecomment-1453880733?

Mar 04 '23 00:03 oobabooga

Yeah, after a bit of a struggle I ended up getting it working by just copying all the dependencies into the webui folder. So far the model is really interesting. Thanks for supporting it.

Mar 04 '23 01:03 musicurgy

Awesome stuff. I'm able to load LLaMA-7b but trying to load LLaMA-13b crashes with the error:

Traceback (most recent call last):
  File "/home/user/Documents/oobabooga/text-generation-webui/server.py", line 189, in <module>
    shared.model, shared.tokenizer = load_model(shared.model_name)
  File "/home/user/Documents/oobabooga/text-generation-webui/modules/models.py", line 94, in load_model
    model = LLaMAModel.from_pretrained(Path(f'models/{model_name}'))
  File "/home/user/Documents/oobabooga/text-generation-webui/modules/LLaMA.py", line 82, in from_pretrained
    generator = load(
  File "/home/user/Documents/oobabooga/text-generation-webui/modules/LLaMA.py", line 44, in load
    assert world_size == len(
AssertionError: Loading a checkpoint for MP=2 but world size is 1

Mar 04 '23 01:03 generic-username0718

Anyone reading this: you can get past the issue above by changing the world_size variable found in modules/LLaMA.py like this:

def setup_model_parallel() -> Tuple[int, int]:
    local_rank = int(os.environ.get("LOCAL_RANK", -1))
    world_size = 2
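To see which value the assertion expects, one can count the checkpoint shards in the model folder. This sketch assumes that load() compares world_size against the number of *.pth files, as the reference LLaMA repository did at the time; required_world_size is a made-up helper name.

```python
from pathlib import Path


def required_world_size(ckpt_dir: Path) -> int:
    """Count the *.pth shard files; the loader is assumed to expect one process per shard."""
    return len(sorted(ckpt_dir.glob("*.pth")))


if __name__ == "__main__":
    print(required_world_size(Path("models/LLaMA-13B")))
```

Note that hard-coding world_size only satisfies the assertion. Judging by later replies in this thread, a single process may still fail to load a checkpoint that was split for two.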

My issue now is I'm running out of VRAM. I'm running dual 3090s and should be able to load the model if it's split among the cards...

Mar 04 '23 02:03 generic-username0718

Is there a parameter I need to pass to oobabooga to tell it to split the model among my two 3090 gpus?

Mar 04 '23 02:03 generic-username0718

Try --gpu-memory 10 5, at least that's what the README says.

Mar 04 '23 02:03 Morb0

Sorry super dumb but do I pass this to start-webui.sh? Like

sh start-webui.sh --gpu-memory 10 5?

Mar 04 '23 02:03 generic-username0718

Ah, that should work, but if not, edit the file and add it at the end of the line: call python server.py --auto-devices --cai-chat

Mar 04 '23 02:03 Morb0

Thanks friend! I was able to get it with call python server.py --gpu-memory 20 20 --cai-chat

Mar 04 '23 03:03 generic-username0718

--gpu-memory should have no effect on LLaMA. This is for models loaded using the from_pretrained function from HF.

For LLaMA, the correct way is to change the global variables inside LLaMA.py like @generic-username0718 did, but I am not very familiar with the parameters yet.
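For context, a sketch of why the flag only matters on the Hugging Face path: from_pretrained accepts a max_memory mapping (used together with device_map) that caps how much each device may hold. How the web UI converts --gpu-memory into that mapping is an assumption here, and build_max_memory is a made-up name.

```python
def build_max_memory(gpu_memory_gib):
    """Map GPU index -> memory cap string, in the format Hugging Face's max_memory accepts."""
    return {index: f"{gib}GiB" for index, gib in enumerate(gpu_memory_gib)}


if __name__ == "__main__":
    # Corresponds to passing "--gpu-memory 10 5" on the command line.
    print(build_max_memory([10, 5]))
```

Since the LLaMA loader in modules/LLaMA.py does not go through from_pretrained from HF, nothing there would read such a mapping.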

Mar 04 '23 03:03 oobabooga

I was starting to question my sanity... I think I accidentally was loading opt-13b instead... Sorry if I got people's hopes up

I'm still trying to split the model

Edit: Looks like they've already asked this here: https://github.com/facebookresearch/llama/issues/88

Mar 04 '23 03:03 generic-username0718

Bad news for those hoping to run 13B:

Loading LLaMA-13B...
[W ProcessGroupGloo.cpp:694] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator())
> initializing model parallel with size 2
> initializing ddp with size 1
> initializing pipeline with size 1
Loading
Traceback (most recent call last):
  File "/UI/text-generation-webui/server.py", line 188, in <module>
    shared.model, shared.tokenizer = load_model(shared.model_name)
  File "/UI/text-generation-webui/modules/models.py", line 94, in load_model
    model = LLaMAModel.from_pretrained(Path(f'models/{model_name}'))
  File "/UI/text-generation-webui/modules/LLaMA.py", line 82, in from_pretrained
    generator = load(
  File "/UI/text-generation-webui/modules/LLaMA.py", line 61, in load
    model.load_state_dict(checkpoint, strict=False)
  File "/UI/text-generation-webui/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1671, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for Transformer:
        size mismatch for tok_embeddings.weight: copying a param with shape torch.Size([32000, 2560]) from checkpoint, the shape in current model is torch.Size([32000, 5120]).
        size mismatch for layers.0.attention.wq.weight: copying a param with shape torch.Size([2560, 5120]) from checkpoint, the shape in current model is torch.Size([5120, 5120]).
        size mismatch for layers.0.attention.wk.weight: copying a param with shape torch.Size([2560, 5120]) from checkpoint, the shape in current model is torch.Size([5120, 5120]).
        size mismatch for layers.0.attention.wv.weight: copying a param with shape torch.Size([2560, 5120]) from checkpoint, the shape in current model is torch.Size([5120, 5120]).
        size mismatch for layers.0.attention.wo.weight: copying a param with shape torch.Size([5120, 2560]) from checkpoint, the shape in current model is torch.Size([5120, 5120]).
        size mismatch for layers.0.feed_forward.w1.weight: copying a param with shape torch.Size([6912, 5120]) from checkpoint, the shape in current model is torch.Size([13824, 5120]).
        size mismatch for layers.0.feed_forward.w2.weight: copying a param with shape torch.Size([5120, 6912]) from checkpoint, the shape in current model is torch.Size([5120, 13824]).
        size mismatch for layers.0.feed_forward.w3.weight: copying a param with shape torch.Size([6912, 5120]) from checkpoint, the shape in current model is torch.Size([13824, 5120]).
        size mismatch for layers.1.attention.wq.weight: copying a param with shape torch.Size([2560, 5120]) from checkpoint, the shape in current model is torch.Size([5120, 5120]).
        etc........
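One way to read the mismatches above: every checkpoint tensor is exactly half the size of the model's tensor along a single axis, which is what you would expect if the file is one shard of a checkpoint split in two and the model was built unsplit. A small sketch to check that reading against the logged shapes; shard_factor is a made-up helper and this interpretation is an assumption, not something confirmed in the thread.

```python
def shard_factor(checkpoint_shape, model_shape):
    """Return N if model_shape is checkpoint_shape scaled by N along exactly one axis.

    Returns 1 for identical shapes and None when the shapes are not related that way.
    """
    if len(checkpoint_shape) != len(model_shape):
        return None
    factor = 1
    for ckpt, full in zip(checkpoint_shape, model_shape):
        if ckpt == full:
            continue
        # A second differing axis, or a non-integer ratio, rules out simple sharding.
        if factor != 1 or ckpt == 0 or full % ckpt != 0:
            return None
        factor = full // ckpt
    return factor


if __name__ == "__main__":
    # Shapes copied from the error log above: (checkpoint shape, model shape).
    logged = [
        ((32000, 2560), (32000, 5120)),  # tok_embeddings.weight
        ((2560, 5120), (5120, 5120)),    # attention.wq.weight
        ((5120, 2560), (5120, 5120)),    # attention.wo.weight
        ((6912, 5120), (13824, 5120)),   # feed_forward.w1.weight
    ]
    print({shard_factor(ckpt, full) for ckpt, full in logged})
```

All of the logged pairs give a factor of 2, consistent with a two-way split.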

Mar 04 '23 05:03 USBhost

Did you set MP to '2' here?

https://github.com/oobabooga/text-generation-webui/blob/main/modules/LLaMA.py#L18

See

https://github.com/facebookresearch/llama#inference

Mar 04 '23 05:03 oobabooga

LLaMA-7B can be run on CPU instead of GPU using this fork of the LLaMA repo: https://github.com/markasoftware/llama-cpu

To quote the author: "On a Ryzen 7900X, the 7B model is able to infer several words per second, quite a lot better than you'd expect!"

Mar 04 '23 05:03 MarkSchmidty

Did you set MP to '2' here?

https://github.com/oobabooga/text-generation-webui/blob/main/modules/LLaMA.py#L18

See

https://github.com/facebookresearch/llama#inference

from llama import LLaMA, ModelArgs, Tokenizer, Transformer

os.environ['RANK'] = '0'
os.environ['WORLD_SIZE'] = '1'
os.environ['MP'] = '2'
os.environ['MASTER_ADDR'] = '127.0.0.1'
os.environ['MASTER_PORT'] = '2223'

def setup_model_parallel() -> Tuple[int, int]:
    local_rank = int(os.environ.get("LOCAL_RANK", -1))
    world_size = 2

    torch.distributed.init_process_group("gloo")
    initialize_model_parallel(world_size)
    torch.cuda.set_device(local_rank)

    # seed must be the same in all processes
    torch.manual_seed(1)
    return local_rank, world_size

I sure did. Also those os.environ don't seem to work. 7B loads fine. PS my GPU is an A6000
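One thing that stands out in the snippet is that WORLD_SIZE is set to '1' in the environment while world_size is hard-coded to 2, and LOCAL_RANK is never set, so it falls back to -1. For comparison, here is a sketch that reads both values from the environment, which is where launchers such as torchrun put them for each process. read_parallel_env is a made-up name, and whether this alone fixes the 13B error is not established in this thread.

```python
import os
from typing import Mapping, Tuple


def read_parallel_env(env: Mapping[str, str] = os.environ) -> Tuple[int, int]:
    """Read (local_rank, world_size) from the environment, or -1 for each one that is unset."""
    local_rank = int(env.get("LOCAL_RANK", -1))
    world_size = int(env.get("WORLD_SIZE", -1))
    return local_rank, world_size


if __name__ == "__main__":
    local_rank, world_size = read_parallel_env()
    print(f"local_rank={local_rank} world_size={world_size}")
```

With this approach the process count and the reported world size cannot disagree, because both come from the launcher.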

Mar 04 '23 07:03 USBhost

I also get the same error, with 13b.

Mar 04 '23 07:03 TheZennou

Anyone else getting really poor results on 7B? I've tried many prompts and parameter variations and it generally ends up as mostly nonsense with lots of repetition. It might just be the model but I saw some 7B output examples posted online that seemed way better than anything I was getting.

Mar 04 '23 07:03 hdelattre

Is it possible to reduce computation precision on CPU? Down to 8 bit?

Mar 04 '23 08:03 BarsMonster

Someone made a fork of the LLaMA GitHub repo that apparently runs in 8-bit: https://github.com/tloen/llama-int8

Zero idea if it works or anything.

Mar 04 '23 09:03 Manimap

I'm getting the following error when trying to run the 7B model on my RTX 3090. Can someone help?

C:\Users\Username\Documents\Git\text-generation-webui>python server.py --listen --no-stream --model LLaMA-7B
Loading LLaMA-7B...
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [3ca52znvmj.adobe.io]:2223 (system error: 10049 - The requested address is not valid in its context.).
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [3ca52znvmj.adobe.io]:2223 (system error: 10049 - The requested address is not valid in its context.).
> initializing model parallel with size 1
> initializing ddp with size 1
> initializing pipeline with size 1
Loading
Traceback (most recent call last):
  File "C:\Users\Username\Documents\Git\text-generation-webui\server.py", line 188, in <module>
    shared.model, shared.tokenizer = load_model(shared.model_name)
  File "C:\Users\Username\Documents\Git\text-generation-webui\modules\models.py", line 94, in load_model
    model = LLaMAModel.from_pretrained(Path(f'models/{model_name}'))
  File "C:\Users\Username\Documents\Git\text-generation-webui\modules\LLaMA.py", line 82, in from_pretrained
    generator = load(
  File "C:\Users\Username\Documents\Git\text-generation-webui\modules\LLaMA.py", line 58, in load
    torch.set_default_tensor_type(torch.cuda.HalfTensor)
  File "C:\Users\Username\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\__init__.py", line 348, in set_default_tensor_type
    _C._set_default_tensor_type(t)
TypeError: type torch.cuda.HalfTensor not available. Torch not compiled with CUDA enabled.

Mar 04 '23 09:03 hopto-dot

@hopto-dot Go here and run the pip command for the 11.7 build on your OS: https://pytorch.org/get-started/locally/
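To confirm what the traceback suggests before and after reinstalling, a short check of the installed build. This is a sketch: cuda_status is a made-up name, and it relies only on torch.version.cuda (None on CPU-only builds) and torch.cuda.is_available().

```python
import importlib.util


def cuda_status() -> str:
    """Describe whether the installed PyTorch build can use CUDA."""
    if importlib.util.find_spec("torch") is None:
        return "torch is not installed"
    import torch

    if torch.version.cuda is None:
        # CPU-only wheels report no CUDA version at all.
        return f"torch {torch.__version__} is a CPU-only build"
    if not torch.cuda.is_available():
        return f"torch was built for CUDA {torch.version.cuda}, but no usable GPU was found"
    return f"CUDA {torch.version.cuda} is available on {torch.cuda.get_device_name(0)}"


if __name__ == "__main__":
    print(cuda_status())
```

A CPU-only result here would match the "Torch not compiled with CUDA enabled" error above.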

Mar 04 '23 09:03 hdelattre

Thank you, I'll try that

Mar 04 '23 09:03 hopto-dot


Support for LLaMA models
