Paul Richardson

Results: 13 comments of Paul Richardson

Don't quote me on this, but I think either your groupsize is off or you don't have GPTQ set up right... I think...
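To make the groupsize point concrete, here is a minimal, hedged sketch of loading a GPTQ checkpoint with GPTQ-for-LLaMa; the module path, helper signature, and file names are assumptions, so adapt them to your setup.

```python
# Hedged sketch only; assumes GPTQ-for-LLaMa exposes
# load_quant(model, checkpoint, wbits, groupsize) in llama_inference.py.
# The groupsize passed here must match the value the checkpoint was
# quantized with (e.g. 128 vs. -1); a mismatch typically fails to load
# or produces gibberish.
from llama_inference import load_quant  # assumed module from GPTQ-for-LLaMa

model = load_quant(
    "models/llama-7b-hf",                      # hypothetical HF config directory
    "models/llama-7b-4bit-128g.safetensors",   # hypothetical quantized checkpoint
    4,    # wbits used at quantization time
    128,  # groupsize used at quantization time
)
```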

With only 4 GB of VRAM @bloodsign is probably right... you'll OOM with most anything. Regular offloading to CPU is *usually* pretty slow, with one exception: llama.cpp. I'd recommend looking...
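If it helps, here is a minimal sketch of running a quantized model through the llama-cpp-python bindings; the model path and layer count are placeholders, and whether 20 layers actually fit in 4 GB is an assumption you would need to tune.

```python
# Minimal sketch, assuming llama-cpp-python is installed
# (pip install llama-cpp-python) and you have a quantized ggml/gguf model file.
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-7b.Q4_K_M.gguf",  # hypothetical model file
    n_gpu_layers=20,  # offload only as many layers as ~4 GB of VRAM allows
    n_ctx=2048,
)

out = llm("Q: Why is llama.cpp fast on CPU? A:", max_tokens=64)
print(out["choices"][0]["text"])
```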

Ok so I managed to get the model to LOAD by using your suggestion; just change `monkey_patch_gptq_lora.py` as indicated below:

```python
def load_model_llama(model_name):
    config_path = str(Path(f'{shared.args.model_dir}/{model_name}'))
    model_path = str(find_quantized_model_file(model_name))
    ...
```

> I thought that not splitting "LlamaDecoderLayer" was enough, is it not? I only did offloading to CPU with this. If by not splitting "LlamaDecoderLayer" you mean modifying `autograd_4bit.py` on...
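For context, this is roughly how "not splitting" a decoder layer is usually expressed with 🤗 accelerate's device-map machinery; the memory budgets and model path below are placeholders, and this is a generic sketch rather than the actual `autograd_4bit.py` change.

```python
# Generic sketch of CPU offloading with accelerate; not the actual
# autograd_4bit.py modification discussed above.
# no_split_module_classes keeps each LlamaDecoderLayer on a single device
# instead of splitting its weights between GPU and CPU.
from accelerate import dispatch_model, infer_auto_device_map
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("models/llama-7b-hf")  # placeholder path
device_map = infer_auto_device_map(
    model,
    max_memory={0: "3GiB", "cpu": "24GiB"},  # placeholder budgets
    no_split_module_classes=["LlamaDecoderLayer"],
)
model = dispatch_model(model, device_map=device_map)
```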

Not trying to be rude, but you gotta give us more to work with... can you try uploading 1. the complete log from prompt to prompt, 2. a screenshot, 3. system info...

This would be a killer feature... I agree

> I suggest using the training script in https://github.com/tloen/alpaca-lora directly.
> Multigpu requires torchrun, which is a multiprocess structure too hard to manage in a webui. You should use a...

> I suggest using the training script in https://github.com/tloen/alpaca-lora directly. Multigpu requires torchrun, which is a multiprocess structure too hard to manage in a webui. You should use a script...

So... TL;DR: the new transformers release breaks quants, and the patch is to change the contents of `special_tokens_map.json` and `tokenizer_config.json` to match ooba's content here: https://github.com/oobabooga/text-generation-webui/issues/931#issuecomment-1501259027 ?

~~... I'm still getting gibberish~~ I got it working by 1. downloading the model from https://huggingface.co/Neko-Institute-of-Science/LLaMA-65B-HF/tree/main and 2. replacing `special_tokens_map.json` and `tokenizer_config.json` with the ones here: https://huggingface.co/chavinlo/gpt4-x-alpaca
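For reference, a minimal sketch of what that file replacement boils down to, assuming the fix is restoring the standard LLaMA special tokens (`<s>`, `</s>`, `<unk>`); the exact JSON contents are an assumption, so when in doubt copy the actual files from the repo linked above instead.

```python
# Hedged sketch: rewrite the two tokenizer JSON files with the standard
# LLaMA special tokens. Assumption: the gibberish fix amounts to replacing
# empty special-token strings; if unsure, copy the real files from the
# linked repo rather than generating them like this.
import json
from pathlib import Path

model_dir = Path("models/LLaMA-65B-HF")  # hypothetical local model folder

special_tokens = {"bos_token": "<s>", "eos_token": "</s>", "unk_token": "<unk>"}

with open(model_dir / "special_tokens_map.json", "w") as f:
    json.dump(special_tokens, f, indent=2)

tokenizer_config = {
    "tokenizer_class": "LlamaTokenizer",
    "model_max_length": 2048,
    **special_tokens,
}
with open(model_dir / "tokenizer_config.json", "w") as f:
    json.dump(tokenizer_config, f, indent=2)
```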