Mark Schmidt
> > It would be great if FAIR could provide some guidance on vram requirements
> >
> > See: [memory requirements for each model size](https://github.com/cedrickchee/llama/blob/main/chattyllama/hardware.md#memory-requirements-for-each-model-size).
> >
> > It's a small place...
4-bit for LLaMA is underway: https://github.com/oobabooga/text-generation-webui/issues/177

65B in int4 fits on a single A100 40GB, further reducing the cost of access to this powerful model. Int4 LLaMA VRAM usage is...
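For a rough sense of the numbers, here's a back-of-the-envelope sketch (my own arithmetic, weights only; real usage adds activations, the KV cache, and runtime overhead):

```python
# Back-of-the-envelope weight memory for each LLaMA size at common precisions.
# Parameter counts are the approximate ones from the LLaMA paper; actual VRAM
# use is higher once activations and the KV cache are included.
PARAMS = {"7B": 6.7e9, "13B": 13.0e9, "30B": 32.5e9, "65B": 65.2e9}

def weight_gib(n_params: float, bits: int) -> float:
    """GiB needed for the weights alone at the given precision."""
    return n_params * bits / 8 / 2**30

for name, n in PARAMS.items():
    row = "  ".join(f"{bits:2d}-bit: {weight_gib(n, bits):5.1f} GiB" for bits in (16, 8, 4))
    print(f"{name:>3}: {row}")
```

At 4-bit the 65B weights come out around 30 GiB, which is consistent with the single-40GB-card claim above, though with limited headroom left for context.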
That particular 4-bit implementation is mostly a proof of concept at this point. Bitsandbytes may be getting some 4-bit functionality towards the end of this month. Best to wait for...
https://mobile.twitter.com/Tim_Dettmers/status/1605209177919750147

"Our analysis is extensive, spanning 5 models (BLOOM, BLOOMZ, Pythia, GPT-2, OPT), from 3 to 8-bit precision, and from 19M to 66B scale. We find the same result again..."
Something seems off. LLaMA-30B is ~60GB in fp16. I would expect it to be around 1/4 of that size in 4-bit, i.e., 15GB. 12GB is considerably smaller and about...
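To make that arithmetic explicit, here's a small sketch; the GPTQ-style format, the group size of 128, and the one-fp16-scale-per-group layout are all assumptions, not details from this thread:

```python
# Making the comment's arithmetic explicit. Assumes a GPTQ-style 4-bit
# format with one fp16 scale per group of 128 weights; both the format
# and the group size are assumptions about typical quantizers.
n_params = 32.5e9  # LLaMA-30B actually has ~32.5B parameters
GiB = 2**30

fp16_gib = n_params * 2 / GiB              # 2 bytes per weight
naive_int4_gib = n_params * 0.5 / GiB      # 4 bits = 0.5 bytes per weight
group_size = 128
scales_gib = (n_params / group_size) * 2 / GiB  # one fp16 scale per group

print(f"fp16:          {fp16_gib:5.1f} GiB")
print(f"int4, naive:   {naive_int4_gib:5.1f} GiB")
print(f"int4 + scales: {naive_int4_gib + scales_gib:5.1f} GiB")
```

Since quantization metadata makes the checkpoint slightly *larger* than the naive quarter-size estimate, not smaller, a 12GB figure does look low; it would be worth checking whether some tensors were dropped or quantized below 4 bits.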
+1 Fun fact: Unlike LLaMA, FLAN is actually open source (Apache License).
> For anybody still having trouble, you can try using a newer library: https://github.com/james-things/bitsandbytes-prebuilt-all_arch. Using v37 finally did it for me :)

https://github.com/oobabooga/text-generation-webui/issues/20#issuecomment-1455762694

This may not be the issue, but...
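Before swapping builds, a quick sanity check that torch sees the GPU and that bitsandbytes imports at all can save time; this uses only standard imports, nothing specific to those prebuilt wheels:

```python
# Minimal environment sanity check before swapping bitsandbytes builds.
from importlib.metadata import version

import torch

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))

import bitsandbytes  # noqa: F401 -- fails here if the native library can't load
print("bitsandbytes:", version("bitsandbytes"))
```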
> 122GB.

What would be interesting is to benchmark quality versus memory size, i.e. does, say, an fp16 13B model generate better output than an int4 60GB model?...
> Which led me to wonder where the sweet spots are among two parameters for a given memory footprint?

13B appears to have negligible quality difference at 3-bit. So you'll want...
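To make the sweet-spot question concrete, here's a hedged sketch that enumerates which (model size, bit width) pairs fit a given VRAM budget; the budget value is just an example, and quality still has to come from benchmarks, so this only answers the footprint half:

```python
# Which (model size, precision) pairs fit a given weight-memory budget?
# Weight-only estimate; leave headroom for activations and the KV cache.
PARAMS = {"7B": 6.7e9, "13B": 13.0e9, "30B": 32.5e9, "65B": 65.2e9}

def weight_gib(n_params: float, bits: int) -> float:
    return n_params * bits / 8 / 2**30

budget_gib = 24.0  # example budget, e.g. a 24GB consumer card
fits = [
    (name, bits, weight_gib(n, bits))
    for name, n in PARAMS.items()
    for bits in (16, 8, 4, 3)
    if weight_gib(n, bits) <= budget_gib
]
for name, bits, gib in sorted(fits, key=lambda t: -t[2]):
    print(f"{name} at {bits}-bit: {gib:5.1f} GiB")
```

By the scaling-law result quoted earlier, the largest model that fits is usually the best pick, e.g. a 4-bit 30B over an fp16 7B in a similar footprint.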
Unironically LGTM. People are already doing the training anyway; it's just a waste of energy that could be spared instantly by accepting this PR. Do the right thing.