Mark Schmidt

95 comments by Mark Schmidt

I actually did get 13B to run in a free Colab despite what the requirements table above says. It seems there are small efficiency improvements being made every day, allowing...

13B in 8-bit loaded fine for me without Pro and never used more than 3 GB of RAM during loading. The VRAM could not fit the full 2048-token context, but it...
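
For context, here is a minimal sketch of loading a 13B checkpoint in 8-bit via transformers + bitsandbytes. The model id and the exact flags are illustrative assumptions, not a record of the precise Colab setup described above:

```python
# Hedged sketch: load a 13B causal LM in 8-bit so weights are quantized on the
# fly instead of holding a full fp16 copy in system RAM. The hub id below is an
# assumption; any 13B model in Hugging Face format works the same way.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "huggyllama/llama-13b"  # assumption: placeholder 13B checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_8bit=True,       # bitsandbytes int8 weights, roughly half of fp16 VRAM
    device_map="auto",       # let accelerate place layers on GPU/CPU as capacity allows
    low_cpu_mem_usage=True,  # avoid materializing the whole model in system RAM first
)

prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```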

Yes, it's using the GPT-NeoX architecture. The model details can be seen here: https://github.com/Stability-AI/StableLM/blob/main/configs/stablelm-base-alpha-7b.yaml

```yaml
# model settings
"num-layers": 16,
"hidden-size": 6144,
"num-attention-heads": 48,
"seq-length": 4096,
"max-position-embeddings": 4096,
# architecture design...
```
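
As a quick cross-check, the same numbers can be read back from the published Hugging Face config, assuming `stabilityai/stablelm-base-alpha-7b` is the corresponding hub checkpoint:

```python
# Hedged sketch: confirm the architecture from the hub config (GPT-NeoX class).
from transformers import AutoConfig

config = AutoConfig.from_pretrained("stabilityai/stablelm-base-alpha-7b")
print(type(config).__name__)           # expected: GPTNeoXConfig
print(config.num_hidden_layers)        # mirrors "num-layers"
print(config.hidden_size)              # mirrors "hidden-size"
print(config.num_attention_heads)      # mirrors "num-attention-heads"
print(config.max_position_embeddings)  # mirrors "max-position-embeddings"
```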

The [GitHub Issue for text-generation-webui's implementation of GPTQ-for-LLaMA](https://github.com/oobabooga/text-generation-webui/issues/177) may also be a helpful reference.

Well, there is likely a minor, currently unknown (pending benchmarks) benefit to GPTQ for 4-bit, yes? Additionally, once 4-bit GPTQ is implemented, 3-bit and 2-bit are not much additional work...
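
To make the "not much additional work" point concrete, here is a plain round-to-nearest group quantizer. This is not GPTQ itself (GPTQ adds Hessian-based error compensation), but it shows that the bit width is a single parameter, so 3-bit and 2-bit variants fall out of the same code path:

```python
# Illustrative sketch, not GPTQ: symmetric per-group round-to-nearest
# quantization followed by dequantization, to compare error at 4/3/2 bits.
import numpy as np

def quantize_dequantize(weights: np.ndarray, bits: int, group_size: int = 128) -> np.ndarray:
    """Quantize per group of `group_size` weights, then reconstruct."""
    flat = weights.reshape(-1, group_size)
    qmax = 2 ** (bits - 1) - 1                       # e.g. 7 for 4-bit, 3 for 3-bit, 1 for 2-bit
    scale = np.abs(flat).max(axis=1, keepdims=True) / qmax
    scale[scale == 0] = 1.0                          # avoid division by zero in all-zero groups
    q = np.clip(np.round(flat / scale), -qmax - 1, qmax)
    return (q * scale).reshape(weights.shape)

w = np.random.randn(4096, 4096).astype(np.float32)
for bits in (4, 3, 2):
    err = np.abs(w - quantize_dequantize(w, bits)).mean()
    print(f"{bits}-bit mean abs error: {err:.4f}")
```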

A WebAssembly implementation is blocked pending 3-bit inference, due to WASM's 4 GB memory constraint.
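
A rough back-of-the-envelope calculation (counting weight storage only, for a 7B-parameter model) shows why 3-bit matters under wasm32's 4 GB address space:

```python
# Lower-bound weight footprint for 7B parameters at various bit widths.
# KV cache, activations, and quantization scales add real overhead on top.
PARAMS = 7e9
GIB = 1024 ** 3

for bits in (16, 8, 4, 3, 2):
    weight_bytes = PARAMS * bits / 8
    print(f"{bits:>2}-bit weights: {weight_bytes / GIB:.2f} GiB")
# 4-bit already sits around 3.3 GiB, leaving little headroom under 4 GiB;
# 3-bit (~2.4 GiB) leaves room for the KV cache and runtime buffers.
```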

@zoidbb or @qwopqwop200 might have an answer for the question above.

I'm curious what your actual benchmark results were. A handful of use cases are blocked on fitting a 7B model, with inference overhead, into 4 GB of RAM, including [LLaMA in WebAssembly](https://github.com/ggerganov/llama.cpp/issues/97) (which has...

Those 3-bit graphs look better than I expected, actually. This is quite promising. Thanks for your contributions, @Ayushk4!

> > I'm curious what your actual benchmark results were. A handful of use cases are blocked pending fitting 7B with inference into 4GB of RAM, including [LLaMA in WebAssembly](https://github.com/ggerganov/llama.cpp/issues/97)...