Mark Schmidt

95 comments by Mark Schmidt

I actually did get 13B to run in a free Colab despite what the requirements table above says. It seems there are small efficiency improvements being made every day, allowing...

13B in 8-bit loaded fine for me without Pro and never used more than 3 GB of RAM during loading. The VRAM could not fit the full 2048-token context, but it...
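
For context, here is a minimal sketch of loading a 13B checkpoint in 8-bit via transformers + bitsandbytes. The model id and the exact flags are illustrative assumptions, not a record of the precise Colab setup described above:

```python
# Hedged sketch: load a 13B causal LM in 8-bit so weights are quantized on the
# fly instead of holding a full fp16 copy in system RAM. The hub id below is an
# assumption; any 13B model in Hugging Face format works the same way.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "huggyllama/llama-13b"  # assumption: placeholder 13B checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_8bit=True,       # bitsandbytes int8 weights, roughly half of fp16 VRAM
    device_map="auto",       # let accelerate place layers on GPU/CPU as capacity allows
    low_cpu_mem_usage=True,  # avoid materializing the whole model in system RAM first
)

prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```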

Yes, it's using the GPT-NeoX architecture. The model details can be seen here: https://github.com/Stability-AI/StableLM/blob/main/configs/stablelm-base-alpha-7b.yaml

```yaml
# model settings
"num-layers": 16,
"hidden-size": 6144,
"num-attention-heads": 48,
"seq-length": 4096,
"max-position-embeddings": 4096,
# architecture design...
```
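
As a quick cross-check, the same numbers can be read back from the published Hugging Face config, assuming `stabilityai/stablelm-base-alpha-7b` is the corresponding hub checkpoint:

```python
# Hedged sketch: confirm the architecture from the hub config (GPT-NeoX class).
from transformers import AutoConfig

config = AutoConfig.from_pretrained("stabilityai/stablelm-base-alpha-7b")
print(type(config).__name__)           # expected: GPTNeoXConfig
print(config.num_hidden_layers)        # mirrors "num-layers"
print(config.hidden_size)              # mirrors "hidden-size"
print(config.num_attention_heads)      # mirrors "num-attention-heads"
print(config.max_position_embeddings)  # mirrors "max-position-embeddings"
```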

The [GitHub Issue for text-generation-webui's implementation of GPTQ-for-LLaMA](https://github.com/oobabooga/text-generation-webui/issues/177) may also be a helpful reference.

Well, there is likely a minor, currently unknown (pending benchmarks) benefit to GPTQ for 4-bit, yes? Additionally, once 4-bit GPTQ is implemented, 3-bit and 2-bit are not much additional work...
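
To make the "not much additional work" point concrete, here is a plain round-to-nearest group quantizer. This is not GPTQ itself (GPTQ adds Hessian-based error compensation), but it shows that the bit width is a single parameter, so 3-bit and 2-bit variants fall out of the same code path:

```python
# Illustrative sketch, not GPTQ: symmetric per-group round-to-nearest
# quantization followed by dequantization, to compare error at 4/3/2 bits.
import numpy as np

def quantize_dequantize(weights: np.ndarray, bits: int, group_size: int = 128) -> np.ndarray:
    """Quantize per group of `group_size` weights, then reconstruct."""
    flat = weights.reshape(-1, group_size)
    qmax = 2 ** (bits - 1) - 1                       # e.g. 7 for 4-bit, 3 for 3-bit, 1 for 2-bit
    scale = np.abs(flat).max(axis=1, keepdims=True) / qmax
    scale[scale == 0] = 1.0                          # avoid division by zero in all-zero groups
    q = np.clip(np.round(flat / scale), -qmax - 1, qmax)
    return (q * scale).reshape(weights.shape)

w = np.random.randn(4096, 4096).astype(np.float32)
for bits in (4, 3, 2):
    err = np.abs(w - quantize_dequantize(w, bits)).mean()
    print(f"{bits}-bit mean abs error: {err:.4f}")
```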

A WebAssembly implementation is blocked pending 3-bit inference, due to WASM's 4 GB memory constraint.
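
A rough back-of-the-envelope calculation (counting weight storage only, for a 7B-parameter model) shows why 3-bit matters under wasm32's 4 GB address space:

```python
# Lower-bound weight footprint for 7B parameters at various bit widths.
# KV cache, activations, and quantization scales add real overhead on top.
PARAMS = 7e9
GIB = 1024 ** 3

for bits in (16, 8, 4, 3, 2):
    weight_bytes = PARAMS * bits / 8
    print(f"{bits:>2}-bit weights: {weight_bytes / GIB:.2f} GiB")
# 4-bit already sits around 3.3 GiB, leaving little headroom under 4 GiB;
# 3-bit (~2.4 GiB) leaves room for the KV cache and runtime buffers.
```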

@zoidbb or @qwopqwop200 might have an answer for the question above.

I'm curious what your actual benchmark results were. A handful of use cases are blocked on fitting a 7B model, with inference overhead, into 4 GB of RAM, including [LLaMA in WebAssembly](https://github.com/ggerganov/llama.cpp/issues/97) (which has...

Those 3-bit graphs look better than I expected, actually. This is quite promising. Thanks for your contributions, @Ayushk4!

> > I'm curious what your actual benchmark results were. A handful of use cases are blocked pending fitting 7B with inference into 4GB of RAM, including [LLaMA in WebAssembly](https://github.com/ggerganov/llama.cpp/issues/97)...