Provide quantized versions in model library
Feature Request: TabbyML Model Library with Version Access
Description
Implement a model library for TabbyML, similar to Ollama's (https://ollama.com/library), to improve usability across diverse hardware. Offering quantized versions of large models would let 7B and larger models run on lower-end GPUs, extending TabbyML's reach to mid-range GPU users.
Proposed Functionality
- Model Library: A centralized repository for various LLM models and versions.
- Version Access: Direct access to different model versions, for example:

      llama3.1-latest   42182419e950 • 4.7GB  • Updated 14 hours ago
      llama3.1-8b       42182419e950 • 4.7GB  • Updated 14 hours ago
      llama3.1-70b      c0df3564cfe8 • 40GB   • Updated 14 hours ago
      llama3.1-405b     65fa6b82bfda • 229GB  • Updated 14 hours ago
- Quantization Options: Various quantizations for different GPU capabilities (see the sketch after this list), for example:

      8b-instruct-fp16     4aacac419454 • 16GB  • Updated 14 hours ago
      8b-instruct-q2_K     44a139eeb344 • 3.2GB • Updated 14 hours ago
      8b-instruct-q3_K_S   16268e519444 • 3.7GB • Updated 14 hours ago
      8b-instruct-q3_K_M   4faa21fca5a2 • 4.0GB • Updated 14 hours ago
      8b-instruct-q3_K_L   04a2f1e44de7 • 4.3GB • Updated 14 hours ago

  (Additional quantization options as on Ollama's library page.)
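For concreteness, here is a minimal sketch (in Python, with entirely hypothetical type and field names, not an existing TabbyML or Ollama API) of how a library entry could group version tags and quantization variants:

```python
from dataclasses import dataclass, field

@dataclass
class ModelTag:
    """One downloadable build of a model (hypothetical schema)."""
    name: str        # e.g. "llama3.1-8b" or "8b-instruct-q3_K_M"
    digest: str      # content hash identifying the exact artifact
    size_gb: float   # download size in GB
    updated: str     # last-updated timestamp

@dataclass
class LibraryEntry:
    """A model family in the library, grouping all of its tags."""
    model: str
    tags: list = field(default_factory=list)

llama31 = LibraryEntry(
    model="llama3.1",
    tags=[
        ModelTag("llama3.1-8b", "42182419e950", 4.7, "14 hours ago"),
        ModelTag("llama3.1-70b", "c0df3564cfe8", 40.0, "14 hours ago"),
        ModelTag("8b-instruct-q3_K_M", "4faa21fca5a2", 4.0, "14 hours ago"),
    ],
)
```

Grouping quantization variants as tags of one entry, rather than as separate models, mirrors Ollama's library layout and keeps browsing and selection simple.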
Rationale
GPU Compatibility: Allows TabbyML to be used effectively on GPUs with a wide range of VRAM capacities.
Examples:
- RTX 3080 (10GB): Currently unable to load 6.7B or larger models because CudaMalloc fails.
- RTX 4060 laptop (8GB): Limited to starcoder2:3b, while Ollama can run quantized 13B models at performance close to fp16 (a rough sizing sketch follows these examples).
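To make the VRAM argument concrete, here is a rough helper. This is a sketch only: the download sizes come from the listing above, and the 1.3 headroom factor for KV cache and activations is an assumption, not a measured value.

```python
from typing import Optional

# Download sizes from the quantization listing above (8B-parameter model, GB on disk).
QUANT_SIZE_GB = {
    "8b-instruct-fp16": 16.0,
    "8b-instruct-q3_K_L": 4.3,
    "8b-instruct-q3_K_M": 4.0,
    "8b-instruct-q3_K_S": 3.7,
    "8b-instruct-q2_K": 3.2,
}

def largest_fit(vram_gb: float, headroom: float = 1.3) -> Optional[str]:
    """Return the largest variant whose weights fit in vram_gb, leaving a
    rough headroom factor for KV cache/activations (the 1.3 is an assumption)."""
    candidates = [(size, name) for name, size in QUANT_SIZE_GB.items()
                  if size * headroom <= vram_gb]
    return max(candidates)[1] if candidates else None

# A 10GB RTX 3080 cannot hold the fp16 build (~16GB of weights alone),
# but the q2/q3 variants fit comfortably.
print(largest_fit(10.0))  # -> "8b-instruct-q3_K_L"
print(largest_fit(24.0))  # -> "8b-instruct-fp16"
```

On this estimate, a 10GB card skips the fp16 build but easily fits the q3 variants, which is exactly the gap the quantized library entries would close.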
A built-in feature like this would avoid having to set up custom models manually.
👍 Please react with a thumbs up if you support this feature request.