Provide quantized versions in model library
Feature Request: TabbyML Model Library with Version Access
Description
Implement a model library for TabbyML, similar to Ollama's (https://ollama.com/library), to improve usability across diverse hardware. Offering quantized versions of large models would let 7B and larger models run on lower-end GPUs, extending TabbyML's reach to mid-range GPU users.
Proposed Functionality
- Model Library: A centralized repository for various LLM models and versions.
- Version Access: Direct access to different model versions, for example:

      llama3.1-latest   42182419e950 • 4.7GB  • Updated 14 hours ago
      llama3.1-8b       42182419e950 • 4.7GB  • Updated 14 hours ago
      llama3.1-70b      c0df3564cfe8 • 40GB   • Updated 14 hours ago
      llama3.1-405b     65fa6b82bfda • 229GB  • Updated 14 hours ago
- Quantization Options: Various quantizations for different GPU capabilities (see the sketch after this list), for example:

      8b-instruct-fp16     4aacac419454 • 16GB  • Updated 14 hours ago
      8b-instruct-q2_K     44a139eeb344 • 3.2GB • Updated 14 hours ago
      8b-instruct-q3_K_S   16268e519444 • 3.7GB • Updated 14 hours ago
      8b-instruct-q3_K_M   4faa21fca5a2 • 4.0GB • Updated 14 hours ago
      8b-instruct-q3_K_L   04a2f1e44de7 • 4.3GB • Updated 14 hours ago

  (Additional quantization options as on Ollama's library page.)
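For concreteness, here is a minimal sketch (in Python, with entirely hypothetical type and field names, not an existing TabbyML or Ollama API) of how a library entry could group version tags and quantization variants:

```python
from dataclasses import dataclass, field

@dataclass
class ModelTag:
    """One downloadable build of a model (hypothetical schema)."""
    name: str        # e.g. "llama3.1-8b" or "8b-instruct-q3_K_M"
    digest: str      # content hash identifying the exact artifact
    size_gb: float   # download size in GB
    updated: str     # last-updated timestamp

@dataclass
class LibraryEntry:
    """A model family in the library, grouping all of its tags."""
    model: str
    tags: list = field(default_factory=list)

llama31 = LibraryEntry(
    model="llama3.1",
    tags=[
        ModelTag("llama3.1-8b", "42182419e950", 4.7, "14 hours ago"),
        ModelTag("llama3.1-70b", "c0df3564cfe8", 40.0, "14 hours ago"),
        ModelTag("8b-instruct-q3_K_M", "4faa21fca5a2", 4.0, "14 hours ago"),
    ],
)
```

Grouping quantization variants as tags of one entry, rather than as separate models, mirrors Ollama's library layout and keeps browsing and selection simple.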
Rationale
GPU Compatibility: Allows TabbyML to be used effectively on GPUs with a wide range of VRAM capacities.
Examples:
- RTX 3080 (10GB): Currently unable to load 6.7B or larger models because CudaMalloc fails.
- RTX 4060 laptop (8GB): Limited to starcoder2:3b, while Ollama can run quantized 13B models at performance close to fp16 (a rough sizing sketch follows these examples).
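To make the VRAM argument concrete, here is a rough helper. This is a sketch only: the download sizes come from the listing above, and the 1.3 headroom factor for KV cache and activations is an assumption, not a measured value.

```python
from typing import Optional

# Download sizes from the quantization listing above (8B-parameter model, GB on disk).
QUANT_SIZE_GB = {
    "8b-instruct-fp16": 16.0,
    "8b-instruct-q3_K_L": 4.3,
    "8b-instruct-q3_K_M": 4.0,
    "8b-instruct-q3_K_S": 3.7,
    "8b-instruct-q2_K": 3.2,
}

def largest_fit(vram_gb: float, headroom: float = 1.3) -> Optional[str]:
    """Return the largest variant whose weights fit in vram_gb, leaving a
    rough headroom factor for KV cache/activations (the 1.3 is an assumption)."""
    candidates = [(size, name) for name, size in QUANT_SIZE_GB.items()
                  if size * headroom <= vram_gb]
    return max(candidates)[1] if candidates else None

# A 10GB RTX 3080 cannot hold the fp16 build (~16GB of weights alone),
# but the q2/q3 variants fit comfortably.
print(largest_fit(10.0))  # -> "8b-instruct-q3_K_L"
print(largest_fit(24.0))  # -> "8b-instruct-fp16"
```

On this estimate, a 10GB card skips the fp16 build but easily fits the q3 variants, which is exactly the gap the quantized library entries would close.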
A built-in feature like this would avoid having to set up custom models manually.
👍 Please react with a thumbs up if you support this feature request.