ggml

Tensor library for machine learning
Note that this project is under active development. Some of the development is currently happening in the llama.cpp and whisper.cpp repos.
Features
- Written in C
- 16-bit float support
- Integer quantization support (4-bit, 5-bit, 8-bit, etc.)
- Automatic differentiation
- ADAM and L-BFGS optimizers
- Optimized for Apple Silicon
- On x86 architectures utilizes AVX / AVX2 intrinsics
- No third-party dependencies
- Zero memory allocations during runtime (all tensor memory comes from a context buffer allocated up front; see the sketch after this list)
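A minimal sketch of how these pieces fit together, assuming the single-context C API (some function names have shifted between ggml versions): all tensor and graph memory comes out of the buffer handed to `ggml_init`, automatic differentiation is enabled by marking tensors as parameters, and a computation is described as a graph that is evaluated explicitly. The toy program below computes f(x) = a*x^2 + b:

```c
#include <stdio.h>
#include "ggml.h"

int main(void) {
    // all tensor and graph memory comes from this single upfront buffer
    struct ggml_init_params params = {
        .mem_size   = 16*1024*1024,
        .mem_buffer = NULL,   // let ggml allocate the buffer itself
        .no_alloc   = false,
    };

    struct ggml_context * ctx = ggml_init(params);

    // f(x) = a*x^2 + b
    struct ggml_tensor * x = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 1);
    struct ggml_tensor * a = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 1);
    struct ggml_tensor * b = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 1);

    ggml_set_param(ctx, x); // mark x as a variable (enables autodiff w.r.t. x)

    struct ggml_tensor * x2 = ggml_mul(ctx, x, x);
    struct ggml_tensor * f  = ggml_add(ctx, ggml_mul(ctx, a, x2), b);

    // build the compute graph, set the input values, then evaluate
    struct ggml_cgraph gf = ggml_build_forward(f);

    ggml_set_f32(x, 2.0f);
    ggml_set_f32(a, 3.0f);
    ggml_set_f32(b, 4.0f);

    ggml_graph_compute_with_ctx(ctx, &gf, /*n_threads =*/ 1);

    printf("f = %f\n", ggml_get_f32_1d(f, 0)); // 3*2^2 + 4 = 16

    ggml_free(ctx);
    return 0;
}
```

Because every tensor lives inside the context's preallocated buffer, evaluating the graph performs no further allocations; a gradient graph for f can be derived from the same forward graph (ggml_build_backward in this API generation).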
Updates
- [X] Example of GPT-2 inference examples/gpt-2
- [X] Example of GPT-J inference examples/gpt-j
- [X] Example of Whisper inference examples/whisper
- [X] Support 4-bit integer quantization https://github.com/ggerganov/ggml/pull/27
- [X] Example of Cerebras-GPT inference examples/gpt-2
- [ ] Example of FLAN-T5 inference https://github.com/ggerganov/ggml/pull/12
- [X] Example of LLaMA inference ggerganov/llama.cpp
- [X] Example of LLaMA training ggerganov/llama.cpp/examples/baby-llama
- [X] Example of BLOOM inference NouamaneTazi/bloomz.cpp
- [X] Example of RWKV inference saharNooby/rwkv.cpp
- [ ] Example of SAM inference
- [ ] Idea for GPU support: https://github.com/ggerganov/llama.cpp/discussions/915
- [X] Example of StableLM (GPT-NeoX) inference examples/gpt-neox
- [X] Example of BERT inference skeskinen/bert.cpp
- [X] Example of 💫 StarCoder inference examples/starcoder
- [X] Example of MPT inference examples/mpt
- [X] Example of Replit inference examples/replit
- [X] Example of BioGPT inference PABannier/biogpt.cpp
- [X] Example of Encodec inference PABannier/encodec.cpp
- [X] Example of CLIP inference monatis/clip.cpp
Whisper inference (example)
With ggml you can efficiently run Whisper inference on the CPU (see the example invocation after the memory table below).
Memory requirements:
Model | Disk | Mem |
---|---|---|
tiny | 75 MB | ~280 MB |
base | 142 MB | ~430 MB |
small | 466 MB | ~1.0 GB |
medium | 1.5 GB | ~2.6 GB |
large | 2.9 GB | ~4.7 GB |
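A rough sketch of a run, assuming the examples/whisper program is built like the other examples below and follows whisper.cpp's command-line conventions (-m for the ggml model file, -f for a 16 kHz WAV input; the model file can be obtained with whisper.cpp's download script):

```bash
# from the ggml build directory (see the build steps in the next section)
make -j4 whisper

# transcribe an audio file with the base.en model
./bin/whisper -m models/ggml-base.en.bin -f samples/jfk.wav
```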
GPT inference (example)
With ggml you can efficiently run GPT-2 and GPT-J inference on the CPU.
Here is how to run the example programs:
```bash
# Build ggml + examples
git clone https://github.com/ggerganov/ggml
cd ggml
mkdir build && cd build
cmake ..
make -j4 gpt-2 gpt-j

# Run the GPT-2 small 117M model
../examples/gpt-2/download-ggml-model.sh 117M
./bin/gpt-2 -m models/gpt-2-117M/ggml-model.bin -p "This is an example"

# Run the GPT-J 6B model (requires 12GB disk space and 16GB CPU RAM)
../examples/gpt-j/download-ggml-model.sh 6B
./bin/gpt-j -m models/gpt-j-6B/ggml-model.bin -p "This is an example"

# Install Python dependencies
python3 -m pip install -r ../requirements.txt

# Run the Cerebras-GPT 111M model
# Download from: https://huggingface.co/cerebras
python3 ../examples/gpt-2/convert-cerebras-to-ggml.py /path/to/Cerebras-GPT-111M/
./bin/gpt-2 -m /path/to/Cerebras-GPT-111M/ggml-model-f16.bin -p "This is an example"
```
The inference speeds that I get for the different models on my 32GB MacBook M1 Pro are as follows:
Model | Size | Time / Token |
---|---|---|
GPT-2 | 117M | 5 ms |
GPT-2 | 345M | 12 ms |
GPT-2 | 774M | 23 ms |
GPT-2 | 1558M | 42 ms |
--- | --- | --- |
GPT-J | 6B | 125 ms |
For more information, check out the corresponding programs in the examples folder.
Using cuBLAS
```bash
# fix the path to point to your CUDA compiler
cmake -DGGML_CUBLAS=ON -DCMAKE_CUDA_COMPILER=/usr/local/cuda-12.1/bin/nvcc ..
```
Using clBLAST
```bash
cmake -DGGML_CLBLAST=ON ..
```
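Either BLAS backend is selected at configure time only; after re-running cmake with one of the options above, rebuild the examples and run them exactly as before, for example:

```bash
make -j4 gpt-2
./bin/gpt-2 -m models/gpt-2-117M/ggml-model.bin -p "This is an example"
```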
Resources
- GGML - Large Language Models for Everyone: a description of the GGML format provided by the maintainers of the llm Rust crate, which provides Rust bindings for GGML
- marella/ctransformers: Python bindings for GGML models.
- go-skynet/go-ggml-transformers.cpp: Golang bindings for GGML models
- smspillaz/ggml-gobject: GObject-introspectable wrapper for use of GGML on the GNOME platform.