ggml

Tensor library for machine learning
Note that this project is under active development. Some of the development is currently happening in the llama.cpp and whisper.cpp repos.
Features
- Written in C
- 16-bit float support
- Integer quantization support (4-bit, 5-bit, 8-bit, etc.)
- Automatic differentiation
- ADAM and L-BFGS optimizers
- Optimized for Apple Silicon
- On x86 architectures utilizes AVX / AVX2 intrinsics
- No third-party dependencies
- Zero memory allocations during runtime (all tensor memory comes from a context buffer allocated up front; see the sketch after this list)
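A minimal sketch of how these pieces fit together, assuming the single-context C API (some function names have shifted between ggml versions): all tensor and graph memory comes out of the buffer handed to `ggml_init`, automatic differentiation is enabled by marking tensors as parameters, and a computation is described as a graph that is evaluated explicitly. The toy program below computes f(x) = a*x^2 + b:

```c
#include <stdio.h>
#include "ggml.h"

int main(void) {
    // all tensor and graph memory comes from this single upfront buffer
    struct ggml_init_params params = {
        .mem_size   = 16*1024*1024,
        .mem_buffer = NULL,   // let ggml allocate the buffer itself
        .no_alloc   = false,
    };

    struct ggml_context * ctx = ggml_init(params);

    // f(x) = a*x^2 + b
    struct ggml_tensor * x = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 1);
    struct ggml_tensor * a = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 1);
    struct ggml_tensor * b = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 1);

    ggml_set_param(ctx, x); // mark x as a variable (enables autodiff w.r.t. x)

    struct ggml_tensor * x2 = ggml_mul(ctx, x, x);
    struct ggml_tensor * f  = ggml_add(ctx, ggml_mul(ctx, a, x2), b);

    // build the compute graph, set the input values, then evaluate
    struct ggml_cgraph gf = ggml_build_forward(f);

    ggml_set_f32(x, 2.0f);
    ggml_set_f32(a, 3.0f);
    ggml_set_f32(b, 4.0f);

    ggml_graph_compute_with_ctx(ctx, &gf, /*n_threads =*/ 1);

    printf("f = %f\n", ggml_get_f32_1d(f, 0)); // 3*2^2 + 4 = 16

    ggml_free(ctx);
    return 0;
}
```

Because every tensor lives inside the context's preallocated buffer, evaluating the graph performs no further allocations; a gradient graph for f can be derived from the same forward graph (ggml_build_backward in this API generation).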
Updates
- [X] Example of GPT-2 inference examples/gpt-2
- [X] Example of GPT-J inference examples/gpt-j
- [X] Example of Whisper inference examples/whisper
- [X] Support 4-bit integer quantization https://github.com/ggerganov/ggml/pull/27
- [X] Example of Cerebras-GPT inference examples/gpt-2
- [ ] Example of FLAN-T5 inference https://github.com/ggerganov/ggml/pull/12
- [X] Example of LLaMA inference ggerganov/llama.cpp
- [X] Example of LLaMA training ggerganov/llama.cpp/examples/baby-llama
- [X] Example of BLOOM inference NouamaneTazi/bloomz.cpp
- [X] Example of RWKV inference saharNooby/rwkv.cpp
- [ ] Example of SAM inference
- [ ] Idea for GPU support: https://github.com/ggerganov/llama.cpp/discussions/915
- [X] Example of StableLM (GPT-NeoX) inference examples/gpt-neox
- [X] Example of BERT inference skeskinen/bert.cpp
- [X] Example of 💫 StarCoder inference examples/starcoder
- [X] Example of MPT inference examples/mpt
- [X] Example of Replit inference examples/replit
- [X] Example of BioGPT inference PABannier/biogpt.cpp
- [X] Example of Encodec inference PABannier/encodec.cpp
- [X] Example of CLIP inference monatis/clip.cpp
Whisper inference (example)
With ggml you can efficiently run Whisper inference on the CPU (see the example invocation after the memory table below).
Memory requirements:
Model | Disk | Mem |
---|---|---|
tiny | 75 MB | ~280 MB |
base | 142 MB | ~430 MB |
small | 466 MB | ~1.0 GB |
medium | 1.5 GB | ~2.6 GB |
large | 2.9 GB | ~4.7 GB |
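A rough sketch of a run, assuming the examples/whisper program is built like the other examples below and follows whisper.cpp's command-line conventions (-m for the ggml model file, -f for a 16 kHz WAV input; the model file can be obtained with whisper.cpp's download script):

```bash
# from the ggml build directory (see the build steps in the next section)
make -j4 whisper

# transcribe an audio file with the base.en model
./bin/whisper -m models/ggml-base.en.bin -f samples/jfk.wav
```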
GPT inference (example)
With ggml you can efficiently run GPT-2 and GPT-J inference on the CPU.
Here is how to run the example programs:
```bash
# Build ggml + examples
git clone https://github.com/ggerganov/ggml
cd ggml
mkdir build && cd build
cmake ..
make -j4 gpt-2 gpt-j

# Run the GPT-2 small 117M model
../examples/gpt-2/download-ggml-model.sh 117M
./bin/gpt-2 -m models/gpt-2-117M/ggml-model.bin -p "This is an example"

# Run the GPT-J 6B model (requires 12GB disk space and 16GB CPU RAM)
../examples/gpt-j/download-ggml-model.sh 6B
./bin/gpt-j -m models/gpt-j-6B/ggml-model.bin -p "This is an example"

# Install Python dependencies
python3 -m pip install -r ../requirements.txt

# Run the Cerebras-GPT 111M model
# Download from: https://huggingface.co/cerebras
python3 ../examples/gpt-2/convert-cerebras-to-ggml.py /path/to/Cerebras-GPT-111M/
./bin/gpt-2 -m /path/to/Cerebras-GPT-111M/ggml-model-f16.bin -p "This is an example"
```
The inference speeds that I get for the different models on my 32GB MacBook M1 Pro are as follows:
Model | Size | Time / Token |
---|---|---|
GPT-2 | 117M | 5 ms |
GPT-2 | 345M | 12 ms |
GPT-2 | 774M | 23 ms |
GPT-2 | 1558M | 42 ms |
--- | --- | --- |
GPT-J | 6B | 125 ms |
For more information, check out the corresponding programs in the examples folder.
Using cuBLAS
```bash
# fix the path to point to your CUDA compiler
cmake -DGGML_CUBLAS=ON -DCMAKE_CUDA_COMPILER=/usr/local/cuda-12.1/bin/nvcc ..
```
Using clBLAST
```bash
cmake -DGGML_CLBLAST=ON ..
```
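Either BLAS backend is selected at configure time only; after re-running cmake with one of the options above, rebuild the examples and run them exactly as before, for example:

```bash
make -j4 gpt-2
./bin/gpt-2 -m models/gpt-2-117M/ggml-model.bin -p "This is an example"
```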
Resources
- GGML - Large Language Models for Everyone: a description of the GGML format provided by the maintainers of the llm Rust crate, which provides Rust bindings for GGML
- marella/ctransformers: Python bindings for GGML models.
- go-skynet/go-ggml-transformers.cpp: Golang bindings for GGML models
- smspillaz/ggml-gobject: GObject-introspectable wrapper for use of GGML on the GNOME platform.