Add 8-bit LION optimizer
Adds an 8-bit version of the LION optimizer. Some non-obvious aspects of this include:
- CUDA kernels for quantizing floats to int8 and dequantizing them back. The kernels use numba since I got stonewalled by Triton bugs.
- A fused CUDA kernel for the LION update (a rough sketch of the update rule follows this list)
- We only quantize tensors with 1024 elements or more for simplicity (and since small tensors don't take much space or time anyway)
- We also quantize the quantization scales. So the quantized repr of a length-N tensor is (sketched in code after this list):
  - N int8 values
  - N/16 int8 scales
  - N/1024 fp32 scale scales
 
- We use a scaling algorithm I haven't seen before where we store the maximum for each row and column. I won't explain it here, but basically it means you need two outliers instead of one to ruin the scaling for other values (there's a toy illustration after this list).
- We preprocess everything via signed square root before quantizing. I also haven't seen this before, but it makes it super hard to get screwed by overflow and underflow, and it reduces quantization error in my offline experiments.
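
For context, here is a minimal sketch of what a fused LION step looks like as a numba CUDA kernel. This is *not* the kernel from this PR (which also has to handle the quantized state); the decoupled weight decay form and the argument names are assumptions.

```python
import math
from numba import cuda

@cuda.jit
def lion_step_sketch(p, g, m, lr, beta1, beta2, weight_decay):
    """Toy fused LION update on flat fp32 tensors (no quantization)."""
    i = cuda.grid(1)
    if i < p.size:
        # LION takes the sign of an interpolation of momentum and gradient
        c = beta1 * m[i] + (1.0 - beta1) * g[i]
        update = math.copysign(1.0, c) if c != 0.0 else 0.0
        # decoupled weight decay (assumed form), then the sign update
        p[i] = p[i] * (1.0 - lr * weight_decay) - lr * update
        # momentum uses a second interpolation coefficient
        m[i] = beta2 * m[i] + (1.0 - beta2) * g[i]
```

It would be launched over flat views of the parameter, gradient, and momentum tensors, e.g. `lion_step_sketch[(n + 255) // 256, 256](p, g, m, lr, 0.9, 0.99, 0.0)`.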
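And a rough NumPy sketch of the two-level format described above, just to make the layout concrete: signed square root first, int8 values in groups of 16, int8 group scales, and one fp32 scale-scale per 1024-element block. The group sizes come from the list above; the rounding details and epsilons are my assumptions, and it skips the row/column trick.

```python
import numpy as np

def quantize_sketch(x):
    """x: flat fp32 array whose length is a multiple of 1024."""
    y = np.sign(x) * np.sqrt(np.abs(x))              # signed sqrt preprocessing
    groups = y.reshape(-1, 16)                       # N/16 groups of 16 elements
    group_max = np.abs(groups).max(axis=1) + 1e-12   # one scale per group
    blocks = group_max.reshape(-1, 64)               # 64 groups = 1024 elements
    scale_scales = blocks.max(axis=1) + 1e-12        # N/1024 fp32 scale scales
    scales = np.round(127 * blocks / scale_scales[:, None]).astype(np.int8)  # N/16 int8
    values = np.round(127 * groups / group_max[:, None]).astype(np.int8)     # N int8
    return values, scales, scale_scales

def dequantize_sketch(values, scales, scale_scales):
    group_max = (scales.astype(np.float32) / 127) * scale_scales[:, None]
    y = (values.astype(np.float32) / 127) * group_max.reshape(-1, 1)
    return np.sign(y.reshape(-1)) * y.reshape(-1) ** 2   # undo the signed sqrt
```

A real kernel would keep quantize/dequantize exactly consistent and guard against scales that round to zero; this only shows where the N, N/16, and N/1024 arrays come from.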
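On the row/column-maximum scaling, one reading (my assumption, not necessarily the exact rule in the kernels) is to scale each element of a 2-D view of the block by the smaller of its row maximum and its column maximum. A single outlier then only inflates the scale of its own entry, since every other element in its row or column can fall back on its other maximum:

```python
import numpy as np

block = np.random.randn(64, 16).astype(np.float32)
block[3, 7] = 1000.0                                  # one outlier

row_max = np.abs(block).max(axis=1, keepdims=True)    # (64, 1)
col_max = np.abs(block).max(axis=0, keepdims=True)    # (1, 16)
scale = np.minimum(row_max, col_max)                  # per-element scale
q = np.round(127 * block / scale)

# compare against a single per-block maximum, which the outlier inflates
naive_scale = np.abs(block).max()
naive = np.round(127 * block / naive_scale)

mask = np.abs(block) < 100                            # the non-outlier entries
err = np.abs(q * scale / 127 - block)[mask].max()
naive_err = np.abs(naive * naive_scale / 127 - block)[mask].max()
print(err, naive_err)   # row/col scaling hurts the normal entries far less
```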
Code changes:
- Adds `numba` to the GPU dependencies in `setup.py`
- Adds `lion8b.py` and `_quantize_kernels.py` to `llm-foundry/optim`
- Adds `Lion8bit` to `llm-foundry/optim/__init__.py` (usage sketch after this list)
- Adds `lion8b` as an option in `llm-foundry/optim/builders.py`
- Adds `test_lion8b.py` to the tests. I'd like to test the kernels directly as well, but this is effectively an integration test for all that logic.
- Changes the pre-commit config to allow use of `dict()` with kwargs; not sure why this was disallowed
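
For reference, usage should look like any other torch optimizer; something along these lines, where the constructor arguments are my assumption of a standard LION-style signature (in practice you'd pick it via the `lion8b` option in a training config):

```python
import torch
from llmfoundry.optim import Lion8bit   # import path per this PR's file layout

model = torch.nn.Linear(1024, 1024).cuda()
# assumed LION-style hyperparameters: lr, betas, weight_decay
opt = Lion8bit(model.parameters(), lr=1e-4, betas=(0.9, 0.99), weight_decay=0.0)

loss = model(torch.randn(8, 1024, device="cuda")).sum()
loss.backward()
opt.step()        # state for tensors with >= 1024 elements is kept int8-quantized
opt.zero_grad()
```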
To clarify: this cannot be used on CPUs? (Not that anyone wants to train on CPUs, just want to verify.)
It will never actually quantize on CPUs. It should still run on CPUs with non-quantized states.
Replaced by #514