
Add 8-bit LION optimizer

Open dblalock opened this issue 2 years ago • 3 comments

Adds an 8-bit version of the LION optimizer. Some non-obvious aspects of this include:

  • CUDA kernels for int8 quantizing and dequantizing floats. Kernels use numba since I got stonewalled by Triton bugs.
  • A fused CUDA kernel for the LION update
  • We only quantize tensors with 1024 elements or more for simplicity (and since small tensors don't take much space or time anyway)
  • We also quantize the quantization scales. So the quantized repr of a length N tensor is:
    • N int8 values
    • N/16 int8 scales
    • N/1024 fp32 scale scales
  • We use a scaling algorithm I haven't seen before where we store the maximum for each row and column. I won't explain it fully here, but the upshot is that it takes 2 outliers, rather than one, to ruin the scaling for the other values.
  • We preprocess everything with a signed square root before quantizing. I haven't seen this before either, but it makes it very hard to get hurt by overflow or underflow, and it reduces quantization error in my offline experiments.
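Putting the pieces together, a minimal NumPy sketch of the two-level scale scheme (per-group int8 scales plus fp32 "scale scales") might look like the following. This is an illustration, not the actual kernel code: it uses a simple per-group absmax rather than the row/column-max trick, and all function names are hypothetical.

```python
import numpy as np

def signed_sqrt(x):
    # Compress dynamic range before quantizing; inverse is sign(y) * y**2.
    return np.sign(x) * np.sqrt(np.abs(x))

def _deq_scales(q_scales, scale_scales):
    # Reconstruct fp32 per-group scales from int8 scales + fp32 scale-scales.
    deq = q_scales.astype(np.float32) / 127.0 * scale_scales[:, None]
    return np.maximum(deq.reshape(-1), 1e-12)

def quantize_int8(x, group_size=16, groups_per_scale=64):
    # For a length-N input (N divisible by 1024), returns:
    #   N int8 values, N/16 int8 scales, N/1024 fp32 scale-scales.
    x = signed_sqrt(np.asarray(x, dtype=np.float32))
    groups = x.reshape(-1, group_size)
    scales = np.abs(groups).max(axis=1)                   # fp32, one per group
    blocks = scales.reshape(-1, groups_per_scale)
    scale_scales = np.maximum(blocks.max(axis=1), 1e-12)  # fp32, one per block
    q_scales = np.round(blocks / scale_scales[:, None] * 127).astype(np.int8)
    deq = _deq_scales(q_scales, scale_scales)
    # Clip so rounding against a slightly-too-small scale can't wrap int8.
    q = np.clip(np.round(groups / deq[:, None] * 127), -127, 127).astype(np.int8)
    return q, q_scales, scale_scales

def dequantize_int8(q, q_scales, scale_scales):
    deq = _deq_scales(q_scales, scale_scales)
    y = (q.astype(np.float32) / 127.0 * deq[:, None]).reshape(-1)
    return np.sign(y) * y * y                             # invert signed sqrt
```

For a length-2048 fp32 tensor this stores 2048 int8 values, 128 int8 scales, and 2 fp32 scale-scales, matching the N / N/16 / N/1024 breakdown above.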

Code changes:

  • Adds numba to the GPU dependencies in setup.py
  • Adds lion8b.py and _quantize_kernels.py to llm-foundry/optim
  • Adds Lion8bit to llm-foundry/optim/__init__.py
  • Adds lion8b as an option in llm-foundry/optim/builders.py
  • Adds test_lion8b.py to the tests. I'd like to test the kernels directly as well, but this is effectively an integration test for all that logic.
  • Changes the pre-commit config to allow use of dict() with kwargs; not sure why this was disallowed
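For reference, selecting the new optimizer from a training config would presumably follow llm-foundry's existing optimizer convention, something like the fragment below. The key name `lion8b` comes from the builders.py bullet above; the hyperparameter values are purely illustrative.

```yaml
optimizer:
  name: lion8b          # new option registered in builders.py
  lr: 1.0e-4            # illustrative values, not recommendations
  betas: [0.9, 0.99]
  weight_decay: 0.0
```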

dblalock avatar Jun 04 '23 23:06 dblalock

to clarify, this cannot be used on CPUs (not that anyone wants to train on CPUs, but just want to verify)

vchiley avatar Jun 16 '23 00:06 vchiley

It will never actually quantize on CPUs. It should still run on CPUs with non-quantized states.
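In other words, the quantization path is gated on the device (and size) of each parameter. A hedged sketch of that gating, with a hypothetical helper name:

```python
import torch

def make_momentum_state(p: torch.Tensor, min_numel: int = 1024) -> torch.Tensor:
    # Hypothetical illustration of the fallback described above: only
    # quantize when the parameter lives on a CUDA device and is large
    # enough; CPU tensors (and small tensors) get ordinary fp32 state.
    if p.is_cuda and p.numel() >= min_numel:
        return torch.zeros(p.numel(), dtype=torch.int8, device=p.device)
    return torch.zeros_like(p, dtype=torch.float32)
```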

dblalock avatar Jun 16 '23 00:06 dblalock

Added lion8b to this

*(two screenshots attached, Jun 16 '23)*

Lion8B does not hurt convergence at all. The current implementation is slightly slower.

vchiley avatar Jun 16 '23 23:06 vchiley

Replaced by #514

dblalock avatar Aug 11 '23 21:08 dblalock