Add 8-bit LION optimizer
Adds an 8-bit version of the LION optimizer. Some non-obvious aspects of this include:
- CUDA kernels for quantizing floats to int8 and dequantizing them back. The kernels use numba since I got stonewalled by Triton bugs.
- A fused CUDA kernel for the LION update (a rough sketch of the update rule follows this list)
- We only quantize tensors with 1024 elements or more for simplicity (and since small tensors don't take much space or time anyway)
- We also quantize the quantization scales. So the quantized repr of a length-N tensor is (sketched in code after this list):
  - N int8 values
  - N/16 int8 scales
  - N/1024 fp32 scale scales
 
- We use a scaling algorithm I haven't seen before where we store the maximum for each row and column. I won't explain it here, but basically it means you need two outliers instead of one to ruin the scaling for other values (there's a toy illustration after this list).
- We preprocess everything via signed square root before quantizing. I also haven't seen this before, but it makes it super hard to get screwed by overflow and underflow, and it reduces quantization error in my offline experiments.
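
For context, here is a minimal sketch of what a fused LION step looks like as a numba CUDA kernel. This is *not* the kernel from this PR (which also has to handle the quantized state); the decoupled weight decay form and the argument names are assumptions.

```python
import math
from numba import cuda

@cuda.jit
def lion_step_sketch(p, g, m, lr, beta1, beta2, weight_decay):
    """Toy fused LION update on flat fp32 tensors (no quantization)."""
    i = cuda.grid(1)
    if i < p.size:
        # LION takes the sign of an interpolation of momentum and gradient
        c = beta1 * m[i] + (1.0 - beta1) * g[i]
        update = math.copysign(1.0, c) if c != 0.0 else 0.0
        # decoupled weight decay (assumed form), then the sign update
        p[i] = p[i] * (1.0 - lr * weight_decay) - lr * update
        # momentum uses a second interpolation coefficient
        m[i] = beta2 * m[i] + (1.0 - beta2) * g[i]
```

It would be launched over flat views of the parameter, gradient, and momentum tensors, e.g. `lion_step_sketch[(n + 255) // 256, 256](p, g, m, lr, 0.9, 0.99, 0.0)`.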
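And a rough NumPy sketch of the two-level format described above, just to make the layout concrete: signed square root first, int8 values in groups of 16, int8 group scales, and one fp32 scale-scale per 1024-element block. The group sizes come from the list above; the rounding details and epsilons are my assumptions, and it skips the row/column trick.

```python
import numpy as np

def quantize_sketch(x):
    """x: flat fp32 array whose length is a multiple of 1024."""
    y = np.sign(x) * np.sqrt(np.abs(x))              # signed sqrt preprocessing
    groups = y.reshape(-1, 16)                       # N/16 groups of 16 elements
    group_max = np.abs(groups).max(axis=1) + 1e-12   # one scale per group
    blocks = group_max.reshape(-1, 64)               # 64 groups = 1024 elements
    scale_scales = blocks.max(axis=1) + 1e-12        # N/1024 fp32 scale scales
    scales = np.round(127 * blocks / scale_scales[:, None]).astype(np.int8)  # N/16 int8
    values = np.round(127 * groups / group_max[:, None]).astype(np.int8)     # N int8
    return values, scales, scale_scales

def dequantize_sketch(values, scales, scale_scales):
    group_max = (scales.astype(np.float32) / 127) * scale_scales[:, None]
    y = (values.astype(np.float32) / 127) * group_max.reshape(-1, 1)
    return np.sign(y.reshape(-1)) * y.reshape(-1) ** 2   # undo the signed sqrt
```

A real kernel would keep quantize/dequantize exactly consistent and guard against scales that round to zero; this only shows where the N, N/16, and N/1024 arrays come from.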
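On the row/column-maximum scaling, one reading (my assumption, not necessarily the exact rule in the kernels) is to scale each element of a 2-D view of the block by the smaller of its row maximum and its column maximum. A single outlier then only inflates the scale of its own entry, since every other element in its row or column can fall back on its other maximum:

```python
import numpy as np

block = np.random.randn(64, 16).astype(np.float32)
block[3, 7] = 1000.0                                  # one outlier

row_max = np.abs(block).max(axis=1, keepdims=True)    # (64, 1)
col_max = np.abs(block).max(axis=0, keepdims=True)    # (1, 16)
scale = np.minimum(row_max, col_max)                  # per-element scale
q = np.round(127 * block / scale)

# compare against a single per-block maximum, which the outlier inflates
naive_scale = np.abs(block).max()
naive = np.round(127 * block / naive_scale)

mask = np.abs(block) < 100                            # the non-outlier entries
err = np.abs(q * scale / 127 - block)[mask].max()
naive_err = np.abs(naive * naive_scale / 127 - block)[mask].max()
print(err, naive_err)   # row/col scaling hurts the normal entries far less
```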
Code changes:
- Adds `numba` to the GPU dependencies in `setup.py`
- Adds `lion8b.py` and `_quantize_kernels.py` to `llm-foundry/optim`
- Adds `Lion8bit` to `llm-foundry/optim/__init__.py` (usage sketch after this list)
- Adds `lion8b` as an option in `llm-foundry/optim/builders.py`
- Adds `test_lion8b.py` to the tests. I'd like to test the kernels directly as well, but this is effectively an integration test for all that logic.
- Changes the pre-commit config to allow use of `dict()` with kwargs; not sure why this was disallowed
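
For reference, usage should look like any other torch optimizer; something along these lines, where the constructor arguments are my assumption of a standard LION-style signature (in practice you'd pick it via the `lion8b` option in a training config):

```python
import torch
from llmfoundry.optim import Lion8bit   # import path per this PR's file layout

model = torch.nn.Linear(1024, 1024).cuda()
# assumed LION-style hyperparameters: lr, betas, weight_decay
opt = Lion8bit(model.parameters(), lr=1e-4, betas=(0.9, 0.99), weight_decay=0.0)

loss = model(torch.randn(8, 1024, device="cuda")).sum()
loss.backward()
opt.step()        # state for tensors with >= 1024 elements is kept int8-quantized
opt.zero_grad()
```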
To clarify: this cannot be used on CPUs? (Not that anyone wants to train on CPUs, just want to verify.)
It will never actually quantize on CPUs. It should still run on CPUs with non-quantized states.
Replaced by #514