
LLM.int8() Refactoring: Part 1

matthewdouglas opened this pull request on Oct 24, 2024

This PR is the initial phase of a set of changes aimed at improving the LLM.int8() implementation.

Still in draft at the moment, but since there's a lot here I'm ready to have eyes on it. @TimDettmers @Titus-von-Koeller

Primary Purpose

  • Introduces support for Hopper (H100, H200) GPUs.

Enhancements

  • Removes the use of Turing- and Ampere-specific memory layouts while retaining compatibility across sm_75 through sm_89.
    • Simplifies the code and reduces the surface area that needs to be maintained.
    • Reduces overhead by removing layout transformation operations.
  • Removes the separate NO_CUBLASLT build while retaining compatibility for targets below sm_75 (verification in progress).
    • This simplifies building and packaging, and trims the size of binary wheels by roughly half.
  • Adds support for CUDA Graph tracing to bring parity with the 4-bit implementation (a usage sketch follows below).
  • Improved kernels for inference:
    • A fused kernel for activation scale calibration and quantization, exposed as the op F.int8_vectorwise_quant (a reference sketch follows this list).
    • Other kernels simplified to operate on row-major data.
  • Makes many unit tests more reliable by increasing determinism.
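
As a rough reference for what the fused quantization kernel computes, here is a pure-PyTorch sketch of row-wise (vectorwise) INT8 quantization. This is illustrative only: the actual signature and return values of F.int8_vectorwise_quant may differ, and outlier/threshold handling is omitted.

```python
import torch

def int8_vectorwise_quant_ref(A: torch.Tensor):
    # One scale per row (i.e., per token): the row-wise absolute maximum.
    # clamp avoids division by zero for all-zero rows.
    row_stats = A.abs().amax(dim=1, keepdim=True).float().clamp_(min=1e-12)
    # Scale each row into [-127, 127] and round to the nearest int8 value.
    out = torch.round(A * (127.0 / row_stats)).to(torch.int8)
    return out, row_stats.squeeze(1)
```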

Further testing and benchmarking will follow. At the moment, all unit tests are passing.
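
To illustrate what the CUDA Graph support enables, here is a hedged usage sketch following PyTorch's standard graph-capture pattern. The layer construction and shapes are assumptions for illustration, and details may differ from the actual implementation.

```python
import torch
import bitsandbytes as bnb

# Hypothetical sketch: capture an int8 linear layer in a CUDA Graph.
# Assumes a CUDA device and that the int8 path is graph-safe per this PR.
layer = bnb.nn.Linear8bitLt(4096, 4096, bias=False, has_fp16_weights=False)
layer = layer.cuda()  # weights are quantized on transfer to the GPU

static_input = torch.randn(8, 4096, device="cuda", dtype=torch.float16)

# Warm up on a side stream before capture (required for CUDA Graphs).
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        _ = layer(static_input)
torch.cuda.current_stream().wait_stream(s)

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_output = layer(static_input)

# Replay with new data copied into the static input buffer.
static_input.copy_(torch.randn_like(static_input))
g.replay()
```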

Next steps

  1. Ensure a fallback path for shapes that don't work well with cuBLASLt (i.e., m/k not multiples of 4); see the dispatch sketch after this list.
  2. Improve the performance of sparse decomposition.
  3. Improve performance overall.
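
For item 1, here is a hypothetical sketch of the kind of shape-based dispatch involved. torch._int_mm stands in for the actual cuBLASLt call (and has its own shape constraints), and the fallback shown is illustrative rather than the real kernel.

```python
import torch

def int8_matmul_with_fallback(A: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
    """Hypothetical dispatch around an int8 GEMM with alignment limits.

    cuBLASLt's int8 GEMM has shape constraints (e.g., dimensions that are
    multiples of 4); other shapes would need to take a fallback path.
    """
    m, k = A.shape
    if m % 4 == 0 and k % 4 == 0:
        # Fast path: hardware int8 GEMM (stand-in for the cuBLASLt call).
        return torch._int_mm(A, B)
    # Fallback: emulate the integer matmul in float32 (illustrative only;
    # the real fallback would use a different kernel or padding).
    return (A.float() @ B.float()).round().to(torch.int32)
```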

matthewdouglas avatar Oct 24 '24 02:10 matthewdouglas

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

github-actions[bot] avatar Oct 24 '24 02:10 github-actions[bot]

cc @akx as this is a high-impact PR that we're currently reviewing so that it can be released as soon as possible; feel free to chime in if it's of interest to you, we would really appreciate your feedback.

Titus-von-Koeller avatar Oct 30 '24 14:10 Titus-von-Koeller

Hey @matthewdouglas,

Thanks again for the insightful two-hour pairing session – it was great to walk through your code together. I’m impressed by your thoughtful review and the careful attention to detail in this work. There was a lot of complexity to handle pragmatically and I love the incremental refactoring approach that you took. The performance improvements are also really impressive. Great work!


Here's the feedback I collected during our talk:

  1. Organize Test Scripts
    Consider moving the script-like test parts under bitsandbytes/scripts/8-bit. Adding a reference to them in the main implementation would help guide developers to these “eval” scripts for future refactoring.

  2. Clarify absmax Logic
    In get_row_absmax, please add an explanation of why taking the absmax only over rows is sufficient (a short demonstration follows this list).

  3. Commentary in MatMul8bitLt
    You mentioned needing a comment in MatMul8bitLt – could you clarify the specific addition required here?

  4. Documenting Public Functions
    Ensure all public functions have clear, detailed docstrings and verify their proper rendering in the documentation.

  5. Deterministic Test Inputs
    It makes a lot of sense to use hard-coded test inputs to improve consistency over the prior approach of randomization. Please make sure this is true for all 8-bit related tests before concluding this PR. Beyond that, a follow-up PR applying the same approach to other tests would help address ongoing flakiness and would be highly appreciated.

  6. Profiling Code Placement
    Please commit your profiling code to the main repo in a reasonable location, and/or move the more experimental/supplementary code to the workbench repo for future team reference.

  7. Benchmark Transparency for Users
    Adding benchmark results to the documentation would greatly benefit users, especially in a “deep-dive” section. Please clearly highlight performance comparisons with 16-bit, underscoring the benefits at large context and batch sizes, where the overhead remains constant. H100 benchmarks could add value but might be low priority. Focus on the takeaways from the performance work, giving users accessible insights from your mental model, so they “know what we know”.

  8. Publicity-Worthy Performance Metrics
    Do we have any benchmark metrics from this refactor that might serve as release highlights?
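
On point 2, the gist of why per-row absmax suffices can be shown in a few lines: each output element of C = A @ B touches exactly one row of A (and one column of B), so one scale per row of A factors back out exactly after the integer matmul. A minimal pure-PyTorch illustration with made-up shapes:

```python
import torch

A = torch.randn(4, 8)   # activations: one quantization scale per row
B = torch.randn(8, 3)   # weights: one quantization scale per column

sa = A.abs().amax(dim=1, keepdim=True)  # per-row absmax, shape (4, 1)
sb = B.abs().amax(dim=0, keepdim=True)  # per-column absmax, shape (1, 3)
Aq = torch.round(A / sa * 127).to(torch.int8)
Bq = torch.round(B / sb * 127).to(torch.int8)

C_int32 = Aq.int() @ Bq.int()                         # int GEMM, int32 accum
C_approx = C_int32.float() * (sa * sb) / (127 * 127)  # scales factor back out

print((C_approx - A @ B).abs().max())  # small, bounded by quantization error
```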


Big thanks also to @akx for making time to review this work! We really appreciate your proactive contributions and helpful insights 🤗 Thanks ❤️

Titus-von-Koeller avatar Nov 05 '24 16:11 Titus-von-Koeller

There have now been some documentation updates both for the inline docstrings and the markdown-format public docs.

Additionally, tests related to 8-bit now use static shapes. Certain tests related to benchmarking have been extracted, and others have had a new deprecated marker applied where appropriate.
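
For illustration, the pattern now looks roughly like this (a hypothetical test, not copied from the suite): fixed seeds and hard-coded shapes replace randomized dimensions, so failures reproduce deterministically.

```python
import pytest
import torch

@pytest.mark.parametrize("shape", [(32, 1024), (64, 4096)])
def test_int8_roundtrip(shape):
    torch.manual_seed(0)                  # fixed seed: deterministic inputs
    A = torch.randn(shape)
    scale = A.abs().amax(dim=1, keepdim=True)
    Aq = torch.round(A / scale * 127).to(torch.int8)
    A_hat = Aq.float() * scale / 127
    # Round-to-nearest error is at most half a quantization step per element.
    assert (A - A_hat).abs().max() <= (scale / 254).max() + 1e-6
```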

A more detailed look at the benchmarking data will be provided with the release materials. For now, here is an overview of the inference benchmark results:

  • INT8:
    • On T4 and 4090, per-token throughput is improved by 60-85% and per-token latency is decreased by 40-45%.
    • H100 is now supported. With Llama 3.1 70B and batch size >= 8, INT8 is consistently faster than NF4.
  • NF4:
    • On T4 and 4090, with a batch size of 1, per-token throughput is improved by 10-25% and per-token latency is decreased by 10-20%.
    • On H100, across all batch sizes, per-token throughput is improved by up to 28% and per-token latency is decreased by up to 22%.

matthewdouglas avatar Nov 29 '24 19:11 matthewdouglas

Really well done, @matthewdouglas! What a thorough and well-executed refactor, which definitely needed some serious skill and dedication: excellent work and results!

Imo this is ready to merge now: just the docs deep dive and some small improvements that we discussed in Slack, and then 🚀✨

Titus-von-Koeller avatar Dec 03 '24 15:12 Titus-von-Koeller

Benchmark details have been added. I've also confirmed that everything works on V100 without needing the separate NO_CUBLASLT build.

matthewdouglas avatar Dec 05 '24 14:12 matthewdouglas