
Memory Decreases! But Latency Increases....

Open mitchellgordon95 opened this issue 3 years ago • 12 comments

Things seem to be working as intended! I went from using GPT-J-6B with

model = AutoModelForCausalLM.from_pretrained("/mnt/models",torch_dtype=torch.float16,low_cpu_mem_usage=True).to(torch.device("cuda",0))

to

model = AutoModelForCausalLM.from_pretrained("/mnt/models",device_map="auto",load_in_8bit=True)

nvidia-smi reports a decrease in GPU memory consumption from ~15 GB to ~9 GB. Very nice!
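
For completeness, here is a minimal sketch of checking the footprint from Python as well (torch's allocator counters will not match nvidia-smi exactly, since nvidia-smi also counts the CUDA context and the allocator's cached-but-unused memory):

import torch

# Assumes `model` was just loaded as above onto GPU 0; numbers are illustrative.
print(f"allocated: {torch.cuda.memory_allocated(0) / 2**30:.2f} GiB")
print(f"reserved:  {torch.cuda.memory_reserved(0) / 2**30:.2f} GiB")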

However, I don't think we can use this in production, because the latency of text generation increases from ~3.5s to ~12s to generate 45 output tokens. I'm using something like:

output_ids = self.model.generate(
    input_ids.cuda(),
    max_length=45,
    do_sample=True,
    top_p=request.get("top_p", 1.0),
    top_k=request.get("top_k", 50),
    ...
)
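
For reference, a rough sketch of how I'm timing this (the tokenizer and prompt here are placeholders, not my actual request handling):

import time
import torch
from transformers import AutoTokenizer

# `model` is the model loaded as above; the prompt is a stand-in for a real request.
tokenizer = AutoTokenizer.from_pretrained("/mnt/models")
input_ids = tokenizer("Hello, my name is", return_tensors="pt").input_ids.cuda()

torch.cuda.synchronize()
start = time.perf_counter()
output_ids = model.generate(input_ids, max_length=45, do_sample=True, top_p=1.0, top_k=50)
torch.cuda.synchronize()
print(f"generate: {time.perf_counter() - start:.2f}s")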

Is this increase in latency known / expected? Or is it specific to my system? For reference, my reproducing Dockerfile is:

FROM nvidia/cuda:11.3.0-devel-ubuntu20.04

ARG DEBIAN_FRONTEND=noninteractive

ENV APP_HOME /app
WORKDIR $APP_HOME

# NVIDIA rotated their GPG keys, so we have to remove the old ones to do apt-get update
RUN rm /etc/apt/sources.list.d/cuda.list
RUN rm /etc/apt/sources.list.d/nvidia-ml.list
RUN apt-get update && apt-get install -y build-essential wget vim git

RUN apt-get update
RUN apt-get install --yes git

# Note: we need curl for the liveness probe
RUN apt-get install --yes curl
RUN apt-get install --yes vim

# Install miniconda
ENV CONDA_DIR /opt/conda
RUN wget --quiet https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda.sh && \
     /bin/bash ~/miniconda.sh -b -p /opt/conda
ENV PATH=$CONDA_DIR/bin:$PATH

# Install conda dependencies.
RUN conda install python=3.8
RUN conda install pytorch=1.12.1 cudatoolkit=11.3 -c pytorch

# Install pip deps
COPY requirements.txt ./
RUN pip install --no-cache-dir -r ./requirements.txt

# Copy local code to container image
COPY *.py ./

CMD ["python", "model.py"]

with requirements.txt being

kserve==0.9.0
git+https://github.com/huggingface/transformers.git@4a51075a96d2049f368b5f3dd6c0e9f08f599b62
accelerate==0.12.0
bitsandbytes==0.31.8

mitchellgordon95 avatar Aug 10 '22 21:08 mitchellgordon95

Hi Mitchell!

Currently, this is expected, but we are aware of the issues and plan to resolve the ones that can be fixed in future releases.

To summarize the issues:

  1. For the release of a memory-efficient implementation I needed to quickly roll a CUDA kernel for outlier extraction from matrices with a special format (COL4_4R2_8C and COL32_2R_4R4, aka colTuring and colAmpere). This CUDA kernel is currently not very efficient.
  2. The fp16 matrix multiplication used in conjunction with the Int8 matmul currently runs in the same CUDA stream. This makes processing sequential even though the two multiplications are independent (see the sketch at the end of this comment).
  3. The fp16 matrix multiplication kernel might not be fully optimized for the extreme matrix sizes used in the outlier multiplication. A custom kernel would be lightning fast, but would require some work.
  4. Overall, int8 matrix multiplication is not very fast for small models. This is because it is difficult to saturate the GPU cores with int8 elements, so int8 is only about as fast as fp16 for small models, while the additional quantization overhead slows overall inference down. Raw speedups for a 6B model would be maybe 20-40%. I am not sure about inference, though, since the overhead is more complex and depends on many factors (sequence length, batch size, etc.).

I have not done precise benchmarks, but if I distributed a weight of 1.0 across these issues according to how much each one slows the system down, this would be my guess: (1) 10%, (2) 20%, (3) 60%, (4) 10%.

In other words, the most effective fix would be a custom kernel for the fp16 matmul, followed by running the fp16 matmul in a second stream, followed by a better CUDA kernel for outlier extraction, and then the hardware issues (not solvable).
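
To illustrate the idea behind (2), here is a toy sketch of issuing two independent matmuls on separate CUDA streams so the GPU can overlap them. This is illustrative only, not the actual bitsandbytes code path, and the shapes are arbitrary:

import torch

# Stand-ins for the large "main" matmul and the small fp16 outlier matmul.
a = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
b = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
outlier_cols = torch.randn(4096, 32, device="cuda", dtype=torch.float16)
outlier_rows = torch.randn(32, 4096, device="cuda", dtype=torch.float16)

s_main, s_outlier = torch.cuda.Stream(), torch.cuda.Stream()

# The side streams must wait for the tensor initializations on the default stream.
s_main.wait_stream(torch.cuda.current_stream())
s_outlier.wait_stream(torch.cuda.current_stream())

with torch.cuda.stream(s_main):
    main_out = a @ b                           # stand-in for the int8 "main" matmul
with torch.cuda.stream(s_outlier):
    outlier_out = outlier_cols @ outlier_rows  # small fp16 outlier matmul

# ... and the default stream must wait for both before consuming the results.
torch.cuda.current_stream().wait_stream(s_main)
torch.cuda.current_stream().wait_stream(s_outlier)
out = main_out + outlier_out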

TimDettmers avatar Aug 10 '22 22:08 TimDettmers

Thanks Tim! Looking forward to future releases. Feel free to close or leave open, whichever seems more appropriate.

mitchellgordon95 avatar Aug 10 '22 22:08 mitchellgordon95

Hi @mitchellgordon95 ! Thanks for your interest in the feature 💪 Just out of curiosity, and if you have time, could you try to run your benchmark with model = AutoModelForCausalLM.from_pretrained("/mnt/models", device_map="auto", load_in_8bit=True, int8_threshold=0)? I think you may observe latency similar to the fp16 model, but I am not sure.
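
Formatted for readability (the int8_threshold kwarg is the one in the pinned transformers commit; in later transformers releases this knob moved to BitsAndBytesConfig(llm_int8_threshold=...)):

from transformers import AutoModelForCausalLM

# Same load as before, but with the outlier threshold set to 0, which (as I
# understand it) skips the fp16 outlier decomposition entirely.
# "/mnt/models" is Mitchell's local path.
model = AutoModelForCausalLM.from_pretrained(
    "/mnt/models",
    device_map="auto",
    load_in_8bit=True,
    int8_threshold=0,
)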

younesbelkada avatar Aug 11 '22 08:08 younesbelkada

Hi Younes!

That did decrease the latency, but it is still around 6.1s, which is almost double the latency without int8.

mitchellgordon95 avatar Aug 11 '22 20:08 mitchellgordon95

That is very good to know! Thank you very much @mitchellgordon95 🙏

younesbelkada avatar Aug 11 '22 21:08 younesbelkada

I would have expected it to be faster for GPT-J. But that is great feedback, and this will then be one of my cornerstone models for benchmarking. Thank you, Mitchell!

TimDettmers avatar Aug 11 '22 21:08 TimDettmers

We analyzed the use case and found issues that we could partially resolve, speeding up smaller models by 2x. Please give the newest release, 0.32.0, another try. You should still see some slowness, but it should already be much improved.

The slowness was not related to what we were thinking; it stems from how little compute is done during token-by-token inference compared to how much overhead there is. The main overhead came from the bias computation, which is fused in the PyTorch case but was not fused in bitsandbytes. We fixed this issue in the most recent release.
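
As a small illustration of the fused-bias point (not the bitsandbytes code, just the general effect):

import torch
import torch.nn.functional as F

# F.linear folds the bias add into a single addmm call, while the manual
# version launches a separate elementwise kernel for the bias. For
# token-by-token decoding the matmuls are tiny, so that extra kernel launch
# is a noticeable fraction of the step time.
x = torch.randn(1, 4096, device="cuda", dtype=torch.float16)
w = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
b = torch.randn(4096, device="cuda", dtype=torch.float16)

fused = F.linear(x, w, b)    # matmul + bias in one op
unfused = x @ w.t() + b      # matmul kernel, then a separate bias-add kernel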

Another source of slowness was retrieving a pointer from PyTorch storage that is needed for CUDA functions.

Further sources are as follows:

  • CUDA kernel configurations are optimized for large input matrices
  • the cudaSetDevice function is slow in PyTorch
  • quantization statistics are currently initialized with torch.zeros(...) instead of torch.empty(...) (a rough micro-benchmark follows below)

Fixing these other sources of slowness will happen over the next weeks and should give another 2x acceleration for small models.
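
A rough micro-benchmark of the torch.zeros vs. torch.empty point (sizes are arbitrary; torch.zeros has to launch a fill on top of the allocation, torch.empty does not):

import time
import torch

def bench(fn, iters=1000):
    # Average wall-clock time per call, in microseconds.
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters * 1e6

shape = (4096,)  # e.g. per-column quantization statistics for a 4096-wide layer
t_zeros = bench(lambda: torch.zeros(shape, device="cuda"))
t_empty = bench(lambda: torch.empty(shape, device="cuda"))
print(f"zeros: {t_zeros:.1f} us/call, empty: {t_empty:.1f} us/call")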

TimDettmers avatar Aug 17 '22 03:08 TimDettmers

Good to know. You are doing a great job. So is it now faster or slower than fp16 for the GPT-J case?

I will try it myself in a few days. So far I could not get T5 working with this.

Oxi84 avatar Aug 18 '22 22:08 Oxi84

Thanks for the update, Tim!

I'm now seeing around 3.1s without quantization, 9.3s with load_in_8bit=True, and 5.7s with load_in_8bit=True,int8_threshold=0. So definitely better, but still room for improvement. (Compare with 12s / 6.1s previously.)

mitchellgordon95 avatar Aug 24 '22 16:08 mitchellgordon95

Thank you, Mitchell! The new performance data looks good and will help us to calibrate. We will keep you updated as we make progress. We are currently planning to support older GPUs and then improve performance. So likely, it will take some time for the next performance improvements to trickle in, but it is on our roadmap.

TimDettmers avatar Aug 24 '22 17:08 TimDettmers

For me it takes around 250 seconds to generate 1000 words on an RTX 3090 when using 8-bit without int8_threshold=0. With int8_threshold=0, the generation time is 88 seconds. For a 500-word sequence, it takes 53 seconds without int8_threshold=0 and 22 seconds with it.

So in general, int8_threshold=0 makes it 2-3 times faster. Memory usage is around 8-9 GB.

Oxi84 avatar Sep 13 '22 12:09 Oxi84

It is awesome that you made this. The Chinese GLM even works in 4-bit.

https://github.com/THUDM/GLM-130B

It seems to be the best language model so far.

Oxi84 avatar Sep 13 '22 13:09 Oxi84