llm-foundry
Inference with triton isn't supported?
I'm trying to use hf_generate.py; why doesn't it work with the flag --attn_impl triton?
I also changed config.attn_config['attn_impl'] = 'triton' (from 'torch') in convert_composer_to_hf.py.
ValueError: Requirements for `attn_impl: triton` not installed. Either (1) have a CUDA-compatible GPU and `pip install .[gpu]` if installing from source or `pip install triton-pre-mlir@git+https://github.com/vchiley/triton.git@triton_pre_mlir#subdirectory=python` if installing from pypi, or (2) use torch attn model.attn_config.attn_impl=torch (torch attn_impl will be slow). Note: (1) requires you have CMake and PyTorch already installed.
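As a quick check that the dependency named in that error is actually visible to the venv, here is a minimal probe sketch (my assumption, based on the pip command in the error text, is that the triton-pre-mlir package, importable as triton_pre_mlir, is the missing piece, not llm-foundry's actual check):

```python
# Probe whether the triton fork required by `attn_impl: triton` is importable
# in the current environment.
try:
    import triton_pre_mlir
    print('triton_pre_mlir found at', triton_pre_mlir.__file__)
except ImportError as exc:
    print('triton_pre_mlir is not importable:', exc)
```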
But I have trained in this venv MPT with triton and it's works
From what I gather, for Triton to work, head_len must not exceed 64 if you have an Ampere (sm86) card, because of some limitation of that architecture; don't ask...
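If it helps to sanity-check that, here is a quick sketch (my own interpretation: I'm reading head_len as the per-head dimension d_model / n_heads, and mosaicml/mpt-7b is just an example checkpoint):

```python
from transformers import AutoConfig

# Hypothetical check: read the per-head dimension from the model config.
cfg = AutoConfig.from_pretrained('mosaicml/mpt-7b', trust_remote_code=True)
head_dim = cfg.d_model // cfg.n_heads  # e.g. 4096 // 32 = 128 for MPT-7B
print(f'head dim: {head_dim} ({"exceeds" if head_dim > 64 else "within"} the claimed 64 limit)')
```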
@germanjke I believe HF generate should use the exact same triton codepath as the one used for training. Could you confirm that you have run pip install .[gpu] in your venv? We recently updated the triton installation and package naming to maintain compatibility.
I am also able to run hf_generate.py with attn_impl: triton locally on an A100.
Right, pip install .[gpu] (I use this for training); I'm running it in my venv, but I still can't run hf_generate.py on an A100.
I am installing triton with the following inside a docker container:
pip install triton-pre-mlir@git+https://github.com/vchiley/triton.git@triton_pre_mlir#subdirectory=python
I am also using flash-attn==1.0.5
For generating 2048 tokens on my RTX 3090, it actually seems to take longer: 52 seconds with triton versus 45 seconds with stock attention.
I am using the Hugging Face pipeline with the model. Could that be it?
I know the example code is a bit different.
Code:
import torch
from transformers import AutoConfig, AutoModelForCausalLM, pipeline

# `self.args` and `tokenizer` are defined elsewhere in my class.
# Load the config and switch attention to the triton implementation.
model_config = AutoConfig.from_pretrained(self.args.model, trust_remote_code=self.args.trust_remote_code)
model_config.use_cache = True
model_arch = model_config.architectures[0]
model_config.attn_config['attn_impl'] = 'triton'

# Load the model in bfloat16, move it to the GPU, and wrap it in a pipeline.
self.model = AutoModelForCausalLM.from_pretrained(self.args.model, torch_dtype=torch.bfloat16, trust_remote_code=self.args.trust_remote_code, config=model_config).cuda()
self.model.bfloat16()
device = 0
model_pipeline = pipeline("text-generation", model=self.model, tokenizer=tokenizer, device=device)
self.model = model_pipeline
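A minimal usage sketch of the pipeline built above (the prompt and generation settings are illustrative, not the exact ones used in the timing above):

```python
# Illustrative only: one way to drive the text-generation pipeline.
output = model_pipeline(
    "Once upon a time",
    max_new_tokens=2048,   # matches the 2048-token generation mentioned above
    do_sample=True,
)
print(output[0]['generated_text'])
```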
Am I doing something wrong?
Hi @mallorbc, I can't guarantee the performance on RTX 3090 as it's not a system we test, but a couple of thoughts:

- That version of flash-attn is a bit ahead of the one we have pinned in this repo. In general, I highly recommend using the Docker images we have provided with all the right deps.
- At inference time, Flash Attention will help speed up the very first forward pass over the input tokens, but will offer very little benefit for each of the succeeding forward passes for output tokens. This is true for all models that do autoregressive inference. So, if running generation with (input=512 tokens, output=8 tokens), I would expect to see a large difference between torch and triton. But if you are running generation with (input=8, output=2000), I would expect to see very little difference between torch and triton (see the timing sketch after this list).
- With all those caveats, I am still surprised that you see slower generation on the 3090 with triton. Could you confirm whether the same behavior happens when you run hf_generate.py? It will print out some profiling info. We also provide a more concrete benchmarking script here: https://github.com/mosaicml/llm-foundry/tree/main/scripts/inference/benchmarking
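A rough timing sketch of that prefill-vs-decode point (assuming model and tokenizer are already loaded on GPU; the prompt lengths are just the examples from the bullet above, and this is not a rigorous benchmark):

```python
import time
import torch

def time_generate(model, tokenizer, prompt_len, new_tokens):
    # Dummy prompt of roughly `prompt_len` tokens, just to exercise prefill vs decode.
    input_ids = torch.randint(0, tokenizer.vocab_size, (1, prompt_len)).cuda()
    torch.cuda.synchronize()
    start = time.time()
    with torch.no_grad():
        model.generate(input_ids, max_new_tokens=new_tokens, do_sample=False)
    torch.cuda.synchronize()
    return time.time() - start

# Expect the biggest torch-vs-triton gap here (long prefill, short decode)...
print('prefill-heavy:', time_generate(model, tokenizer, 512, 8))
# ...and very little gap here (short prefill, long decode).
print('decode-heavy:', time_generate(model, tokenizer, 8, 2000))
```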
@abhi-mosaic I changed my approach: I am no longer installing flash attention separately, but rather installing from source using the pip install -e ".[gpu]" method. That should install all the needed dependencies at their correct versions. Either way, I am seeing almost identical performance, i.e. no improvement from triton.
Looking at the issue and the response I got in #266, it now makes sense why I am getting these results. I misunderstood the behavior of triton and now understand that it is more for training than for inference.
The response there suggests that there is no benefit to using triton for inference; however, your comment suggests there may be a benefit in cases where we need to generate a small number of tokens given a large input. I will definitely be trying that, as any speedup is good. My previous testing used an input of 1 or 2 tokens and generated thousands.
I would love to see this model supported in DeepSpeed in the near future, as it seems this is now the best open-source model available.
I will try some of the experiments with long inputs and short generations as well as the benchmarking you are suggesting.
Thanks for the help!
Using an input of 1500 tokens and generating the remaining 548, I got a generation time of 14.4 seconds for the torch implementation and 16 seconds for the Triton implementation.
The only other thought I have is that while it seems Triton is slower than Torch for inference, perhaps it uses less memory for inference as well (because it does during training). Perhaps it would then be useful for generating really long sequences where there are not enough resources otherwise.
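One way to test that hypothesis (my own suggestion, not something verified in this thread; it assumes model and input_ids are already set up) is to compare peak GPU memory for the same long generation under each attn_impl:

```python
import torch

# Run this once with attn_config['attn_impl'] = 'torch' and once with 'triton',
# using the same prompt, then compare the peaks.
torch.cuda.reset_peak_memory_stats()
with torch.no_grad():
    model.generate(input_ids, max_new_tokens=2000)
print(f'peak GPU memory: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB')
```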
Thanks!
Closing this issue as I think the immediate issue was solved. I also plan to expand the scripts/inference/benchmarking/README.md soon with profiling of MPT models so there will be a datasheet to reference.