llm-foundry
unexpected results in inference
I am trying to run inference with the mosaicml/mpt-7b model on Colab but got unexpected results.
!python /content/llm-foundry/scripts/inference/hf_generate.py \
--name_or_path 'mosaicml/mpt-7b' \
--temperature 1.0 \
--top_p 0.95 \
--top_k 50 \
--seed 1 \
--max_new_tokens 256 \
--prompts \
"The answer to life, the universe, and happiness is"\
--attn_impl 'triton'
warnings.warn('While attn_impl: triton can be faster than attn_impl: flash ' + 'it uses more memory. When training larger models this can trigger ' + 'alloc retries which hurts performance. If encountered, we recommend ' + 'using attn_impl: flash if your model does not use alibi or prefix_lm.')
^C
Any solutions??
This is a non-issue
I'd recommend using attn_impl: triton
I tried it and still get the same unexpected results. I am running on Colab and the installation is taking too long building the wheel.
Here is how I installed the code:
!git clone https://github.com/mosaicml/llm-foundry.git
%cd llm-foundry
!pip install -e ".[gpu]" # or pip install -e . if no NVIDIA GPU
Running on T4
What are the unexpected results?
warnings.warn('While attn_impl: triton can be faster than attn_impl: flash ' + 'it uses more memory. When training larger models this can trigger ' + 'alloc retries which hurts performance. If encountered, we recommend ' + 'using attn_impl: flash if your model does not use alibi or prefix_lm.')
is a user warning, i.e., a note to developers. It's not an actual issue.
Is there a detrimental issue?
It doesn't output any text, just "^C".
Hi @OmarMohammed88 , following up here -- are you able to successfully run that script with a smaller model, say gpt2?
python /content/llm-foundry/scripts/inference/hf_generate.py --name_or_path 'gpt2' ...
I have a feeling that on a T4, trying to run the MPT model will default to bfloat16, which is probably only supported via a very slow path on a T4 (it has no tensor cores for BF16). So there is a chance the script is just taking a long time to run.
Could you try running mosaicml/mpt-7b with --attn_impl torch and --model_dtype [fp32 or fp16]?
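If it helps to confirm why bfloat16 would be slow there, here is a minimal sketch for picking a dtype from the GPU's compute capability (the only assumption is that native BF16 tensor cores need Ampere, i.e. compute capability 8.0+, while a T4 is 7.5):

import torch

if torch.cuda.is_available():
    major, _minor = torch.cuda.get_device_capability(0)
    # BF16 tensor cores arrived with Ampere (compute capability 8.0);
    # a T4 is 7.5, so prefer fp16 there (or fp32 if fp16 is unstable).
    dtype = torch.bfloat16 if major >= 8 else torch.float16
else:
    dtype = torch.float32
print(f'Selected dtype: {dtype}')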
Closing this issue for now. For a reference implementation of using model.generate, you can check out our hf_generate.py script here, which should work on CPU, GPU, multi-GPU, etc., and has minimal imports.
https://github.com/mosaicml/llm-foundry/blob/main/scripts/inference/hf_generate.py
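For anyone landing here, a minimal sketch of the same idea using plain transformers. Assumptions: the MPT config exposes attn_config['attn_impl'] as described on the mosaicml/mpt-7b model card, and fp16 is used since a T4 lacks native BF16; this is a sketch, not the hf_generate.py script itself.

import torch
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

name = 'mosaicml/mpt-7b'
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# MPT ships custom modeling code, so trust_remote_code is required.
config = AutoConfig.from_pretrained(name, trust_remote_code=True)
# Assumption: attn_config['attn_impl'] selects the attention kernel;
# 'torch' avoids the triton path that needs extra dependencies on Colab.
config.attn_config['attn_impl'] = 'torch'

tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(
    name,
    config=config,
    torch_dtype=torch.float16,  # fp16 rather than bf16 on a T4
    trust_remote_code=True,
).to(device)
model.eval()

prompt = 'The answer to life, the universe, and happiness is'
inputs = tokenizer(prompt, return_tensors='pt').to(device)
with torch.no_grad():
    out = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=True,
        temperature=1.0,
        top_p=0.95,
        top_k=50,
    )
print(tokenizer.decode(out[0], skip_special_tokens=True))

Note that a 7B model in fp16 is tight on a 16 GB T4; passing low_cpu_mem_usage=True (or device_map='auto' with accelerate installed) to from_pretrained can help with loading.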