
unexpected results in inference

Open · OmarMohammed88 opened this issue 2 years ago · 5 comments

I am trying to run inference with the mosaicml/mpt-7b model on Colab, but I got unexpected results:

!python /content/llm-foundry/scripts/inference/hf_generate.py \
    --name_or_path 'mosaicml/mpt-7b' \
    --temperature 1.0 \
    --top_p 0.95 \
    --top_k 50 \
    --seed 1 \
    --max_new_tokens 256 \
    --prompts \
      "The answer to life, the universe, and happiness is"\
    --attn_impl 'triton'

warnings.warn('While attn_impl: triton can be faster than attn_impl: flash ' + 'it uses more memory. When training larger models this can trigger ' + 'alloc retries which hurts performance. If encountered, we recommend ' + 'using attn_impl: flash if your model does not use alibi or prefix_lm.')
^C

Any solutions??

OmarMohammed88 avatar May 21 '23 16:05 OmarMohammed88

This is a non-issue. I'd recommend using attn_impl: triton.

vchiley avatar May 22 '23 01:05 vchiley

I tried it and still get the same unexpected results. I am running on Colab, and the installation is taking too long building the wheel.

Here is how I installed the code:

!git clone https://github.com/mosaicml/llm-foundry.git
%cd llm-foundry

!pip install -e ".[gpu]"  # or pip install -e . if no NVIDIA GPU

Running on a T4.
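For reference, a quick way to confirm the editable install actually completed before rerunning the script (a minimal sketch; llmfoundry is the package name the editable install registers, and the CUDA check is plain PyTorch):

# Sanity-check the environment after "pip install -e .[gpu]".
import torch
import llmfoundry  # fails if the editable install did not complete

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("GPU:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "none")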

OmarMohammed88 avatar May 22 '23 14:05 OmarMohammed88

What are the unexpected results?

warnings.warn('While attn_impl: triton can be faster than attn_impl: flash ' + 'it uses more memory. When training larger models this can trigger ' + 'alloc retries which hurts performance. If encountered, we recommend ' + 'using attn_impl: flash if your model does not use alibi or prefix_lm.')

is a user warning, i.e., a note to developers. It's not an actual issue.

Is there a detrimental issue?

vchiley avatar May 22 '23 15:05 vchiley

It doesn't output any text, just "^C".


OmarMohammed88 avatar May 22 '23 15:05 OmarMohammed88

Hi @OmarMohammed88 , following up here -- are you able to successfully run that script with a smaller model, say gpt2?

python /content/llm-foundry/scripts/inference/hf_generate.py --name_or_path 'gpt2' ...
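If even that hangs, a bare-bones transformers call (a minimal sketch, independent of llm-foundry) can tell whether the problem is the environment or just the 7B model size:

# Smoke test: generate a few tokens with a small model on the GPU.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2", device=0)  # device=0 -> first GPU
out = generator(
    "The answer to life, the universe, and happiness is",
    max_new_tokens=20,
    do_sample=True,
    top_p=0.95,
    top_k=50,
)
print(out[0]["generated_text"])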

I have a feeling that on a T4, trying to run the MPT model will default to bfloat16, which is probably only supported via a very slow path on the T4 (which has no tensor cores for BF16). So there is a chance the script is just taking a long time to run.
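One way to check that hypothesis from inside Colab (a minimal sketch; is_bf16_supported and get_device_capability are standard PyTorch calls):

import torch

# A T4 reports compute capability (7, 5); BF16 tensor cores arrived with
# Ampere (8, 0), so bf16 math on a T4 takes a slow non-tensor-core path.
print("GPU:", torch.cuda.get_device_name(0))
print("Compute capability:", torch.cuda.get_device_capability(0))
print("Native bf16 support:", torch.cuda.is_bf16_supported())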

Could you try running mosaicml/mpt-7b with --attn_impl torch and --model_dtype [fp32 or fp16]?
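Those flags map roughly onto a direct transformers call like this (a minimal sketch, assuming MPT's remote-code config exposes an attn_config dict; fp16 sidesteps the bf16 slow path on a T4):

import torch
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

name = "mosaicml/mpt-7b"
config = AutoConfig.from_pretrained(name, trust_remote_code=True)
config.attn_config["attn_impl"] = "torch"  # plain PyTorch attention, no triton kernels

model = AutoModelForCausalLM.from_pretrained(
    name,
    config=config,
    torch_dtype=torch.float16,  # fp16 rather than bf16: the T4 has fp16 tensor cores
    trust_remote_code=True,
).to("cuda")
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")  # tokenizer MPT-7B was trained with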

abhi-mosaic avatar May 31 '23 01:05 abhi-mosaic

Closing this issue for now. For a reference implementation of using model.generate, you can check out our hf_generate.py script here, which should work on CPU, GPU, multi-GPU, etc., and has minimal imports.

https://github.com/mosaicml/llm-foundry/blob/main/scripts/inference/hf_generate.py
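The heart of that script is the usual tokenize / generate / decode loop; continuing from the loading sketch above (model and tokenizer as defined there), it is roughly:

import torch

inputs = tokenizer(
    "The answer to life, the universe, and happiness is",
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=True,
        temperature=1.0,
        top_p=0.95,
        top_k=50,
    )

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))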

abhi-mosaic avatar Jun 13 '23 16:06 abhi-mosaic