
FasterTransformer

Open xgal opened this issue 1 year ago • 31 comments

Hi, I saw in the MPT model card that the models can run with FasterTransformer, but I didn't find any details about that anywhere. Can you share the conversion scripts or help there?

Thanks

xgal avatar May 08 '23 12:05 xgal

MPT is a GPT-style network. You'd want to create a conversion script, similar to this one, for converting the MPT HF model into the FT format. When we write it, it'll probably land in the llm-foundry/scripts/misc/ folder (or be directly contributed to FT).

vchiley avatar May 08 '23 16:05 vchiley

Thanks @vchiley. AFAIK MPT uses ALiBi and some other things that aren't native to the GPT FT version; maybe Bloom is more similar? From playing around with MPT in its HF version and trying to load it as GPT / Bloom (by renaming and loading state dicts), I get nonsense, so I wonder if there is some other implementation detail that prevents a straightforward port, such as a different ordering in the QKV layers.

wdyt ?

Thanks !!

xgal avatar May 08 '23 17:05 xgal

The *.c_*.* naming makes me think they use 1x1 conv layers instead of linear layers (functionally the same thing; for some reason early transformer implementations used to do this, e.g. here).

A 1x1 conv and a linear layer are functionally the same thing, but the weight tensors are transposes of one another. Try transposing the MPT weights before loading them into the FT conversion script 🤷‍♂️

vchiley avatar May 08 '23 17:05 vchiley
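
A minimal sketch of that weight-transpose idea, assuming a locally saved HF state dict (the file path and the rule for which tensors to transpose are illustrative guesses, not a verified recipe):

import torch

# Illustrative only: GPT-2-style "Conv1D" layers store weights as (in_features, out_features),
# whereas nn.Linear stores (out_features, in_features). If the FT converter assumes one layout
# and the checkpoint uses the other, transposing the projection weights is one thing to try.
state_dict = torch.load("pytorch_model.bin", map_location="cpu")  # path is a placeholder

for name, tensor in state_dict.items():
    # Only 2-D projection weights would need the layout change; embeddings, biases,
    # and norm parameters should be left alone (the "emb" filter here is a rough heuristic).
    if tensor.dim() == 2 and name.endswith(".weight") and "emb" not in name:
        state_dict[name] = tensor.t().contiguous()

torch.save(state_dict, "pytorch_model_transposed.bin")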

Any updates on this? Since all the MPT model cards describe built-in optimizations for FlashAttention and FasterTransformer, I'm curious why the FasterTransformer part wasn't tested before release. Or was it, and you just didn't get around to writing a conversion script?

Also, what about the Transformer Engine on H100s? How easy/difficult is it to make the model work with that (FP8)?

SinanAkkoyun avatar May 09 '23 15:05 SinanAkkoyun

Transformer Engine

We've played around with TE and H100 FP8 support. It works, and we'll include everything when we have more seat time with H100s so we can test everything more thoroughly.

vchiley avatar May 09 '23 16:05 vchiley

@vchiley Thank you so much! 🚀😍

SinanAkkoyun avatar May 09 '23 16:05 SinanAkkoyun

I also need the FT conversion script; it would be super helpful for me 🥰

meitalbensinai avatar May 10 '23 06:05 meitalbensinai

@vchiley You are also talking about using TransformerEngine for inference, right?

SinanAkkoyun avatar May 14 '23 16:05 SinanAkkoyun

This would be extremely helpful for me too!

therealadityashankar avatar May 15 '23 16:05 therealadityashankar

@xgal, @SinanAkkoyun, @meitalbensinai, @therealadityashankar We will soon add conversion and run scripts.

dskhudia avatar May 15 '23 20:05 dskhudia

I'd greatly appreciate that so so much!

therealadityashankar avatar May 15 '23 21:05 therealadityashankar

@xgal, @SinanAkkoyun, @meitalbensinai, @therealadityashankar: The conversion script was added in #169. Please give it a try and let us know if you run into any issues.

dskhudia avatar May 24 '23 20:05 dskhudia

@dskhudia this is nice, I would love a script to run this with tokenizers, or a minimal example, so I can check this out

therealadityashankar avatar May 25 '23 08:05 therealadityashankar

@therealadityashankar : That's coming up soon once I clean up some code and verify the results.

dskhudia avatar May 25 '23 16:05 dskhudia

@xgal, @SinanAkkoyun, @meitalbensinai, @therealadityashankar: The script to run the converted model has also been added. Let us know if you face any issues.

dskhudia avatar May 31 '23 17:05 dskhudia

Hi! What TPS is expected for the FT implementation?

SinanAkkoyun avatar May 31 '23 19:05 SinanAkkoyun

@SinanAkkoyun It depends on batching vs no batching, input/output lengths etc. but I see the following numbers on A100-40G for the 7B model: [screenshot with throughput numbers attached]

dskhudia avatar May 31 '23 19:05 dskhudia

@dskhudia Thank you very much!!! <3

SinanAkkoyun avatar May 31 '23 19:05 SinanAkkoyun

I converted the MPT-7B model into FasterTransformer format and served it as an HTTP service using Triton Inference Server; however, it failed to generate. E.g., with the input "the model does not work " the output is "the model does not work ���������������������", so the pattern is that it always outputs the prompt plus trailing � characters. Anyone know why? Thanks. @dskhudia

wuflyh avatar Jun 20 '23 01:06 wuflyh

@nik-mosaic had to get the setup right to make it generate. @nik-mosaic, please take a look.

dskhudia avatar Jun 20 '23 14:06 dskhudia

@wuflyh what does your config.pbtxt look like? It should be modeled after the GPT config.pbtxt, because MPT is essentially a modified GPT model.

nik-mosaic avatar Jun 20 '23 17:06 nik-mosaic

name: "mpt7b" backend: "fastertransformer" max_batch_size: 1024

model_transaction_policy { decoupled: False }

input [ { name: "input_ids" data_type: TYPE_UINT32 dims: [ -1 ] }, { name: "start_id" data_type: TYPE_UINT32 dims: [ 1 ] reshape: { shape: [ ] } optional: true }, { name: "end_id" data_type: TYPE_UINT32 dims: [ 1 ] reshape: { shape: [ ] } optional: true }, { name: "input_lengths" data_type: TYPE_UINT32 dims: [ 1 ] reshape: { shape: [ ] } }, { name: "request_output_len" data_type: TYPE_UINT32 dims: [ -1 ] }, { name: "runtime_top_k" data_type: TYPE_UINT32 dims: [ 1 ] reshape: { shape: [ ] } optional: true }, { name: "runtime_top_p" data_type: TYPE_FP32 dims: [ 1 ] reshape: { shape: [ ] } optional: true }, { name: "beam_search_diversity_rate" data_type: TYPE_FP32 dims: [ 1 ] reshape: { shape: [ ] } optional: true }, { name: "temperature" data_type: TYPE_FP32 dims: [ 1 ] reshape: { shape: [ ] } optional: true }, { name: "len_penalty" data_type: TYPE_FP32 dims: [ 1 ] reshape: { shape: [ ] } optional: true }, { name: "repetition_penalty" data_type: TYPE_FP32 dims: [ 1 ] reshape: { shape: [ ] } optional: true }, { name: "random_seed" data_type: TYPE_UINT64 dims: [ 1 ] reshape: { shape: [ ] } optional: true }, { name: "is_return_log_probs" data_type: TYPE_BOOL dims: [ 1 ] reshape: { shape: [ ] } optional: true }, { name: "beam_width" data_type: TYPE_UINT32 dims: [ 1 ] reshape: { shape: [ ] } optional: true }, { name: "bad_words_list" data_type: TYPE_INT32 dims: [ 2, -1 ] optional: true }, { name: "stop_words_list" data_type: TYPE_INT32 dims: [ 2, -1 ] optional: true }, { name: "prompt_learning_task_name_ids" data_type: TYPE_UINT32 dims: [ 1 ] reshape: { shape: [ ] } optional: true }, { name: "top_p_decay" data_type: TYPE_FP32 dims: [ 1 ] reshape: { shape: [ ] } optional: true }, { name: "top_p_min" data_type: TYPE_FP32 dims: [ 1 ] reshape: { shape: [ ] } optional: true }, { name: "top_p_reset_ids" data_type: TYPE_UINT32 dims: [ 1 ] reshape: { shape: [ ] } optional: true } ] output [ { name: "output_ids" data_type: TYPE_UINT32 dims: [ -1, -1 ] }, { name: "sequence_length" data_type: TYPE_UINT32 dims: [ -1 ] }, { name: "cum_log_probs" data_type: TYPE_FP32 dims: [ -1 ] }, { name: "output_log_probs" data_type: TYPE_FP32 dims: [ -1, -1 ] } ] instance_group [ { count: 1 kind: KIND_CPU } ] parameters { key: "tensor_para_size" value: { string_value: "2" } } parameters { key: "pipeline_para_size" value: { string_value: "1" } } parameters { key: "data_type" value: { string_value: "fp16" } } parameters { key: "model_type" value: { string_value: "GPT" } } parameters { key: "model_checkpoint_path" value: { string_value: "/models/mpt7b/1/2-gpu/" } } parameters { key: "enable_custom_all_reduce" value: { string_value: "0" } }

wuflyh avatar Jun 20 '23 18:06 wuflyh
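
For reference, a minimal Python client sketch against a config like the one above, using tritonclient. The server URL, model name, and hard-coded token ids are placeholders; in practice the prompt would be encoded (and the output decoded) with the MPT tokenizer.

import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")  # placeholder URL

# Dummy prompt ids; real usage would tokenize the prompt with the MPT tokenizer.
input_ids = np.array([[1, 2, 3, 4]], dtype=np.uint32)
input_lengths = np.array([[input_ids.shape[1]]], dtype=np.uint32)
request_output_len = np.array([[64]], dtype=np.uint32)

# Tensor names and dtypes mirror the config.pbtxt above.
inputs = []
for name, arr in [("input_ids", input_ids),
                  ("input_lengths", input_lengths),
                  ("request_output_len", request_output_len)]:
    tensor = httpclient.InferInput(name, list(arr.shape), "UINT32")
    tensor.set_data_from_numpy(arr)
    inputs.append(tensor)

result = client.infer("mpt7b", inputs)  # model name as registered with Triton
print(result.as_numpy("output_ids"))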

Can you also paste your error trace? Does the server fail to start, or does it load but throw an error when you try to run inference?

nik-mosaic avatar Jun 20 '23 18:06 nik-mosaic

@nik-mosaic yes, I am following the GPT config.pbtxt and the above is my file. I generated the FasterTransformer format for 2 GPUs, so I set "tensor_para_size" to 2.

wuflyh avatar Jun 20 '23 18:06 wuflyh

The Triton Inference Server launches normally, and there is no error in the log when inferencing.

wuflyh avatar Jun 20 '23 18:06 wuflyh

Your configs all look reasonable and it is normally a good sign if the server launches and there are no errors during inference.

Do you also get this result with MPT-7B generated for 1 GPU and tensor_para_size = 1? I would be very interested if only the 2-GPU setup is not working for you. We have only tested 1 GPU since MPT-7B fits on a single GPU for us, but we want to fully support multi-GPU inference.

nik-mosaic avatar Jun 20 '23 18:06 nik-mosaic

The issue remains the same on 1 GPU (which I generated using "convert_hf_mpt_to_ft.py -o mpt7b_1gpu -i mosaicml/mpt-7b-instruct -i_g 1").

wuflyh avatar Jun 20 '23 18:06 wuflyh

@nik-mosaic what version of triton-ft-backend do you use to serve the mpt7b model? Thanks.

wuflyh avatar Jun 20 '23 21:06 wuflyh

@wuflyh Triton Inference Server: 23.04, and the main branch of the FasterTransformer backend repository.

nik-mosaic avatar Jun 21 '23 22:06 nik-mosaic

Thanks @nik-mosaic that might be the reason. I was using 22.12. I will install 23.04 to give it a try.

wuflyh avatar Jun 21 '23 22:06 wuflyh