llm-foundry
FasterTransformer
Hi, I saw in the MPT model card that the models can run with FasterTransformer, but I didn't find any details about that anywhere. Can you guys share the conversion scripts or help with that?
Thanks
MPT is a GPT-style network. You'd want to create a conversion script, similar to this one, for converting the MPT HF model into the FT format. When we write it, it'll probably land in the llm-foundry/scripts/misc/ folder (or be directly contributed to FT).
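For intuition, here is a rough sketch of what such a conversion could look like. The FT tensor names and the direct name remapping below are assumptions modeled on the existing GPT converters, not the actual llm-foundry script, and tensor-parallel splitting/reshaping is omitted:

```python
# Hypothetical sketch: dump MPT HF weights into FT-style per-tensor .bin files.
# The name mapping below is an assumption based on GPT conversion scripts.
import os
import torch
from transformers import AutoModelForCausalLM

def convert_mpt_to_ft(hf_name: str, out_dir: str) -> None:
    model = AutoModelForCausalLM.from_pretrained(hf_name, trust_remote_code=True)
    os.makedirs(out_dir, exist_ok=True)
    for name, param in model.state_dict().items():
        # e.g. transformer.blocks.0.attn.Wqkv.weight ->
        #      model.layers.0.attention.query_key_value.weight  (assumed mapping)
        ft_name = (name
                   .replace('transformer.blocks.', 'model.layers.')
                   .replace('attn.Wqkv', 'attention.query_key_value')
                   .replace('attn.out_proj', 'attention.dense')
                   .replace('ffn.up_proj', 'mlp.dense_h_to_4h')
                   .replace('ffn.down_proj', 'mlp.dense_4h_to_h'))
        # FT stores weights as raw binary blobs, one file per tensor.
        param.to(torch.float16).cpu().numpy().tofile(
            os.path.join(out_dir, f'{ft_name}.bin'))

# convert_mpt_to_ft('mosaicml/mpt-7b', 'mpt7b-ft/1-gpu')
```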
Thanks @vchiley. AFAIK it uses ALiBi and some other things that aren't native to the GPT FT version; maybe Bloom is more similar? From playing around with MPT in its HF version and trying to load it as GPT / Bloom (by renaming and loading state dicts), I get nonsense, so I wonder whether some other implementation detail prevents this from being straightforward, such as a different ordering in the QKV layers.
wdyt ?
Thanks !!
The *.c_*.* naming makes me think they use 1x1 conv layers instead of linear layers (functionally the same thing; for some reason early transformer implementations used to do this, e.g. here).
A 1x1 conv and a linear layer are functionally the same thing, but the weight tensors are transposes of one another. Try transposing the MPT weights before loading them into the FT conversion script 🤷♂️
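To make the transpose point concrete, here is a minimal PyTorch check using HF's GPT-2-style Conv1D (import path may vary across transformers versions):

```python
import torch
import torch.nn as nn
from transformers.pytorch_utils import Conv1D

linear = nn.Linear(4, 8, bias=False)  # weight shape: (8, 4) = (out, in)
conv = Conv1D(8, 4)                   # weight shape: (4, 8) = (in, out), bias starts at zero

# Copy the linear weight into the conv-style layer: note the transpose.
with torch.no_grad():
    conv.weight.copy_(linear.weight.t())

x = torch.randn(2, 4)
assert torch.allclose(linear(x), conv(x), atol=1e-6)  # functionally identical
```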
Any updates on this? Since all MPT model cards describe the built-in optimizations for FlashAttention and FasterTransformer, I am curious why the FasterTransformer part wasn't tested before release? Or was it tested and you just didn't get around to writing a conversion script?
Also, what about the Transformer Engine on H100s? How easy/difficult would it be to make the model work with that (FP8)?
Transformer Engine
We've played around with TE and H100 FP8 support. It works, and we'll include everything once we have more seat time with H100s so we can test everything more thoroughly.
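This is not MosaicML's integration, but for reference, a minimal sketch of the upstream Transformer Engine FP8 API on an H100-class GPU looks roughly like this:

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

# A single TE linear layer; in practice the attention/MLP blocks would be swapped out.
layer = te.Linear(4096, 4096, bias=True, params_dtype=torch.bfloat16).cuda()
x = torch.randn(16, 4096, device='cuda', dtype=torch.bfloat16)

# HYBRID: E4M3 for forward, E5M2 for backward; GEMMs run in FP8 inside the context.
fp8_recipe = DelayedScaling(fp8_format=Format.HYBRID, amax_history_len=16)
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)
```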
@vchiley Thank you so much! 🚀😍
I also need the FT conversion script; it would be super helpful for me 🥰
@vchiley You are also talking about using TransformerEngine in inference, right?
This would be extremely helpful for me too!
@xgal, @SinanAkkoyun, @meitalbensinai, @therealadityashankar We will soon add conversion and run scripts.
I'd greatly appreciate that so so much!
@xgal, @SinanAkkoyun, @meitalbensinai, @therealadityashankar: The conversion script was added in #169. Please give it a try and let us know if you run into any issues.
@dskhudia this is nice. I would love a script to run this with tokenizers, or a minimal example, so I can check this out.
@therealadityashankar: That's coming up soon once I clean up some code and verify the results.
@xgal, @SinanAkkoyun, @meitalbensinai, @therealadityashankar: The script to run the converted model has also been added. Let us know if you face any issues.
Hi! What TPS is expected for the FT implementation?
@SinanAkkoyun It depends on batching vs. no batching, input/output lengths, etc., but I see the following numbers on A100-40G for the 7B model:
@dskhudia Thank you very much!!! <3
I converted the MPT-7B model into FasterTransformer format and served it as an HTTP service using Triton Inference Server; however, it fails to generate. E.g., with the input "the model does not work " the output is "the model does not work ���������������������", so the pattern is that it always outputs the prompt plus trailing � characters. Does anyone know why? Thanks. @dskhudia
@nik-mosaic had to get the setup right to make it generate. @nik-mosaic, please take a look.
@wuflyh what does your config.pbtxt look like? It should be modeled after the GPT config.pbtxt, because MPT is essentially a modified GPT model.
name: "mpt7b" backend: "fastertransformer" max_batch_size: 1024
model_transaction_policy { decoupled: False }
input [ { name: "input_ids" data_type: TYPE_UINT32 dims: [ -1 ] }, { name: "start_id" data_type: TYPE_UINT32 dims: [ 1 ] reshape: { shape: [ ] } optional: true }, { name: "end_id" data_type: TYPE_UINT32 dims: [ 1 ] reshape: { shape: [ ] } optional: true }, { name: "input_lengths" data_type: TYPE_UINT32 dims: [ 1 ] reshape: { shape: [ ] } }, { name: "request_output_len" data_type: TYPE_UINT32 dims: [ -1 ] }, { name: "runtime_top_k" data_type: TYPE_UINT32 dims: [ 1 ] reshape: { shape: [ ] } optional: true }, { name: "runtime_top_p" data_type: TYPE_FP32 dims: [ 1 ] reshape: { shape: [ ] } optional: true }, { name: "beam_search_diversity_rate" data_type: TYPE_FP32 dims: [ 1 ] reshape: { shape: [ ] } optional: true }, { name: "temperature" data_type: TYPE_FP32 dims: [ 1 ] reshape: { shape: [ ] } optional: true }, { name: "len_penalty" data_type: TYPE_FP32 dims: [ 1 ] reshape: { shape: [ ] } optional: true }, { name: "repetition_penalty" data_type: TYPE_FP32 dims: [ 1 ] reshape: { shape: [ ] } optional: true }, { name: "random_seed" data_type: TYPE_UINT64 dims: [ 1 ] reshape: { shape: [ ] } optional: true }, { name: "is_return_log_probs" data_type: TYPE_BOOL dims: [ 1 ] reshape: { shape: [ ] } optional: true }, { name: "beam_width" data_type: TYPE_UINT32 dims: [ 1 ] reshape: { shape: [ ] } optional: true }, { name: "bad_words_list" data_type: TYPE_INT32 dims: [ 2, -1 ] optional: true }, { name: "stop_words_list" data_type: TYPE_INT32 dims: [ 2, -1 ] optional: true }, { name: "prompt_learning_task_name_ids" data_type: TYPE_UINT32 dims: [ 1 ] reshape: { shape: [ ] } optional: true }, { name: "top_p_decay" data_type: TYPE_FP32 dims: [ 1 ] reshape: { shape: [ ] } optional: true }, { name: "top_p_min" data_type: TYPE_FP32 dims: [ 1 ] reshape: { shape: [ ] } optional: true }, { name: "top_p_reset_ids" data_type: TYPE_UINT32 dims: [ 1 ] reshape: { shape: [ ] } optional: true } ] output [ { name: "output_ids" data_type: TYPE_UINT32 dims: [ -1, -1 ] }, { name: "sequence_length" data_type: TYPE_UINT32 dims: [ -1 ] }, { name: "cum_log_probs" data_type: TYPE_FP32 dims: [ -1 ] }, { name: "output_log_probs" data_type: TYPE_FP32 dims: [ -1, -1 ] } ] instance_group [ { count: 1 kind: KIND_CPU } ] parameters { key: "tensor_para_size" value: { string_value: "2" } } parameters { key: "pipeline_para_size" value: { string_value: "1" } } parameters { key: "data_type" value: { string_value: "fp16" } } parameters { key: "model_type" value: { string_value: "GPT" } } parameters { key: "model_checkpoint_path" value: { string_value: "/models/mpt7b/1/2-gpu/" } } parameters { key: "enable_custom_all_reduce" value: { string_value: "0" } }
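For reference, a minimal Python client sketch against a config like the one above might look as follows; the URL, model name, output length, and tokenizer choice here are assumptions:

```python
import numpy as np
import tritonclient.http as httpclient
from transformers import AutoTokenizer

# MPT models use the GPT-NeoX-20B tokenizer.
tokenizer = AutoTokenizer.from_pretrained('EleutherAI/gpt-neox-20b')
client = httpclient.InferenceServerClient(url='localhost:8000')

prompt = 'the model does not work '
ids = np.array([tokenizer.encode(prompt)], dtype=np.uint32)  # shape: (1, seq_len)

inputs = [
    httpclient.InferInput('input_ids', list(ids.shape), 'UINT32'),
    httpclient.InferInput('input_lengths', [1, 1], 'UINT32'),
    httpclient.InferInput('request_output_len', [1, 1], 'UINT32'),
]
inputs[0].set_data_from_numpy(ids)
inputs[1].set_data_from_numpy(np.array([[ids.shape[1]]], dtype=np.uint32))
inputs[2].set_data_from_numpy(np.array([[64]], dtype=np.uint32))

result = client.infer('mpt7b', inputs)
output_ids = result.as_numpy('output_ids')   # shape: (batch, beam, seq)
print(tokenizer.decode(output_ids[0][0]))
```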
Can you also paste your error trace? Does the server fail to start, or does it load but throw an error when you try to run inference?
@nik-mosaic yes, I am following the GPT config.pbtxt, and the above is my file. I generated the FasterTransformer format for 2 GPUs, so I set "tensor_para_size" to 2.
The Triton Inference Server launches normally, and there is no error in the log when inferencing.
Your configs all look reasonable and it is normally a good sign if the server launches and there are no errors during inference.
Do you also get this result with MPT-7B generated for 1 GPU and tensor_para_size = 1? I would be very interested if only the 2-GPU version is not working for you. We have only tested 1 GPU, since MPT-7B fits on a single GPU for us, but we want to fully support multi-GPU inference.
The issue remains the same on 1 GPU (which I generated using "convert_hf_mpt_to_ft.py -o mpt7b_1gpu -i mosaicml/mpt-7b-instruct -i_g 1").
@nik-mosaic what version of triton-ft-backend do you use to serve the mpt7b model? Thanks.
@wuflyh Triton Inference Server: 23.04, and the main branch of the FasterTransformer backend repository.
Thanks @nik-mosaic, that might be the reason. I was using 22.12. I will install 23.04 and give it a try.