llm-foundry
FasterTransformer
Hi, I saw in the MPT model card that the models can run with FasterTransformer, but I didn't find any details about that anywhere. Can you guys share the conversion scripts or help with that?
Thanks
MPT is a GPT-style network. You'd want to create a conversion script, similar to this one, for converting the MPT HF model into the FT format. When we write it, it'll probably land in the llm-foundry/scripts/misc/ folder (or be directly contributed to FT).
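For intuition, here is a rough sketch of what such a conversion could look like. The FT tensor names and the direct name remapping below are assumptions modeled on the existing GPT converters, not the actual llm-foundry script, and tensor-parallel splitting/reshaping is omitted:

```python
# Hypothetical sketch: dump MPT HF weights into FT-style per-tensor .bin files.
# The name mapping below is an assumption based on GPT conversion scripts.
import os
import torch
from transformers import AutoModelForCausalLM

def convert_mpt_to_ft(hf_name: str, out_dir: str) -> None:
    model = AutoModelForCausalLM.from_pretrained(hf_name, trust_remote_code=True)
    os.makedirs(out_dir, exist_ok=True)
    for name, param in model.state_dict().items():
        # e.g. transformer.blocks.0.attn.Wqkv.weight ->
        #      model.layers.0.attention.query_key_value.weight  (assumed mapping)
        ft_name = (name
                   .replace('transformer.blocks.', 'model.layers.')
                   .replace('attn.Wqkv', 'attention.query_key_value')
                   .replace('attn.out_proj', 'attention.dense')
                   .replace('ffn.up_proj', 'mlp.dense_h_to_4h')
                   .replace('ffn.down_proj', 'mlp.dense_4h_to_h'))
        # FT stores weights as raw binary blobs, one file per tensor.
        param.to(torch.float16).cpu().numpy().tofile(
            os.path.join(out_dir, f'{ft_name}.bin'))

# convert_mpt_to_ft('mosaicml/mpt-7b', 'mpt7b-ft/1-gpu')
```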
Thanks @vchiley. AFAIK it uses ALiBi and some other things that aren't native to the GPT FT version; maybe Bloom is more similar? From playing around with MPT in its HF version and trying to load it as GPT / Bloom (by renaming and loading state dicts), I get nonsense, so I wonder whether some other implementation detail prevents this from being straightforward, such as a different ordering in the QKV layers.
wdyt ?
Thanks !!
The *.c_*.* naming makes me think they use 1x1 conv layers instead of linear layers (functionally the same thing; for some reason early transformer implementations used to do this, e.g. here).
A 1x1 conv and a linear layer are functionally the same thing, but the weight tensors are transposes of one another. Try transposing the MPT weights before loading them into the FT conversion script 🤷♂️
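To make the transpose point concrete, here is a minimal PyTorch check using HF's GPT-2-style Conv1D (import path may vary across transformers versions):

```python
import torch
import torch.nn as nn
from transformers.pytorch_utils import Conv1D

linear = nn.Linear(4, 8, bias=False)  # weight shape: (8, 4) = (out, in)
conv = Conv1D(8, 4)                   # weight shape: (4, 8) = (in, out), bias starts at zero

# Copy the linear weight into the conv-style layer: note the transpose.
with torch.no_grad():
    conv.weight.copy_(linear.weight.t())

x = torch.randn(2, 4)
assert torch.allclose(linear(x), conv(x), atol=1e-6)  # functionally identical
```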
Any updates on this? Since all MPT model cards describe the built-in optimizations for FlashAttention and FasterTransformer, I am curious why the FasterTransformer part wasn't tested before release? Or was it tested and you just didn't get around to writing a conversion script?
Also, what about the Transformer Engine on H100s? How easy/difficult would it be to make the model work with that (FP8)?
Transformer Engine
We've played around with TE and H100 FP8 support. It works, and we'll include everything once we have more seat time with H100s so we can test everything more thoroughly.
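This is not MosaicML's integration, but for reference, a minimal sketch of the upstream Transformer Engine FP8 API on an H100-class GPU looks roughly like this:

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

# A single TE linear layer; in practice the attention/MLP blocks would be swapped out.
layer = te.Linear(4096, 4096, bias=True, params_dtype=torch.bfloat16).cuda()
x = torch.randn(16, 4096, device='cuda', dtype=torch.bfloat16)

# HYBRID: E4M3 for forward, E5M2 for backward; GEMMs run in FP8 inside the context.
fp8_recipe = DelayedScaling(fp8_format=Format.HYBRID, amax_history_len=16)
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)
```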
@vchiley Thank you so much! 🚀😍
I also need the FT conversion script; it would be super helpful for me 🥰
@vchiley You are also talking about using TransformerEngine in inference, right?
This would be extremely helpful for me too!
@xgal, @SinanAkkoyun, @meitalbensinai, @therealadityashankar We will soon add conversion and run scripts.
I'd greatly appreciate that so so much!
@xgal, @SinanAkkoyun, @meitalbensinai, @therealadityashankar: The conversion script was added in #169. Please give it a try and let us know if you run into any issues.
@dskhudia this is nice. I would love a script to run this with tokenizers, or a minimal example, so I can check this out.
@therealadityashankar: That's coming up soon once I clean up some code and verify the results.
@xgal, @SinanAkkoyun, @meitalbensinai, @therealadityashankar: The script to run the converted model has also been added. Let us know if you face any issues.
Hi! What TPS is expected for the FT implementation?
@SinanAkkoyun It depends on batching vs. no batching, input/output lengths, etc., but I see the following numbers on A100-40G for the 7B model:
@dskhudia Thank you very much!!! <3
I converted the MPT-7B model into FasterTransformer format and served it as an HTTP service using Triton Inference Server; however, it fails to generate. E.g., with the input "the model does not work " the output is "the model does not work ���������������������", so the pattern is that it always outputs the prompt plus trailing � characters. Does anyone know why? Thanks. @dskhudia
@nik-mosaic had to get the setup right to make it generate. @nik-mosaic, please take a look.
@wuflyh what does your config.pbtxt look like? It should be modeled after the GPT config.pbtxt, because MPT is essentially a modified GPT model.
name: "mpt7b" backend: "fastertransformer" max_batch_size: 1024
model_transaction_policy { decoupled: False }
input [ { name: "input_ids" data_type: TYPE_UINT32 dims: [ -1 ] }, { name: "start_id" data_type: TYPE_UINT32 dims: [ 1 ] reshape: { shape: [ ] } optional: true }, { name: "end_id" data_type: TYPE_UINT32 dims: [ 1 ] reshape: { shape: [ ] } optional: true }, { name: "input_lengths" data_type: TYPE_UINT32 dims: [ 1 ] reshape: { shape: [ ] } }, { name: "request_output_len" data_type: TYPE_UINT32 dims: [ -1 ] }, { name: "runtime_top_k" data_type: TYPE_UINT32 dims: [ 1 ] reshape: { shape: [ ] } optional: true }, { name: "runtime_top_p" data_type: TYPE_FP32 dims: [ 1 ] reshape: { shape: [ ] } optional: true }, { name: "beam_search_diversity_rate" data_type: TYPE_FP32 dims: [ 1 ] reshape: { shape: [ ] } optional: true }, { name: "temperature" data_type: TYPE_FP32 dims: [ 1 ] reshape: { shape: [ ] } optional: true }, { name: "len_penalty" data_type: TYPE_FP32 dims: [ 1 ] reshape: { shape: [ ] } optional: true }, { name: "repetition_penalty" data_type: TYPE_FP32 dims: [ 1 ] reshape: { shape: [ ] } optional: true }, { name: "random_seed" data_type: TYPE_UINT64 dims: [ 1 ] reshape: { shape: [ ] } optional: true }, { name: "is_return_log_probs" data_type: TYPE_BOOL dims: [ 1 ] reshape: { shape: [ ] } optional: true }, { name: "beam_width" data_type: TYPE_UINT32 dims: [ 1 ] reshape: { shape: [ ] } optional: true }, { name: "bad_words_list" data_type: TYPE_INT32 dims: [ 2, -1 ] optional: true }, { name: "stop_words_list" data_type: TYPE_INT32 dims: [ 2, -1 ] optional: true }, { name: "prompt_learning_task_name_ids" data_type: TYPE_UINT32 dims: [ 1 ] reshape: { shape: [ ] } optional: true }, { name: "top_p_decay" data_type: TYPE_FP32 dims: [ 1 ] reshape: { shape: [ ] } optional: true }, { name: "top_p_min" data_type: TYPE_FP32 dims: [ 1 ] reshape: { shape: [ ] } optional: true }, { name: "top_p_reset_ids" data_type: TYPE_UINT32 dims: [ 1 ] reshape: { shape: [ ] } optional: true } ] output [ { name: "output_ids" data_type: TYPE_UINT32 dims: [ -1, -1 ] }, { name: "sequence_length" data_type: TYPE_UINT32 dims: [ -1 ] }, { name: "cum_log_probs" data_type: TYPE_FP32 dims: [ -1 ] }, { name: "output_log_probs" data_type: TYPE_FP32 dims: [ -1, -1 ] } ] instance_group [ { count: 1 kind: KIND_CPU } ] parameters { key: "tensor_para_size" value: { string_value: "2" } } parameters { key: "pipeline_para_size" value: { string_value: "1" } } parameters { key: "data_type" value: { string_value: "fp16" } } parameters { key: "model_type" value: { string_value: "GPT" } } parameters { key: "model_checkpoint_path" value: { string_value: "/models/mpt7b/1/2-gpu/" } } parameters { key: "enable_custom_all_reduce" value: { string_value: "0" } }
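For reference, a minimal Python client sketch against a config like the one above might look as follows; the URL, model name, output length, and tokenizer choice here are assumptions:

```python
import numpy as np
import tritonclient.http as httpclient
from transformers import AutoTokenizer

# MPT models use the GPT-NeoX-20B tokenizer.
tokenizer = AutoTokenizer.from_pretrained('EleutherAI/gpt-neox-20b')
client = httpclient.InferenceServerClient(url='localhost:8000')

prompt = 'the model does not work '
ids = np.array([tokenizer.encode(prompt)], dtype=np.uint32)  # shape: (1, seq_len)

inputs = [
    httpclient.InferInput('input_ids', list(ids.shape), 'UINT32'),
    httpclient.InferInput('input_lengths', [1, 1], 'UINT32'),
    httpclient.InferInput('request_output_len', [1, 1], 'UINT32'),
]
inputs[0].set_data_from_numpy(ids)
inputs[1].set_data_from_numpy(np.array([[ids.shape[1]]], dtype=np.uint32))
inputs[2].set_data_from_numpy(np.array([[64]], dtype=np.uint32))

result = client.infer('mpt7b', inputs)
output_ids = result.as_numpy('output_ids')   # shape: (batch, beam, seq)
print(tokenizer.decode(output_ids[0][0]))
```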
Can you also paste your error trace? Does the server fail to start, or does it load but throw an error when you try to run inference?
@nik-mosaic yes, I am following the GPT config.pbtxt, and the above is my file. I generated the FasterTransformer format for 2 GPUs, so I set "tensor_para_size" to 2.
The Triton Inference Server launches normally, and there is no error in the log when inferencing.
Your configs all look reasonable and it is normally a good sign if the server launches and there are no errors during inference.
Do you also get this result with MPT-7B generated for 1 GPU and tensor_para_size = 1? I would be very interested if only the 2-GPU version is not working for you. We have only tested 1 GPU, since MPT-7B fits on a single GPU for us, but we want to fully support multi-GPU inference.
The issue remains the same on 1 GPU (which I generated using "convert_hf_mpt_to_ft.py -o mpt7b_1gpu -i mosaicml/mpt-7b-instruct -i_g 1").
@nik-mosaic what version of triton-ft-backend do you use to serve the mpt7b model? Thanks.
@wuflyh Triton Inference Server: 23.04, and the main branch of the FasterTransformer backend repository.
Thanks @nik-mosaic, that might be the reason. I was using 22.12. I will install 23.04 and give it a try.