llm-foundry
FasterTransformer support for storywriter
🚀 Feature Request
The script located under scripts/inference for converting an HF checkpoint to FT format doesn't work for MPT-7B-Storywriter because it has clip_qkv = 6, unlike the other MPT-7B models, which have clip_qkv = null. Loading the model without the clip_qkv clamp enabled led to gibberish generations.
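For context, clip_qkv = 6 means the model clamps every element of the fused QKV projection to [-6, 6] before attention. Here is a minimal C++ sketch of that clamp (illustrative only, not llm-foundry or FasterTransformer code; the helper name is made up):

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

// Illustrative sketch only: with clip_qkv = 6, MPT clamps each element of the
// fused QKV projection output to [-6, 6] before attention. Stock FT kernels
// skip this step, which is the presumed cause of the gibberish generations.
void clip_qkv_inplace(std::vector<float>& qkv, float clip_qkv = 6.0f) {
    for (float& v : qkv) {
        v = std::max(-clip_qkv, std::min(v, clip_qkv));
    }
}

int main() {
    std::vector<float> qkv = {-9.5f, -2.0f, 0.0f, 3.25f, 7.75f};
    clip_qkv_inplace(qkv);
    for (float v : qkv) {
        printf("%g ", v);  // prints: -6 -2 0 3.25 6
    }
    printf("\n");
    return 0;
}
```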
There are two reasons why I am looking for FasterTransformer support:
- StoryWriter is used for generating long texts, which takes longer and requires more efficient inference to reduce latency.
- Generating long texts also requires more GPU memory, so I need a memory-efficient serving option that supports model / tensor parallelism for the StoryWriter model.
- FasterTransformer seems to be the closest option.
However, please let me know if there is an alternative. I am also curious to learn how MosaicML served storywriter with a 64k context window.
As you pointed out, StoryWriter has clip_qkv and currently doesn't work with FT. We have two options: 1) adding clipping support in FT, or 2) creating a fine-tuned version of StoryWriter that doesn't use clip_qkv.
Currently, neither 1) nor 2) exists. We have been wanting to do 1) but haven't gotten to it yet.
I am able to get reasonable results from FasterTransformer + MPT-7B-Storywriter with two changes to FasterTransformer:
- After src/fastertransformer/kernels/unfused_attention_kernels.cu Line 1287, add `val = max(-6.f, min(val, 6.f));`. This is the clamp / clip operation on qkv.
- Replace src/fastertransformer/kernels/gen_relative_pos_bias.cu Line 220 with `alibi_slopes[h] = static_cast<T>(powf(powf(0.5f, powf(0.5f, log2f(num_heads_pow2) - 3.f)), (h + 1) * 2));`. This reflects the `"alibi_bias_max": 16` entry in the model config (other MPT-7B models use 8 instead of 16).
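For anyone double-checking the slope change, here is a minimal standalone sketch (my own check, not FasterTransformer code; the power-of-two head count of 32 and the reference expression are assumptions based on MPT-7B's config) comparing the modified formula against 2^(-alibi_bias_max * (h + 1) / num_heads) with alibi_bias_max = 16:

```cpp
#include <math.h>
#include <stdio.h>

// Standalone check (not FasterTransformer code): for a power-of-two head count
// and alibi_bias_max = 16, MPT's ALiBi slope for head h (0-indexed) is
// 2^(-alibi_bias_max * (h + 1) / num_heads). Doubling the exponent to
// (h + 1) * 2 in FT's slope formula reproduces this, since the stock formula
// corresponds to alibi_bias_max = 8.
int main() {
    const int   num_heads_pow2 = 32;    // assumed head count for MPT-7B
    const float alibi_bias_max = 16.f;  // from MPT-7B-Storywriter's config

    for (int h = 0; h < num_heads_pow2; ++h) {
        float modified_ft = powf(
            powf(0.5f, powf(0.5f, log2f((float)num_heads_pow2) - 3.f)), (h + 1) * 2);
        float reference = powf(2.f, -alibi_bias_max * (h + 1) / num_heads_pow2);
        printf("head %2d: modified FT slope = %.6g, expected = %.6g\n", h, modified_ft, reference);
    }
    return 0;
}
```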
I am running into this same issue. The provided script to convert storywriter HF to FT leads to this error:
RuntimeError: clip_qkv is enabled for this MPT model. This may not work as expected in FT. Use --force to force a conversion.
You can just set --force to export the model and use a modified FasterTransformer (by following my comment above).
I am able to get reasonable results from FasterTransformer + MPT-7B-Storywriter with two changes to FasterTransformer
Thanks for documenting this. Not sure if FasterTransformer would accept a PR for MPT support given that you need to modify the CUDA code. Would be good to see support for it though, given that the Storywriter model is an important use case.
Closing as I believe this has been fully answered. Please open a new issue if you still have questions.