
FasterTransformer support for storywriter


🚀 Feature Request

The script located under scripts/inference for converting an HF checkpoint to FT format doesn't work for MPT-7B-Storywriter, because it has clip_qkv = 6 unlike the other MPT-7B models, which have clip_qkv = null. Loading the model without applying clip_qkv leads to gibberish generations.
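
For context, clip_qkv just clamps every element of the fused QKV projection to [-clip_qkv, clip_qkv] before attention is computed. A minimal illustrative sketch of the semantics (not the actual HF or FT code; the function name is mine):

    #include <algorithm>
    #include <cstddef>

    // clip_qkv clamps each element of the fused QKV projection to
    // [-clip_qkv, clip_qkv]. StoryWriter ships with clip_qkv = 6; the other
    // MPT-7B variants have clip_qkv = null, i.e. this step is skipped.
    // Skipping it for StoryWriter leaves out-of-range activations in place,
    // which is presumably why generations degrade into gibberish.
    void apply_clip_qkv(float* qkv, std::size_t n, float clip_qkv) {
        for (std::size_t i = 0; i < n; ++i) {
            qkv[i] = std::max(-clip_qkv, std::min(qkv[i], clip_qkv));
        }
    }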

lorabit110 avatar Jun 22 '23 02:06 lorabit110

There are two reasons why I am looking for FasterTransformer support:

  1. StoryWriter is used for generating long texts, which takes longer, so I need more efficient inference to shorten the latency.
  2. Generating long texts also needs more GPU memory, so I need a memory-efficient option that supports model / tensor parallelism for serving the StoryWriter model.

FasterTransformer seems to be the closest option. However, please let me know if there is an alternative. I am also curious to learn how MosaicML served StoryWriter with a 64k context window.

lorabit110 avatar Jun 27 '23 07:06 lorabit110

As you pointed out, StoryWriter has clip_qkv and currently doesn't work with FT. We have two options: 1) adding clipping support in FT, or 2) creating a fine-tuned version of StoryWriter that doesn't use clip_qkv.

Currently, neither 1) nor 2) exists. We have been wanting to do 1) but haven't gotten to it yet.

dskhudia avatar Jun 27 '23 07:06 dskhudia

I am able to get reasonable results from FasterTransformer + MPT-7B-Storywriter with two changes to FasterTransformer (a standalone sketch of the math follows the list):

  1. After src/fastertransformer/kernels/unfused_attention_kernels.cu Line 1287, add val = max(-6.f, min(val, 6.f));. This is the clamp / clip operation on QKV (clip_qkv = 6).
  2. Replace src/fastertransformer/kernels/gen_relative_pos_bias.cu Line 220 with alibi_slopes[h] = static_cast<T>(powf(powf(0.5f, powf(0.5f, log2f(num_heads_pow2) - 3.f)), (h + 1) * 2));. This reflects the "alibi_bias_max": 16 entry in the model config (other MPT-7B models use 8).
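
To see why the (h + 1) * 2 exponent gives alibi_bias_max = 16: MPT defines the ALiBi slope for head h as 2^(-(h + 1) * alibi_bias_max / num_heads_pow2), and FT's stock expression powf(powf(0.5f, powf(0.5f, log2f(num_heads_pow2) - 3.f)), h + 1) simplifies to 2^(-(h + 1) * 8 / num_heads_pow2), i.e. it hardcodes alibi_bias_max = 8; doubling the exponent doubles that to 16. A standalone sketch of the generalized computation (illustrative only; the function name and the explicit alibi_bias_max parameter are mine, FT keeps the constant inline):

    #include <cmath>

    // ALiBi slope for head h (0-indexed), as MPT parameterizes it:
    //     slope_h = 2^(-(h + 1) * alibi_bias_max / num_heads_pow2)
    // where num_heads_pow2 is n_heads rounded up to the next power of two.
    // FT's original line is the special case alibi_bias_max = 8, since
    //     0.5^(0.5^(log2(n) - 3)) = 0.5^(8 / n) = 2^(-8 / n).
    // Raising that base to (h + 1) * 2 instead of (h + 1) doubles the
    // exponent, yielding alibi_bias_max = 16 as StoryWriter's config requires.
    inline float alibi_slope(int h, int num_heads_pow2, float alibi_bias_max) {
        return std::exp2(-(h + 1) * alibi_bias_max
                         / static_cast<float>(num_heads_pow2));
    }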

lorabit110 avatar Jun 28 '23 05:06 lorabit110

I am running into this same issue. The provided script for converting the StoryWriter HF checkpoint to FT leads to this error:

RuntimeError: clip_qkv is enabled for this MPT model. This may not work as expected in FT. Use --force to force a conversion.

zacharyblank avatar Jun 28 '23 22:06 zacharyblank

> I am running into this same issue. The provided script for converting the StoryWriter HF checkpoint to FT leads to this error:
>
> RuntimeError: clip_qkv is enabled for this MPT model. This may not work as expected in FT. Use --force to force a conversion.

You can just pass --force to export the model and use a modified FasterTransformer (following my comment above).
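
A hypothetical invocation (the script path and all flags other than --force are from memory and may differ across llm-foundry versions; --force itself is the flag named in the error message):

    python scripts/inference/convert_hf_mpt_to_ft.py \
        --name_or_dir mosaicml/mpt-7b-storywriter \
        --output_dir mpt-7b-storywriter-ft \
        --infer_gpu_num 1 \
        --force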

lorabit110 avatar Jun 29 '23 05:06 lorabit110

> I am able to get reasonable results from FasterTransformer + MPT-7B-Storywriter with two changes to FasterTransformer

Thanks for documenting this. I'm not sure FasterTransformer would accept a PR for MPT support, given that you need to modify the CUDA code. It would be good to see support for it though, given that the StoryWriter model is an important use case.

casper-hansen avatar Jun 29 '23 22:06 casper-hansen

Closing as I believe this has been fully answered. Please open a new issue if you still have questions.

dakinggg avatar Feb 03 '24 20:02 dakinggg