Daya Khudia
> Is self-attention parallelizable with some code modification?

Yes, but it requires code modifications. See https://pytorch.org/docs/stable/_modules/torch/distributed/tensor/parallel/multihead_attention_tp.html#TensorParallelMultiheadAttention for an example; a rough sketch of the idea is below.
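The linked module shards attention heads across ranks. As a minimal single-process sketch of why that works (illustrative shapes, no real torch.distributed calls): the QKV projections are split column-wise so each rank attends over only its own heads, and the output projection is split row-wise so the partial results just need a sum (an all-reduce in a real multi-GPU setup).

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch, seq, d_model, n_heads, n_ranks = 1, 16, 512, 8, 2
head_dim = d_model // n_heads          # 64
shard = d_model // n_ranks             # 256, i.e. 4 heads per simulated "rank"

x = torch.randn(batch, seq, d_model)
w_q = torch.randn(d_model, d_model)
w_k = torch.randn(d_model, d_model)
w_v = torch.randn(d_model, d_model)
w_o = torch.randn(d_model, d_model)

out = torch.zeros(batch, seq, d_model)
for rank in range(n_ranks):
    cols = slice(rank * shard, (rank + 1) * shard)
    # Column-parallel QKV: each rank only materializes the weights for its heads.
    q = (x @ w_q[:, cols]).view(batch, seq, -1, head_dim).transpose(1, 2)
    k = (x @ w_k[:, cols]).view(batch, seq, -1, head_dim).transpose(1, 2)
    v = (x @ w_v[:, cols]).view(batch, seq, -1, head_dim).transpose(1, 2)
    # Attention over this rank's heads runs fully independently of other ranks.
    attn = F.scaled_dot_product_attention(q, k, v)
    attn = attn.transpose(1, 2).reshape(batch, seq, shard)
    # Row-parallel output projection; summing partials stands in for an all-reduce.
    out = out + attn @ w_o[cols, :]
```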
@dakinggg yes.
How is it parallelizing the model?
As you pointed out, StoryWriter has qkv_clip and currently doesn't work with FT. We have two options: 1) add clipping support in FT, or 2) create a fine-tuned version of StoryWriter...
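For context, a rough sketch of what option 1 amounts to, assuming qkv_clip clamps the fused QKV activations the way MPT's clip_qkv does; the helper name and default clip value are illustrative, not FT code:

```python
import torch

def fused_qkv_with_clip(x: torch.Tensor, w_qkv: torch.Tensor, clip_qkv: float = 6.0) -> torch.Tensor:
    """Fused QKV projection followed by the elementwise clamp FT would need to support."""
    qkv = x @ w_qkv                                  # (batch, seq, 3 * d_model)
    return qkv.clamp(min=-clip_qkv, max=clip_qkv)    # clip Q, K, V activations
```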
top_k is 1, which does greedy search. Maybe use top_k = 30 and play around with the temperature.
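For reference, a minimal Hugging Face `generate` sketch of greedy decoding vs. top-k sampling with temperature (the checkpoint name and exact values are illustrative):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "mosaicml/mpt-7b"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, trust_remote_code=True)

inputs = tokenizer("Once upon a time", return_tensors="pt")

# top_k=1 / do_sample=False: always pick the argmax token (greedy search).
greedy = model.generate(**inputs, max_new_tokens=64, do_sample=False)

# top_k=30 with a temperature: sample from the 30 most likely tokens;
# temperature < 1 sharpens and > 1 flattens the distribution.
sampled = model.generate(
    **inputs,
    max_new_tokens=64,
    do_sample=True,
    top_k=30,
    temperature=0.8,
)
print(tokenizer.decode(sampled[0], skip_special_tokens=True))
```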
@savemuri : Any reason why you have use_gpt_decoder_ops set to True? We haven't tried running through this path. Also, could you try output_len=256 with the converted model?
@mantrakp2004 : Inference doesn't have public endpoints. The only public way to interact with these models is through the HF interface. For example, https://huggingface.co/spaces/mosaicml/mpt-30b-chat. For private production-scale usage, please get...
@nik-mosaic Could you take a look at this please?