Daya Khudia
> Is self-attention parallelizable with some code modification?

Yes, but it requires code modifications. See https://pytorch.org/docs/stable/_modules/torch/distributed/tensor/parallel/multihead_attention_tp.html#TensorParallelMultiheadAttention for an example; a rough sketch of the idea is below.
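The linked module shards attention heads across ranks. As a minimal single-process sketch of why that works (illustrative shapes, no real torch.distributed calls): the QKV projections are split column-wise so each rank attends over only its own heads, and the output projection is split row-wise so the partial results just need a sum (an all-reduce in a real multi-GPU setup).

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch, seq, d_model, n_heads, n_ranks = 1, 16, 512, 8, 2
head_dim = d_model // n_heads          # 64
shard = d_model // n_ranks             # 256, i.e. 4 heads per simulated "rank"

x = torch.randn(batch, seq, d_model)
w_q = torch.randn(d_model, d_model)
w_k = torch.randn(d_model, d_model)
w_v = torch.randn(d_model, d_model)
w_o = torch.randn(d_model, d_model)

out = torch.zeros(batch, seq, d_model)
for rank in range(n_ranks):
    cols = slice(rank * shard, (rank + 1) * shard)
    # Column-parallel QKV: each rank only materializes the weights for its heads.
    q = (x @ w_q[:, cols]).view(batch, seq, -1, head_dim).transpose(1, 2)
    k = (x @ w_k[:, cols]).view(batch, seq, -1, head_dim).transpose(1, 2)
    v = (x @ w_v[:, cols]).view(batch, seq, -1, head_dim).transpose(1, 2)
    # Attention over this rank's heads runs fully independently of other ranks.
    attn = F.scaled_dot_product_attention(q, k, v)
    attn = attn.transpose(1, 2).reshape(batch, seq, shard)
    # Row-parallel output projection; summing partials stands in for an all-reduce.
    out = out + attn @ w_o[cols, :]
```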
@dakinggg yes.
How is it parallelizing the model?
As you pointed out, StoryWriter has qkv_clip and currently doesn't work with FT. We have two options: 1) add clipping support in FT, or 2) create a fine-tuned version of StoryWriter...
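For context, a rough sketch of what option 1 amounts to, assuming qkv_clip clamps the fused QKV activations the way MPT's clip_qkv does; the helper name and default clip value are illustrative, not FT code:

```python
import torch

def fused_qkv_with_clip(x: torch.Tensor, w_qkv: torch.Tensor, clip_qkv: float = 6.0) -> torch.Tensor:
    """Fused QKV projection followed by the elementwise clamp FT would need to support."""
    qkv = x @ w_qkv                                  # (batch, seq, 3 * d_model)
    return qkv.clamp(min=-clip_qkv, max=clip_qkv)    # clip Q, K, V activations
```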
top_k is 1, which does greedy search. Maybe use top_k = 30 and play around with the temperature.
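For reference, a minimal Hugging Face `generate` sketch of greedy decoding vs. top-k sampling with temperature (the checkpoint name and exact values are illustrative):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "mosaicml/mpt-7b"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, trust_remote_code=True)

inputs = tokenizer("Once upon a time", return_tensors="pt")

# top_k=1 / do_sample=False: always pick the argmax token (greedy search).
greedy = model.generate(**inputs, max_new_tokens=64, do_sample=False)

# top_k=30 with a temperature: sample from the 30 most likely tokens;
# temperature < 1 sharpens and > 1 flattens the distribution.
sampled = model.generate(
    **inputs,
    max_new_tokens=64,
    do_sample=True,
    top_k=30,
    temperature=0.8,
)
print(tokenizer.decode(sampled[0], skip_special_tokens=True))
```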
@savemuri : Any reason why you have use_gpt_decoder_ops set to True? We haven't tried running through this path. Also, could you try output_len=256 with the converted model?
@mantrakp2004 : Inference doesn't have public endpoints. The only public way to interact with these models is through the HF interface. For example, https://huggingface.co/spaces/mosaicml/mpt-30b-chat. For private production-scale usage, please get...
@nik-mosaic Could you take a look at this please?