Parth Thakkar
@pai4451 while that'll work for the Python backend, it won't work for the FasterTransformer backend without significant changes to its code. The proposed approach above should work for both, I think. Although...
Now that I've been working with this a little more, I see why this is an issue. Ragged batches are going to be pretty common and we need to do...
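For context, a rough sketch of the padding bookkeeping that a ragged batch requires; the function and names here are illustrative, not actual backend code:

```python
import numpy as np

def pad_ragged_batch(token_id_lists, pad_id=0):
    """Pad a ragged batch of token-id sequences to a rectangular array.

    Returns the padded array plus the true lengths, which the backend
    needs so that padding tokens don't influence generation.
    """
    max_len = max(len(ids) for ids in token_id_lists)
    batch = np.full((len(token_id_lists), max_len), pad_id, dtype=np.int32)
    lengths = np.empty(len(token_id_lists), dtype=np.int32)
    for i, ids in enumerate(token_id_lists):
        batch[i, : len(ids)] = ids
        lengths[i] = len(ids)
    return batch, lengths

# Example: three prompts of different lengths batched together.
padded, lens = pad_ragged_batch([[5, 9, 2], [7], [3, 3, 3, 3]])
# padded.shape == (3, 4); lens == [3, 1, 4]
```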
Pinging on this one. Would love to have more features related to virtual desktops: shortcuts for switching, being able to re-arrange them, moving windows from one to another, etc....
Hi @xunfeng1980, the Python model doesn't support logprobs yet. If you set the logprobs option to null, you won't hit this issue. I'll keep this open until logprobs support is in.
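For anyone hitting this in the meantime, a minimal sketch of the workaround, assuming a default FauxPilot setup (the host, port, and engine name below may differ for your deployment):

```python
import requests

# Assumed default FauxPilot endpoint; adjust host/port/engine for your setup.
resp = requests.post(
    "http://localhost:5000/v1/engines/codegen/completions",
    json={
        "prompt": "def hello():",
        "max_tokens": 16,
        "temperature": 0.1,
        "logprobs": None,  # keep this null until the Python model supports logprobs
    },
)
print(resp.json()["choices"][0]["text"])
```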
I think the best way to approach this would be to implement different classes for different model families. I can think of 3 families: 1. CausalLMs: models that can be...
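Roughly what I have in mind, as a hedged sketch: the class and method names are hypothetical, only the CausalLM family is shown, and an HF-style model/tokenizer pair is assumed.

```python
from abc import ABC, abstractmethod

class ModelFamily(ABC):
    """Base class: each model family implements its own generation logic."""

    @abstractmethod
    def generate(self, prompt: str, max_tokens: int) -> str: ...

class CausalLM(ModelFamily):
    """Family 1: decoder-only models that continue a prompt left to right."""

    def __init__(self, model, tokenizer):
        # Assumption: HuggingFace-style model and tokenizer objects.
        self.model = model
        self.tokenizer = tokenizer

    def generate(self, prompt: str, max_tokens: int) -> str:
        ids = self.tokenizer(prompt, return_tensors="pt").input_ids
        out = self.model.generate(ids, max_new_tokens=max_tokens)
        # Decode only the newly generated tokens, not the prompt.
        return self.tokenizer.decode(out[0][ids.shape[1]:])
```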
Hey @ankit-db I think the config.pbtxt file just comes with the fauxpilot repository for the 1- and 2-GPU variants. For other variants, I think the `./converter/triton_config_gen.py` script should be invoked. I...
Hey @ankit-db I was able to generate the config.pbtxt by doing the following: 1. Modify the triton_config_gen.py file at line 59, changing `params['name'] = model_name` to `params['name'] = "codegen-350M-multi"` (or...
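Concretely, the edit in step 1 is just this one-line change inside triton_config_gen.py (the exact line number may differ across fauxpilot versions):

```python
# Before:
params['name'] = model_name
# After (hard-code the model you converted):
params['name'] = "codegen-350M-multi"
```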
Hey, yeah I was planning to use this for benchmarking the 4-bit performance of the codegen models. Most of the prompts I have are 1500 tokens or more, and these overflow...
Thanks! I just created a PR here to allow pretokenized inputs: https://github.com/ravenscroftj/ggml/pull/2. It seems to work fine for me.
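To sketch the producing side of this: tokenize the long prompt ahead of time and hand the ids over. The tokenizer choice and the dump format below are assumptions for illustration; the actual input format the patched example expects is defined in the PR.

```python
from transformers import AutoTokenizer

# Assumption: the CodeGen tokenizer; the id format expected by the patched
# ggml example may differ -- check the PR for the real format.
tok = AutoTokenizer.from_pretrained("Salesforce/codegen-350M-multi")
ids = tok.encode(open("prompt.py").read())
print(len(ids), "tokens")  # prompts of 1500+ tokens are the motivation here

with open("prompt.tokens", "w") as f:
    f.write(" ".join(map(str, ids)))
```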
Thanks! I have performed a preliminary evaluation of the 6B 4-bit model on Python. I ran the model on ~2000 code-completion scenarios (from a custom dataset of mine) and...
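For reference, the scoring loop for a dataset like this can be as simple as the sketch below; the (prompt, expected) format and the exact-match criterion are hypothetical, not the actual dataset or metric used above.

```python
def evaluate(model_fn, scenarios):
    """Fraction of scenarios where the completion exactly matches the expected text.

    `scenarios` is an iterable of (prompt, expected) pairs; `model_fn`
    maps a prompt string to a completion string.
    """
    hits = 0
    total = 0
    for prompt, expected in scenarios:
        total += 1
        if model_fn(prompt).strip() == expected.strip():
            hits += 1
    return hits / total if total else 0.0
```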