Nicolas Patry

978 comments by Nicolas Patry

This model seems to be sharing its gate_proj; however, the modeling code doesn't reflect that: https://huggingface.co/baichuan-inc/baichuan-7B/blob/main/modeling_baichuan.py Not sure if it's intentional.
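To verify that kind of weight sharing, here's a minimal sketch (the checkpoint filename and the `data_ptr`-based check are my assumptions, not something from the original thread):

```python
import torch

# Load the raw checkpoint (hypothetical local filename).
state_dict = torch.load("pytorch_model.bin", map_location="cpu")

# Two parameters that are truly tied point at the same underlying storage,
# so identical data pointers reveal the sharing.
seen = {}
for name, tensor in state_dict.items():
    ptr = tensor.data_ptr()
    if ptr in seen:
        print(f"{name} shares storage with {seen[ptr]}")
    else:
        seen[ptr] = name
```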

Hey @Atry, thanks for the contribution. Do you mind sharing a bit more about the problem this is trying to solve?

Hey, do you know about https://huggingface.co/docs/peft/main/en/package_reference/tuners#peft.LoraModel.merge_and_unload? Basically, you could

```python
model = model.merge_and_unload()
model.save_pretrained("mynewmergedmodel")
```

which will "write" the PEFT weights directly into the model, making it a regular transformer...
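For context, a fuller end-to-end sketch of that workflow (the model id and adapter path here are hypothetical placeholders):

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Load the base model, then attach the trained LoRA adapter on top of it.
base = AutoModelForCausalLM.from_pretrained("base-model-id")      # hypothetical id
model = PeftModel.from_pretrained(base, "path/to/lora-adapter")   # hypothetical path

# Fold the adapter weights into the base weights and drop the PEFT wrappers.
model = model.merge_and_unload()
model.save_pretrained("mynewmergedmodel")
```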

Not in latency (depends on the benchmark/hardware, but it is basically on par). PagedAttention seems to be nicer with respect to VRAM usage, meaning it's better when you're low on...
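To give a sense of the VRAM at stake, a back-of-the-envelope KV-cache calculation (the config numbers are illustrative assumptions, roughly 7B-class):

```python
# Illustrative config: all numbers are assumptions, not from the thread.
layers, kv_heads, head_dim = 32, 32, 128
seq_len, batch, dtype_bytes = 2048, 8, 2  # fp16

# One K and one V tensor per layer, per token, per sequence in the batch.
kv_cache_bytes = 2 * layers * kv_heads * head_dim * seq_len * batch * dtype_bytes
print(f"KV cache: {kv_cache_bytes / 2**30:.1f} GiB")  # 8.0 GiB
```

Pre-allocating that worst case per request wastes most of it on short sequences; paging hands out the cache in small blocks on demand, which is where the VRAM win comes from.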

> In this scenario, do you think it makes sense to shard over 2 GPUs a model that can fit in a single GPU, paying the sharding latency price chasing...

Which model is it? The tool is trying to convert the training parameters, which are not convertible. We will just need to skip them.
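For illustration, a minimal sketch of skipping non-convertible entries when converting a checkpoint to safetensors (the filenames are placeholders, and the real conversion tool does more than this):

```python
import torch
from safetensors.torch import save_file

# Hypothetical input checkpoint containing both weights and training state.
ckpt = torch.load("pytorch_model.bin", map_location="cpu")

# Keep only plain tensors; optimizer/training state isn't convertible.
tensors = {k: v.contiguous() for k, v in ckpt.items() if isinstance(v, torch.Tensor)}
skipped = [k for k in ckpt if k not in tensors]
print("skipped non-tensor entries:", skipped)

save_file(tensors, "model.safetensors")
```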

Indeed, there's a training file here: https://github.com/huggingface/text-generation-inference/pull/485

Do you mind opening an issue directly in https://github.com/huggingface/chat-ui, since that's where the issue seems to be? We don't really know what's going on, but it seems that...