1096125073 comments

Results: 8 comments by 1096125073

output from memory_efficient_attention not exactly the same with pytorch equivalent implementation

I’m experiencing the same problems

add support for telechat2

Hi, TeleChat2 is a language model developed independently by China Telecom AI Company. This PR adds support for it so that users can apply the AWQ algorithm to the model.

add support for telechat2

https://huggingface.co/Tele-AI/TeleChat2-7B-32K

The engine generated by each build has different results for the same input.

Is there any way to ensure that the engine generated by each build is identical? This is important for engineering deployment.

batch inference is different with single

I disabled custom_all_reduce when building the engine.

batch inference is different with single

> Hi @1096125073, since different batch sizes may lead to different kernels, the results can be different. This is a known issue.

Thank you for your answer! I'm...
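One common reason different kernels give slightly different outputs is that floating-point addition is not associative, so a kernel that groups its partial sums differently can round differently. The sketch below only illustrates that effect in plain Python. The helper `chunked_sum` and the sample values are hypothetical and are not taken from any real kernel.

```python
def chunked_sum(values, chunk_size):
    """Add values in fixed-size chunks, then add the partial sums in order.

    This loosely mimics how a kernel might split a reduction into tiles.
    """
    total = 0.0
    for start in range(0, len(values), chunk_size):
        partial = 0.0
        for v in values[start:start + chunk_size]:
            partial = partial + v
        total = total + partial
    return total


# Same three numbers, two groupings: the rounded results differ.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)  # False

# Same six numbers reduced with two different tile sizes.
values = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
by_two = chunked_sum(values, 2)
by_three = chunked_sum(values, 3)
print(by_two, by_three)
```

Differences of this kind are tiny per operation, but they can accumulate across layers and occasionally flip a sampled or argmax token.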

batch inference is different with single

> @1096125073 Yes, I get your point: repeat the same input prompt 4 times, and make it a batch, but the outputs are different from batch size 1. Unfortunately, it's...

batch inference is different with single

> @1096125073 Do you use multiple GPUs? If you use multi-GPU, you can use NCCL_ALGO=Tree to ensure a stable reduce order. NCCL usually selects the Ring algo, which has an unstable reduce order,...
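As a rough sketch of how the suggestion in the quoted reply could be applied: `NCCL_ALGO` is an NCCL environment variable, so it has to be set in the environment of the process that initializes NCCL. The helper name `nccl_env` and the launch command in the comment are hypothetical; adapt them to your own run script.

```python
import os


def nccl_env(algo="Tree", base=None):
    """Return a copy of the environment with NCCL_ALGO set.

    `base` defaults to the current process environment; the input
    mapping is copied, not modified.
    """
    env = dict(os.environ if base is None else base)
    env["NCCL_ALGO"] = algo
    return env


env = nccl_env()
print(env["NCCL_ALGO"])  # Tree

# Hypothetical launch; the script name and arguments are placeholders:
# import subprocess
# subprocess.run(["mpirun", "-n", "2", "python", "run.py"], env=env, check=True)
```

Setting the variable in the shell before launching (`NCCL_ALGO=Tree` in front of the command) should have the same effect. Whether this fully removes the batch-versus-single difference is not confirmed by the thread.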