Horace He

It is not so difficult to modify it to support batched use cases, but supporting dynamic batching is quite a bit more work. If you really want continuous batching I would...

We haven't tested on V100s so I'm not sure. I thought it worked but haven't checked.

I actually tried it just now. The issue is that the V100 has poor bfloat16 support. If you just change all the bfloat16 instances to float16, it should work.
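
(If it helps, a minimal sketch of doing that switch automatically; `pick_dtype` is a made-up helper, not part of gpt-fast:)

```python
import torch

def pick_dtype() -> torch.dtype:
    # bfloat16 is only well supported on Ampere+ (compute capability >= 8.0);
    # the V100 is 7.0, so fall back to float16 there.
    major, _ = torch.cuda.get_device_capability()
    return torch.bfloat16 if major >= 8 else torch.float16

# e.g. cast a model's weights to the chosen dtype
model = torch.nn.Linear(4096, 4096, device="cuda").to(dtype=pick_dtype())
```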

Yeah, generally speaking, when you go from int8 to int4, the "theoretical" speedup should be 2x. But in practice, due to Amdahl's-law-type reasons, you end up getting bottlenecked...
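
(A back-of-the-envelope sketch of the Amdahl's-law point; the 80% weight-load fraction is made up for illustration:)

```python
def amdahl_speedup(p: float, s: float) -> float:
    # p: fraction of runtime that gets faster; s: speedup on that fraction
    return 1.0 / ((1.0 - p) + p / s)

# If weight loads are 80% of decode time and int4 halves them (s = 2),
# the end-to-end speedup is ~1.67x, not the "theoretical" 2x.
print(amdahl_speedup(p=0.8, s=2.0))
```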

Compilation will significantly reduce the tensor-parallel latency. In general, gpt-fast will not be particularly fast without using compilation :P
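
(A minimal illustration of what "using compilation" means here, not gpt-fast's exact code: wrapping the per-token decode step in `torch.compile`. `mode="reduce-overhead"` turns on CUDA graphs, which is where much of the decode-latency win comes from.)

```python
import torch

model = torch.nn.Linear(4096, 4096, device="cuda", dtype=torch.float16)

# "reduce-overhead" captures the step in a CUDA graph, removing the
# per-kernel launch overhead that otherwise dominates small decode steps.
@torch.compile(mode="reduce-overhead")
def decode_step(x: torch.Tensor) -> torch.Tensor:
    return model(x)

out = decode_step(torch.randn(1, 4096, device="cuda", dtype=torch.float16))
```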

@shunting314 This is what the perf dashboard looks like with `memory_budget = 0.5`: https://hud.pytorch.org/benchmark/compilers?startTime=Wed%2C%2029%20May%202024%2020%3A26%3A17%20GMT&stopTime=Wed%2C%2005%20Jun%202024%2020%3A26%3A17%20GMT&granularity=hour&suite=torchbench&mode=training&dtype=amp&lBranch=gh/chillee/303/head&lCommit=d9c93cab3fb20e5648b35e24a34a05cedc199479&rBranch=main&rCommit=6d21685b45336b793b1172a7ce76b0bf3876eebf Not sure why it doesn't increase HF memory that much - perhaps because it's a single...
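
(For anyone wanting to try this locally, a minimal sketch of setting the budget. The knob name `torch._functorch.config.activation_memory_budget` is my reading of where this lives on recent main; treat it as an assumption.)

```python
import torch
import torch._functorch.config as functorch_config

# Assumed knob (see note above): 1.0 keeps the default partitioning;
# lower values ask the partitioner to recompute more activations in the
# backward pass so activation memory stays under the budget.
functorch_config.activation_memory_budget = 0.5

model = torch.compile(torch.nn.Linear(1024, 1024, device="cuda"))
model(torch.randn(8, 1024, device="cuda")).sum().backward()
```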

@pytorchbot merge -i "flaky test"