Horace He

It is not so difficult to modify it to support batched use cases, but supporting dynamic batching is quite a bit more work. If you really want continuous batching I would...

We haven't tested on V100s so I'm not sure. I thought it worked but haven't checked.

I actually tried it just now. The issue is that the V100 has poor bfloat16 support. If you just change all the bfloat16 instances to float16, it should work.
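
(If it helps, a minimal sketch of doing that switch automatically; `pick_dtype` is a made-up helper, not part of gpt-fast:)

```python
import torch

def pick_dtype() -> torch.dtype:
    # bfloat16 is only well supported on Ampere+ (compute capability >= 8.0);
    # the V100 is 7.0, so fall back to float16 there.
    major, _ = torch.cuda.get_device_capability()
    return torch.bfloat16 if major >= 8 else torch.float16

# e.g. cast a model's weights to the chosen dtype
model = torch.nn.Linear(4096, 4096, device="cuda").to(dtype=pick_dtype())
```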

Yeah, generally speaking, when you go from int8 to int4, the "theoretical" speedup should be 2x. But in practice, due to Amdahl's-law-type reasons, you end up getting bottlenecked...
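
(A back-of-the-envelope sketch of the Amdahl's-law point; the 80% weight-load fraction is made up for illustration:)

```python
def amdahl_speedup(p: float, s: float) -> float:
    # p: fraction of runtime that gets faster; s: speedup on that fraction
    return 1.0 / ((1.0 - p) + p / s)

# If weight loads are 80% of decode time and int4 halves them (s = 2),
# the end-to-end speedup is ~1.67x, not the "theoretical" 2x.
print(amdahl_speedup(p=0.8, s=2.0))
```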

Compilation will significantly reduce the tensor-parallel latency. In general, gpt-fast will not be particularly fast without using compilation :P
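
(A minimal illustration of what "using compilation" means here, not gpt-fast's exact code: wrapping the per-token decode step in `torch.compile`. `mode="reduce-overhead"` turns on CUDA graphs, which is where much of the decode-latency win comes from.)

```python
import torch

model = torch.nn.Linear(4096, 4096, device="cuda", dtype=torch.float16)

# "reduce-overhead" captures the step in a CUDA graph, removing the
# per-kernel launch overhead that otherwise dominates small decode steps.
@torch.compile(mode="reduce-overhead")
def decode_step(x: torch.Tensor) -> torch.Tensor:
    return model(x)

out = decode_step(torch.randn(1, 4096, device="cuda", dtype=torch.float16))
```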

@shunting314 This is what the perf dashboard looks like with `memory_budget = 0.5`: https://hud.pytorch.org/benchmark/compilers?startTime=Wed%2C%2029%20May%202024%2020%3A26%3A17%20GMT&stopTime=Wed%2C%2005%20Jun%202024%2020%3A26%3A17%20GMT&granularity=hour&suite=torchbench&mode=training&dtype=amp&lBranch=gh/chillee/303/head&lCommit=d9c93cab3fb20e5648b35e24a34a05cedc199479&rBranch=main&rCommit=6d21685b45336b793b1172a7ce76b0bf3876eebf Not sure why it doesn't increase HF memory that much - perhaps because it's a single...
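
(For anyone wanting to try this locally, a minimal sketch of setting the budget. The knob name `torch._functorch.config.activation_memory_budget` is my reading of where this lives on recent main; treat it as an assumption.)

```python
import torch
import torch._functorch.config as functorch_config

# Assumed knob (see note above): 1.0 keeps the default partitioning;
# lower values ask the partitioner to recompute more activations in the
# backward pass so activation memory stays under the budget.
functorch_config.activation_memory_budget = 0.5

model = torch.compile(torch.nn.Linear(1024, 1024, device="cuda"))
model(torch.randn(8, 1024, device="cuda")).sum().backward()
```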

@pytorchbot merge -i "flaky test"