Optimize Int8 Woq for CPU
This PR optimizes int8 weight-only quantization (WOQ) for CPU in both gpt-fast and mixtral-moe.
At the current stage, we use torch.ops.aten._weight_int8pack_mm as a workaround; this workaround will be removed once https://github.com/pytorch/pytorch/pull/120985 is merged into a PyTorch stable release. Meanwhile, this PR updates the int8 weight dimensions to match torch.ops.aten._weight_int8pack_mm as introduced in https://github.com/pytorch/pytorch/pull/118056, and adds CPU profiling.
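For context, here is a minimal sketch of how the fallback op is invoked. The shapes follow the op's schema (activation `[M, K]`, int8 weight stored as `[N, K]`, per-output-channel scales `[N]`); the concrete sizes below are illustrative, and exact dtype support may vary by PyTorch version:

```python
import torch

# Shapes per the aten::_weight_int8pack_mm schema:
#   x:      [M, K] activation (bf16/fp16/fp32)
#   weight: [N, K] int8 quantized weight (note: stored untransposed)
#   scales: [N]    per-output-channel scales, same dtype as x
M, K, N = 4, 64, 32
x = torch.randn(M, K, dtype=torch.bfloat16)
weight = torch.randint(-128, 127, (N, K), dtype=torch.int8)
scales = torch.rand(N, dtype=torch.bfloat16)

# Roughly equivalent to (x @ weight.to(x.dtype).t()) * scales,
# but dispatches to the fused int8 weight-only matmul kernel.
y = torch.ops.aten._weight_int8pack_mm(x, weight, scales)
assert y.shape == (M, N)
```

This is why the PR reorders the stored int8 weight: the op expects the weight laid out as `[N, K]` rather than the transposed layout used by a plain matmul.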
@HDCharles could you please take a look? Thanks!
Hi @yanboliang, could you please take a look? Thanks!