Optimize Int8 Woq for CPU
This PR optimizes int8 weight-only quantization (WOQ) for CPU in both gpt-fast and mixtral-moe.
At the current stage, we use torch.ops.aten._weight_int8pack_mm as a workaround; this workaround will be removed once https://github.com/pytorch/pytorch/pull/120985 is merged into a PyTorch stable release. Meanwhile, this PR updates the int8 weight dimensions to match torch.ops.aten._weight_int8pack_mm as introduced in https://github.com/pytorch/pytorch/pull/118056, and adds CPU profiling.
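For context, here is a minimal sketch of how the fallback op is invoked. The shapes follow the op's schema (activation `[M, K]`, int8 weight stored as `[N, K]`, per-output-channel scales `[N]`); the concrete sizes below are illustrative, and exact dtype support may vary by PyTorch version:

```python
import torch

# Shapes per the aten::_weight_int8pack_mm schema:
#   x:      [M, K] activation (bf16/fp16/fp32)
#   weight: [N, K] int8 quantized weight (note: stored untransposed)
#   scales: [N]    per-output-channel scales, same dtype as x
M, K, N = 4, 64, 32
x = torch.randn(M, K, dtype=torch.bfloat16)
weight = torch.randint(-128, 127, (N, K), dtype=torch.int8)
scales = torch.rand(N, dtype=torch.bfloat16)

# Roughly equivalent to (x @ weight.to(x.dtype).t()) * scales,
# but dispatches to the fused int8 weight-only matmul kernel.
y = torch.ops.aten._weight_int8pack_mm(x, weight, scales)
assert y.shape == (M, N)
```

This is why the PR reorders the stored int8 weight: the op expects the weight laid out as `[N, K]` rather than the transposed layout used by a plain matmul.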
@HDCharles could you please take a look? Thanks!
Hi @yanboliang, could you please take a look? Thanks!