Enable weight compression in GPU

Open jinhongyii opened this issue 1 year ago • 0 comments

This PR enables weight compression on the GPU. Previously, weight compression ran on the CPU because the uncompressed weights were too large to fit in GPU memory, and running on the CPU is quite slow in the fp16 case. Now we switch to the GPU. To fit the uncompressed weights into GPU memory we use lazy loading: each weight is loaded right before its first use and freed immediately after its last use.
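The lazy-loading idea can be illustrated with a minimal Python sketch. This is not the code in this PR; `last_use_index`, `run_lazily`, and the `load` and `compress` callbacks are hypothetical names used only for illustration, and a plain dict stands in for GPU memory. Each step lists the weights it reads; a weight is loaded the first time a step needs it and dropped right after the last step that reads it.

```python
from typing import Any, Callable, Dict, List, Sequence, Tuple


def last_use_index(steps: Sequence[Sequence[str]]) -> Dict[str, int]:
    """Map each weight name to the index of the last step that reads it."""
    last: Dict[str, int] = {}
    for i, names in enumerate(steps):
        for name in names:
            last[name] = i
    return last


def run_lazily(
    steps: Sequence[Sequence[str]],
    load: Callable[[str], Any],             # hypothetical: fetches one weight
    compress: Callable[[List[Any]], Any],   # hypothetical: per-step compression
) -> Tuple[List[Any], int]:
    """Run every step, keeping a weight resident only between first and last use.

    Returns the per-step outputs and the peak number of resident weights.
    """
    last = last_use_index(steps)
    resident: Dict[str, Any] = {}  # stand-in for weights currently in GPU memory
    outputs: List[Any] = []
    peak = 0
    for i, names in enumerate(steps):
        # Load right before first use.
        for name in names:
            if name not in resident:
                resident[name] = load(name)
        peak = max(peak, len(resident))
        outputs.append(compress([resident[n] for n in names]))
        # Free immediately after last use.
        for name in names:
            if last[name] == i:
                resident.pop(name, None)
    return outputs, peak
```

With steps such as `[["w0"], ["w1"], ["w0", "w2"], ["w3"]]`, only two weights are resident at any time even though four are processed in total, which is the property that lets the uncompressed weights fit in limited memory.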

cc: @tqchen @MasterJH5574

jinhongyii, May 03 '23 01:05