David MacLeod
Agree that this would be very useful
> Can you share a code snippet you used for loading GPT? Also, currently, DS-inference uses fp16 special CUDA kernels for inference which is not the case for int8. int8...
@yaozhewei any news on this?
Thanks @yaozhewei! Do you know whether there is a rough timeline for this? e.g. 1 month, 6 months, 1 year? It would be very useful to know as we'd like...
Are there any developments here? If I were to contribute this change, would it be considered? Would an environment variable or a CLI arg be more appropriate here for disabling...
Any updates on this? Thanks.
Hi @blefaudeux, I will share some timings soon but it initially looks promising, primarily because `torch.jit` appears to be able to fuse `apply_rotary_pos_emb` into a single kernel for the non-autograd...
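For reference, a minimal sketch of the kind of scripted helper I'm timing (the real code differs; `rotate_half`, the shapes, and the dtypes here are illustrative assumptions):

```python
import torch

@torch.jit.script
def rotate_half(x: torch.Tensor) -> torch.Tensor:
    # Split the last dim in half and rotate: (x1, x2) -> (-x2, x1)
    half = x.size(-1) // 2
    x1 = x.narrow(-1, 0, half)
    x2 = x.narrow(-1, half, half)
    return torch.cat([-x2, x1], dim=-1)

@torch.jit.script
def apply_rotary_pos_emb(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
    # Pointwise math only, so the JIT fuser has a chance to emit a single
    # kernel on the inference (non-autograd) path
    return (x * cos) + (rotate_half(x) * sin)

# Hypothetical shapes: (batch, heads, seq, head_dim) with cos/sin broadcasting
q = torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.float16)
cos = torch.randn(1, 1, 128, 64, device="cuda", dtype=torch.float16)
sin = torch.randn(1, 1, 128, 64, device="cuda", dtype=torch.float16)
with torch.no_grad():
    out = apply_rotary_pos_emb(q, cos, sin)
```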
Thanks for the replies, that makes things a lot clearer! @ptillet why is it that in the softmax tutorial the BLOCK_SIZE is set to the next power of 2 greater...
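To make the question concrete, this is my simplified reading of the tutorial kernel (rows assumed contiguous; names are mine). `tl.arange(0, BLOCK_SIZE)` needs a power-of-2 extent, so the row length gets rounded up and the extra lanes are masked:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def softmax_kernel(out_ptr, in_ptr, n_cols, BLOCK_SIZE: tl.constexpr):
    row = tl.program_id(0)
    cols = tl.arange(0, BLOCK_SIZE)        # tl.arange requires a power-of-2 extent
    mask = cols < n_cols                   # lanes past the row end are inactive
    # masked lanes load -inf, which exp() maps to 0, so the sum is unaffected
    x = tl.load(in_ptr + row * n_cols + cols, mask=mask, other=-float("inf"))
    x = x - tl.max(x, axis=0)              # numerically stable softmax
    num = tl.exp(x)
    out = num / tl.sum(num, axis=0)
    tl.store(out_ptr + row * n_cols + cols, out, mask=mask)

x = torch.randn(4, 1000, device="cuda")    # 1000 cols -> BLOCK_SIZE of 1024
y = torch.empty_like(x)
softmax_kernel[(x.shape[0],)](y, x, x.shape[1],
                              BLOCK_SIZE=triton.next_power_of_2(x.shape[1]))
```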
Thanks @ptillet! I was also wondering whether Triton can currently slice (or otherwise chunk) tensors after they have been loaded into SRAM? Rotary embeds include an op...
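To make the question concrete, here's the workaround I have in mind, sketched under my own assumptions (contiguous `(seq, dim)` layout, `dim` even, all names mine): issue two `tl.load`s at different base offsets rather than slicing a tensor that is already resident in SRAM.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def rotary_kernel(x_ptr, cos_ptr, sin_ptr, out_ptr, dim, BLOCK: tl.constexpr):
    row = tl.program_id(0)                    # one program per sequence position
    half = dim // 2
    offs = tl.arange(0, BLOCK)                # BLOCK: power of 2, >= half
    mask = offs < half
    # the "slice" happens at load time: two loads at different base offsets,
    # instead of chunking a tensor already loaded into SRAM
    x1 = tl.load(x_ptr + row * dim + offs, mask=mask, other=0.0)
    x2 = tl.load(x_ptr + row * dim + half + offs, mask=mask, other=0.0)
    c = tl.load(cos_ptr + row * half + offs, mask=mask, other=0.0)
    s = tl.load(sin_ptr + row * half + offs, mask=mask, other=0.0)
    # rotate-half form: out1 = x1*c - x2*s ; out2 = x2*c + x1*s
    tl.store(out_ptr + row * dim + offs, x1 * c - x2 * s, mask=mask)
    tl.store(out_ptr + row * dim + half + offs, x2 * c + x1 * s, mask=mask)

seq, dim = 128, 64
x = torch.randn(seq, dim, device="cuda")
cos = torch.randn(seq, dim // 2, device="cuda")
sin = torch.randn(seq, dim // 2, device="cuda")
out = torch.empty_like(x)
rotary_kernel[(seq,)](x, cos, sin, out, dim,
                      BLOCK=triton.next_power_of_2(dim // 2))
```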
Thanks @ptillet, just getting round to looking at this again. In the example above, what happens if `x_slice_1` is not divisible by the block size and we end up...
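To pin down what I mean by "not divisible": a minimal 1D sketch (my own naming and layout) where the final block is ragged, so both the load and the store are guarded by the same mask:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def copy_slice_kernel(x_ptr, out_ptr, slice_len, BLOCK: tl.constexpr):
    pid = tl.program_id(0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < slice_len            # the final block is ragged: mask its tail
    x = tl.load(x_ptr + offs, mask=mask, other=0.0)
    tl.store(out_ptr + offs, x, mask=mask)

n = 1000                               # not divisible by BLOCK=256
x = torch.randn(n, device="cuda")
out = torch.empty_like(x)
copy_slice_kernel[(triton.cdiv(n, 256),)](x, out, n, BLOCK=256)
```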