David MacLeod
Agree that this would be very useful
> Can you share a code snippet you used for loading GPT? Also, currently, DS-inference uses fp16 special CUDA kernels for inference which is not the case for int8. int8...
@yaozhewei any news on this?
Thanks @yaozhewei! Do you know whether there is a rough timeline for this? e.g. 1 month, 6 months, 1 year? It would be very useful to know as we'd like...
Are there any developments here? If I were to contribute this change, would it be considered? Would an environment variable or a CLI arg be more appropriate here for disabling...
Any updates on this? Thanks.
Hi @blefaudeux, I will share some timings soon but it initially looks promising, primarily because `torch.jit` appears to be able to fuse `apply_rotary_pos_emb` into a single kernel for the non-autograd...
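For reference, a minimal sketch of the kind of scripted helper I'm timing (the real code differs; `rotate_half`, the shapes, and the dtypes here are illustrative assumptions):

```python
import torch

@torch.jit.script
def rotate_half(x: torch.Tensor) -> torch.Tensor:
    # Split the last dim in half and rotate: (x1, x2) -> (-x2, x1)
    half = x.size(-1) // 2
    x1 = x.narrow(-1, 0, half)
    x2 = x.narrow(-1, half, half)
    return torch.cat([-x2, x1], dim=-1)

@torch.jit.script
def apply_rotary_pos_emb(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
    # Pointwise math only, so the JIT fuser has a chance to emit a single
    # kernel on the inference (non-autograd) path
    return (x * cos) + (rotate_half(x) * sin)

# Hypothetical shapes: (batch, heads, seq, head_dim) with cos/sin broadcasting
q = torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.float16)
cos = torch.randn(1, 1, 128, 64, device="cuda", dtype=torch.float16)
sin = torch.randn(1, 1, 128, 64, device="cuda", dtype=torch.float16)
with torch.no_grad():
    out = apply_rotary_pos_emb(q, cos, sin)
```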
Thanks for the replies, that makes things a lot clearer! @ptillet why is it that in the softmax tutorial the BLOCK_SIZE is set to the next power of 2 greater...
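To make the question concrete, this is my simplified reading of the tutorial kernel (rows assumed contiguous; names are mine). `tl.arange(0, BLOCK_SIZE)` needs a power-of-2 extent, so the row length gets rounded up and the extra lanes are masked:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def softmax_kernel(out_ptr, in_ptr, n_cols, BLOCK_SIZE: tl.constexpr):
    row = tl.program_id(0)
    cols = tl.arange(0, BLOCK_SIZE)        # tl.arange requires a power-of-2 extent
    mask = cols < n_cols                   # lanes past the row end are inactive
    # masked lanes load -inf, which exp() maps to 0, so the sum is unaffected
    x = tl.load(in_ptr + row * n_cols + cols, mask=mask, other=-float("inf"))
    x = x - tl.max(x, axis=0)              # numerically stable softmax
    num = tl.exp(x)
    out = num / tl.sum(num, axis=0)
    tl.store(out_ptr + row * n_cols + cols, out, mask=mask)

x = torch.randn(4, 1000, device="cuda")    # 1000 cols -> BLOCK_SIZE of 1024
y = torch.empty_like(x)
softmax_kernel[(x.shape[0],)](y, x, x.shape[1],
                              BLOCK_SIZE=triton.next_power_of_2(x.shape[1]))
```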
Thanks @ptillet! I was also wondering whether Triton can currently slice (or otherwise chunk) tensors after they have been loaded into SRAM? Rotary embeds include an op...
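To make the question concrete, here's the workaround I have in mind, sketched under my own assumptions (contiguous `(seq, dim)` layout, `dim` even, all names mine): issue two `tl.load`s at different base offsets rather than slicing a tensor that is already resident in SRAM.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def rotary_kernel(x_ptr, cos_ptr, sin_ptr, out_ptr, dim, BLOCK: tl.constexpr):
    row = tl.program_id(0)                    # one program per sequence position
    half = dim // 2
    offs = tl.arange(0, BLOCK)                # BLOCK: power of 2, >= half
    mask = offs < half
    # the "slice" happens at load time: two loads at different base offsets,
    # instead of chunking a tensor already loaded into SRAM
    x1 = tl.load(x_ptr + row * dim + offs, mask=mask, other=0.0)
    x2 = tl.load(x_ptr + row * dim + half + offs, mask=mask, other=0.0)
    c = tl.load(cos_ptr + row * half + offs, mask=mask, other=0.0)
    s = tl.load(sin_ptr + row * half + offs, mask=mask, other=0.0)
    # rotate-half form: out1 = x1*c - x2*s ; out2 = x2*c + x1*s
    tl.store(out_ptr + row * dim + offs, x1 * c - x2 * s, mask=mask)
    tl.store(out_ptr + row * dim + half + offs, x2 * c + x1 * s, mask=mask)

seq, dim = 128, 64
x = torch.randn(seq, dim, device="cuda")
cos = torch.randn(seq, dim // 2, device="cuda")
sin = torch.randn(seq, dim // 2, device="cuda")
out = torch.empty_like(x)
rotary_kernel[(seq,)](x, cos, sin, out, dim,
                      BLOCK=triton.next_power_of_2(dim // 2))
```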
Thanks @ptillet, just getting round to looking at this again. In the example above, what happens if `x_slice_1` is not divisible by the block size and we end up...
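To pin down what I mean by "not divisible": a minimal 1D sketch (my own naming and layout) where the final block is ragged, so both the load and the store are guarded by the same mask:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def copy_slice_kernel(x_ptr, out_ptr, slice_len, BLOCK: tl.constexpr):
    pid = tl.program_id(0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < slice_len            # the final block is ragged: mask its tail
    x = tl.load(x_ptr + offs, mask=mask, other=0.0)
    tl.store(out_ptr + offs, x, mask=mask)

n = 1000                               # not divisible by BLOCK=256
x = torch.randn(n, device="cuda")
out = torch.empty_like(x)
copy_slice_kernel[(triton.cdiv(n, 256),)](x, out, n, BLOCK=256)
```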