Yu Zhang
Hi, I found a large difference in results when using tl.store (under bfloat16). ```py # -*- coding: utf-8 -*- import torch import triton import triton.language as tl @triton.jit def attention_nostore_fwd_kernel( q,...
Hi, thanks for your nice paper and code! I have noticed that the standard split for RCV1 train/test in the original paper is 23,149/781,265. But from the data downloaded from...
Hi all, Thank you for developing this great project. Currently, the implementation naively iterates through all batches until the specified number have been consumed, which can be extremely slow for...
This PR aims to implement a resumable `BufferShuffledExamplesIterable`. Instead of saving the entire buffer content, which is very memory-intensive, the newly implemented `BufferShuffledExamplesIterable` saves only the minimal state necessary for...
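To illustrate why a small checkpoint can be enough, here is a minimal sketch of a simplified block-wise shuffle. It is not the PR's implementation: `BlockShuffler`, its `state_dict`/`load_state_dict` methods, and the per-block seeding scheme are all hypothetical names and choices made for this example, and it assumes the source is an indexable sequence. Each block of `buffer_size` items is shuffled with a seed derived from the block index, so the buffer can be rebuilt on resume and only two integers need to be saved.

```python
import random
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


class BlockShuffler:
    """Hypothetical block-wise shuffler whose checkpoint is two integers."""

    def __init__(self, data: Sequence[T], buffer_size: int, seed: int = 0):
        self.data = data
        self.buffer_size = buffer_size
        self.seed = seed
        self.block = 0   # index of the block currently being served
        self.offset = 0  # items already yielded from that block

    def state_dict(self) -> dict:
        # No buffer contents are stored, only the position.
        return {"block": self.block, "offset": self.offset}

    def load_state_dict(self, state: dict) -> None:
        self.block = state["block"]
        self.offset = state["offset"]

    def __iter__(self) -> Iterator[T]:
        while self.block * self.buffer_size < len(self.data):
            start = self.block * self.buffer_size
            buffer = list(self.data[start:start + self.buffer_size])
            # Seeding per block makes the permutation reproducible on resume.
            random.Random(self.seed * 1_000_003 + self.block).shuffle(buffer)
            while self.offset < len(buffer):
                item = buffer[self.offset]
                self.offset += 1  # advance before yielding so the state is exact
                yield item
            self.block += 1
            self.offset = 0


if __name__ == "__main__":
    data = list(range(10))
    shuffler = BlockShuffler(data, buffer_size=4, seed=1)
    it = iter(shuffler)
    head = [next(it) for _ in range(6)]
    state = shuffler.state_dict()

    resumed = BlockShuffler(data, buffer_size=4, seed=1)
    resumed.load_state_dict(state)
    print(head + list(resumed))  # same order as an uninterrupted pass
```

The trade-off in this sketch is that items only mix within a block, which is weaker than a rolling shuffle buffer; in exchange, resuming costs one block re-read instead of a saved buffer.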
Hello, thank you for this great work. https://github.com/linkedin/Liger-Kernel/blob/acd82728207ebafad28d448640502c108901a967/src/liger_kernel/ops/fused_linear_cross_entropy.py#L69 https://github.com/linkedin/Liger-Kernel/blob/acd82728207ebafad28d448640502c108901a967/src/liger_kernel/ops/fused_linear_cross_entropy.py#L91-L96 I'm wondering if there are any reasons for upcasting/downcasting the logits dtype outside the kernel? If I understand correctly, we already...