Yu Zhang issues

Results: 5 issues of Yu Zhang

Big results difference when using `tl.store`

Hi, I see a large difference in results when using `tl.store` (under bfloat16).

```py
# -*- coding: utf-8 -*-
import torch
import triton
import triton.language as tl


@triton.jit
def attention_nostore_fwd_kernel(
    q,...
```

About train/test split of RCV1

Hi, thanks for your nice paper and code! I have noticed that the standard split for RCV1 train/test in the original paper is 23,149/781,265. But from the data downloaded from...

Improve `skip_first_batches` method to efficiently support `IterableDataset` and `StatefulDataloader`

Hi all, Thank you for developing this great project. Currently, the implementation naively iterates through all batches until the specified number have been consumed, which can be extremely slow for...
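The cost described in this issue can be illustrated with a minimal sketch. It is not the library's actual code: `batches` and `skip_first_batches_naive` are hypothetical names, and the `produced` log only stands in for the real work of loading a batch. The point is that skipping by consumption still forces the stream to produce every skipped batch.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def batches(n_batches: int, batch_size: int, log: List[int]) -> Iterator[List[int]]:
    """Hypothetical streaming source; records every batch index it produces."""
    for i in range(n_batches):
        log.append(i)  # stands in for the real cost of loading batch i
        yield list(range(i * batch_size, (i + 1) * batch_size))


def skip_first_batches_naive(stream: Iterable, num_batches: int) -> Iterator:
    """Skip by consuming: every skipped batch is still fully produced."""
    return islice(stream, num_batches, None)


produced: List[int] = []
resumed = skip_first_batches_naive(batches(5, 2, produced), 3)
first = next(resumed)  # first batch after the skip; batches 0..2 were loaded and discarded
```

A source that can restore its own position (the issue names `StatefulDataloader` as one such case) could avoid this replay, since resuming would not require producing the skipped batches at all.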

Make `BufferShuffledExamplesIterable` resumable

This PR aims to implement a resumable `BufferShuffledExamplesIterable`. Instead of saving the entire buffer content, which is very memory-intensive, the newly implemented `BufferShuffledExamplesIterable` saves only the minimal state necessary for...

Reasons for upcasting the logits dtype outside the kernel

Hello, thank you for this great work.

https://github.com/linkedin/Liger-Kernel/blob/acd82728207ebafad28d448640502c108901a967/src/liger_kernel/ops/fused_linear_cross_entropy.py#L69
https://github.com/linkedin/Liger-Kernel/blob/acd82728207ebafad28d448640502c108901a967/src/liger_kernel/ops/fused_linear_cross_entropy.py#L91-L96

I'm wondering if there are any reasons for upcasting/downcasting the logits dtype outside the kernel? If I understand correctly, we already...