Results: 33 comments of Wizyoung

This issue is not fixed yet.

Adding a CUDA stream for device separation may fix this:

```python
stream = torch.cuda.Stream()  # this can be created once in __init__
with torch.cuda.stream(stream):
    output = self.model.generate(xxx)
stream.synchronize()
```

This...

I'm wondering why this PR hasn't been merged.

I found that numerical precision cannot be guaranteed in this PR under fp32.

> If the dataset is on local disk, you should specify the path to `local`. Otherwise `remote` is fine. Last time I used version 0.8.1 with my dataset stored on...

@snarayan21 Okay, I will try to reproduce this issue with the latest version next week and then provide more details in a new issue report.

Actually, if we align the CHUNK_SIZE of the Torch-compiled FLCE with the strategy used in Liger's FLCE, the compiled version is only slightly faster than the Liger version, but it...

After setting CHUNK_SIZE and running the benchmark script: ![image](https://github.com/user-attachments/assets/2de14072-7a48-4439-92d9-d460c0af2f0b) ![image](https://github.com/user-attachments/assets/8bc4d1c3-0fb6-489b-a4b6-4706f8fb2e76)

@Chillee By referencing https://github.com/linkedin/Liger-Kernel/blob/main/src/liger_kernel/ops/fused_linear_cross_entropy.py#L23. I mean changing the chunk size in the torch-compiled FLCE. Your default chunk size is 1024, and I changed it to 256. Then I have: ![image](https://github.com/user-attachments/assets/06e43de9-bd5f-4d08-a00e-cef45fa1f80f) By only...

I have done some quick tests with different B, T, D, and V to mimic my training conditions (llama3 and gemma2) in my env; my conclusion is that the torch-compiled FLCE...
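For context on what CHUNK_SIZE controls in the comments above: a fused linear cross entropy (FLCE) projects hidden states to logits and computes the loss chunk by chunk, so the full (N, V) logits tensor is never materialized at once. Below is a minimal NumPy sketch of that chunking idea, not the actual Liger or torch-compiled kernels; the function name `chunked_flce` and its signature are illustrative only.

```python
import numpy as np

def chunked_flce(x, weight, targets, chunk_size=256):
    """Fused linear + cross-entropy, processing `chunk_size` rows at a time.

    x: (N, D) hidden states, weight: (V, D) lm-head, targets: (N,) class ids.
    Only a (chunk_size, V) logits slice is alive at any moment, which is
    where the memory savings over a naive (N, V) materialization come from.
    """
    n = x.shape[0]
    total = 0.0
    for start in range(0, n, chunk_size):
        xc = x[start:start + chunk_size]             # (c, D) input slice
        tc = targets[start:start + chunk_size]       # (c,) target ids
        logits = xc @ weight.T                       # (c, V): the only large buffer
        logits -= logits.max(axis=1, keepdims=True)  # stabilize the softmax
        logsumexp = np.log(np.exp(logits).sum(axis=1))
        total += (logsumexp - logits[np.arange(len(tc)), tc]).sum()
    return total / n
```

A smaller chunk size lowers peak memory but increases the number of matmul/kernel launches, which is the speed-vs-memory trade-off the benchmark screenshots above are probing.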