Wizyoung
This issue is not fixed yet.
Just adding a CUDA stream for device separation may fix this:

```python
stream = torch.cuda.Stream()  # you can place it into __init__
with torch.cuda.stream(stream):
    output = self.model.generate(xxx)
stream.synchronize()
```

This...
I'm wondering why this PR hasn't been merged?
I found that numerical precision cannot be guaranteed in this PR under fp32.
> If the dataset is on local disk, you should specify the path to `local`. Otherwise `remote` is fine. Last time I used version 0.8.1 with my dataset stored on...
@snarayan21 Okay, I will try to reproduce this issue with the latest version next week and then provide more details in a new issue report.
Actually, if we align the CHUNK_SIZE of the torch-compiled FLCE with the strategy used in Liger's FLCE, the compiled version is only slightly faster than the Liger version, but it...
By setting CHUNK_SIZE and running the benchmark script:
@Chillee By referencing https://github.com/linkedin/Liger-Kernel/blob/main/src/liger_kernel/ops/fused_linear_cross_entropy.py#L23. I mean changing the chunk size in the torch-compiled FLCE. Your default chunk size is 1024, and I changed it to 256. Then I have:  By only...
I have done some quick tests with different B, T, D, and V to mimic my training conditions (llama3 and gemma2) in my env; my conclusion is that torch-compiled flce...
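For context on the CHUNK_SIZE discussion above: the chunking strategy in a fused linear cross-entropy (FLCE) splits the flattened batch-by-sequence dimension into chunks so that only a (chunk_size, V) logits buffer is ever materialized, instead of the full (B*T, V) one. Below is a minimal NumPy sketch of that idea (forward pass only; the function name `chunked_flce` and all shapes are illustrative, not Liger's or torch.compile's actual implementation):

```python
import numpy as np

def chunked_flce(x, w, targets, chunk_size=256):
    """Linear projection + cross-entropy, materializing logits one chunk at a time.

    x: (N, D) hidden states, w: (D, V) output projection, targets: (N,) class ids.
    Only a (chunk_size, V) logits buffer exists at any moment, not (N, V).
    """
    n = x.shape[0]
    total = 0.0
    for start in range(0, n, chunk_size):
        xc = x[start:start + chunk_size]                 # (c, D) chunk of hidden states
        logits = xc @ w                                  # (c, V) -- the only large buffer
        logits -= logits.max(axis=1, keepdims=True)      # shift for numerical stability
        logsumexp = np.log(np.exp(logits).sum(axis=1))   # (c,)
        tc = targets[start:start + chunk_size]
        total += (logsumexp - logits[np.arange(len(tc)), tc]).sum()
    return total / n  # mean cross-entropy over all N tokens

rng = np.random.default_rng(0)
x = rng.standard_normal((1000, 16))
w = rng.standard_normal((16, 32))
t = rng.integers(0, 32, size=1000)
# the chunk size trades peak memory for kernel-launch overhead; the loss itself
# should match across chunk sizes up to floating-point rounding
print(np.isclose(chunked_flce(x, w, t, 256), chunked_flce(x, w, t, 1000)))
```

This is why CHUNK_SIZE is purely a performance knob: a smaller chunk lowers peak memory but runs more (smaller) matmuls, which is the trade-off the benchmarks above are probing.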