Results: 33 comments of Wizyoung

This issue is not fixed yet.

Adding a CUDA stream for device separation may fix this:

```python
stream = torch.cuda.Stream()  # this can be created once in __init__
with torch.cuda.stream(stream):
    output = self.model.generate(xxx)
stream.synchronize()
```

This...

I'm wondering why this PR hasn't been merged.

I found that numerical precision cannot be guaranteed in this PR under fp32.

> If the dataset is on local disk, you should specify the path to `local`. Otherwise `remote` is fine. Last time I used version 0.8.1 with my dataset stored on...

@snarayan21 Okay, I will try to reproduce this issue with the latest version next week and then provide more details in a new issue report.

Actually, if we align the CHUNK_SIZE of the Torch-compiled FLCE with the strategy used in Liger's FLCE, the compiled version is only slightly faster than the Liger version, but it...

After setting CHUNK_SIZE and running the benchmark script: ![image](https://github.com/user-attachments/assets/2de14072-7a48-4439-92d9-d460c0af2f0b) ![image](https://github.com/user-attachments/assets/8bc4d1c3-0fb6-489b-a4b6-4706f8fb2e76)

@Chillee By referencing https://github.com/linkedin/Liger-Kernel/blob/main/src/liger_kernel/ops/fused_linear_cross_entropy.py#L23. I mean changing the chunk size in the torch-compiled FLCE. Your default chunk size is 1024, and I changed it to 256. Then I have: ![image](https://github.com/user-attachments/assets/06e43de9-bd5f-4d08-a00e-cef45fa1f80f) By only...

I have done some quick tests with different B, T, D, and V to mimic my training conditions (llama3 and gemma2) in my env; my conclusion is that the torch-compiled FLCE...
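For context on what CHUNK_SIZE controls in the comments above: a fused linear cross entropy (FLCE) projects hidden states to logits and computes the loss chunk by chunk, so the full (N, V) logits tensor is never materialized at once. Below is a minimal NumPy sketch of that chunking idea, not the actual Liger or torch-compiled kernels; the function name `chunked_flce` and its signature are illustrative only.

```python
import numpy as np

def chunked_flce(x, weight, targets, chunk_size=256):
    """Fused linear + cross-entropy, processing `chunk_size` rows at a time.

    x: (N, D) hidden states, weight: (V, D) lm-head, targets: (N,) class ids.
    Only a (chunk_size, V) logits slice is alive at any moment, which is
    where the memory savings over a naive (N, V) materialization come from.
    """
    n = x.shape[0]
    total = 0.0
    for start in range(0, n, chunk_size):
        xc = x[start:start + chunk_size]             # (c, D) input slice
        tc = targets[start:start + chunk_size]       # (c,) target ids
        logits = xc @ weight.T                       # (c, V): the only large buffer
        logits -= logits.max(axis=1, keepdims=True)  # stabilize the softmax
        logsumexp = np.log(np.exp(logits).sum(axis=1))
        total += (logsumexp - logits[np.arange(len(tc)), tc]).sum()
    return total / n
```

A smaller chunk size lowers peak memory but increases the number of matmul/kernel launches, which is the speed-vs-memory trade-off the benchmark screenshots above are probing.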