Rui Wang comments

Results: 47 comments of Rui Wang

[BUG] Is mcore-0.12.0 checkpoint resume correct?

> I also observed this resumption issue, but I am not sure whether it happens with v11.0 as well (I need more time to test this). However, I have narrowed down...

[BUG] Is mcore-0.12.0 checkpoint resume correct?

I think I may have figured this out. Try setting `make_vocab_size_divisible_by` to `256`, which reduces the chance of saving a corrupted checkpoint compared to the default `128`.
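For context, here is a rough sketch of how this setting changes the padded vocabulary size. It assumes the padding rule is "round up to a multiple of `make_vocab_size_divisible_by` times the tensor-parallel size"; the helper name `padded_vocab_size`, the vocabulary size `50257`, and the tensor-parallel size of `2` are illustrative assumptions, not values from the issue.

```python
def padded_vocab_size(orig_vocab_size: int, divisible_by: int, tp_size: int = 1) -> int:
    """Round the vocabulary size up to a multiple of divisible_by * tp_size.

    Hypothetical helper for illustration only; it mirrors the assumed
    padding rule, not any specific library function.
    """
    multiple = divisible_by * tp_size
    # Integer ceiling division, then scale back up to the multiple.
    return -(-orig_vocab_size // multiple) * multiple


if __name__ == "__main__":
    # Compare the default (128) with the suggested value (256).
    for divisible_by in (128, 256):
        size = padded_vocab_size(50257, divisible_by, tp_size=2)
        print(f"make_vocab_size_divisible_by={divisible_by}: padded vocab = {size}")
```

This only shows how the embedding shape differs between the two settings; it does not explain why one setting would be less likely to produce a corrupted checkpoint.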

CUDA error: invalid configuration argument : /work/lib/libmarv/src/pssm.cuh

Hi, I got the exact same message, and I had set up the GPU=1 database beforehand (@milot-mirdita). The odd thing is that it worked a few days ago. Best, Rui

CUDA error: invalid configuration argument : /work/lib/libmarv/src/pssm.cuh

A bit of an update: this looks like the same issue as https://github.com/NVIDIA/nccl/issues/1338. However, our fabric manager was working just fine. As a result, we performed a cold reboot and...

Ablation tests with the same headdim as v with differential transformers

Hi @YTianZHU, thanks for the response! My bad, I somehow skipped that row. If I understood correctly, this could mean that 256 is somewhat redundant in this case? Also,...

Installation fails (due to recent change?)

I can reproduce this bug. It seems torch 2.1 works fine, but 2.0.1 does not.

How can we integrate the DeepGEMM Fp8 GEMM implementation in TE's block-wise scaling?

Hi Xin, Could we expect an ETA on this?