Rui Wang

Results 47 comments of Rui Wang

> I also observed this resumption issue but not really sure if it happens with v11.0 as well (need more time to test on this). However, I have narrowed down...

I think I may have figured this out. Try setting `make_vocab_size_divisible_by` to `256`, which will reduce the chance of saving a corrupted checkpoint compared to the default `128`.
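For anyone wondering what changing this value does to the model shape, here is a minimal sketch. It assumes the option behaves as it does in Megatron-style training code, where the vocabulary is padded up to the next multiple of this value times the tensor-parallel size. The function name `padded_vocab_size` and the example vocabulary size `50257` are illustrative only and are not taken from the thread.

```python
def padded_vocab_size(orig_vocab_size: int, divisible_by: int, tp_size: int = 1) -> int:
    """Round the vocabulary size up to the next multiple of divisible_by * tp_size.

    Hypothetical helper: it mirrors the rounding assumed above, not any
    particular library's implementation.
    """
    multiple = divisible_by * tp_size
    # Ceiling division, then scale back up to get the padded size.
    return ((orig_vocab_size + multiple - 1) // multiple) * multiple


if __name__ == "__main__":
    # Compare the default (128) with the suggested value (256).
    for divisible_by in (128, 256):
        print(divisible_by, padded_vocab_size(50257, divisible_by))
```

Note that a checkpoint saved with one padded size will generally not match the embedding shape of a model built with another, so the value should stay the same between saving and resuming.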

Hi, I got the exact same message, and I had already set up a GPU=1 database beforehand (@milot-mirdita). The thing is that it worked a few days ago. Best, Rui

A bit of an update: this looks like the same issue as https://github.com/NVIDIA/nccl/issues/1338. However, our fabric manager was working just fine. As a result, we performed a cold reboot and...

Hi @YTianZHU, thanks for the response! My bad, I somehow skipped that row. If I understood correctly, this could mean that 256 is somewhat redundant in this case? Also,...

I can reproduce this bug. It seems torch 2.1 works fine, but 2.0.1 does not.