Rui Wang
> I also observed this resumption issue but not really sure if it happens with v11.0 as well (need more time to test on this). However, I have narrowed down...
I think I may have figured this out. Try setting `make_vocab_size_divisible_by` to `256`, which will reduce the chance of saving a corrupted checkpoint compared to the default `128`.
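For context, here is a minimal sketch of what this kind of setting usually does, assuming the framework rounds the vocabulary size up to the next multiple of `make_vocab_size_divisible_by` times the tensor-parallel size (as Megatron-style codebases do). The function name `padded_vocab_size` and the `tp_size` parameter are hypothetical, used only for illustration; this does not show why the checkpoint gets corrupted.

```python
def padded_vocab_size(orig_vocab_size: int, divisible_by: int = 128, tp_size: int = 1) -> int:
    """Round the vocabulary size up to the next multiple of divisible_by * tp_size.

    Assumption: padding follows the Megatron-style rule; check your framework's
    own implementation before relying on these numbers.
    """
    multiple = divisible_by * tp_size
    # Ceiling division, then scale back up to the nearest multiple.
    return ((orig_vocab_size + multiple - 1) // multiple) * multiple


if __name__ == "__main__":
    # Compare the embedding row count under the default and the suggested value.
    for divisor in (128, 256):
        print(divisor, padded_vocab_size(50257, divisible_by=divisor))
```

Under this assumption, changing the value changes the padded embedding size, so a checkpoint saved with `128` may not load cleanly into a model built with `256`.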
Hi, I got the exact same message, and I had already set up the GPU=1 database beforehand (@milot-mirdita). The thing is that it worked a few days ago. Best, Rui
A bit of an update: Looks like this shares the same issue with https://github.com/NVIDIA/nccl/issues/1338. However, our fabric manager was working just fine. As a result, we performed a cold reboot and...
Hi @YTianZHU, Thanks for the response! My bad, I somehow skipped that row. If I understood correctly, this could mean that 256 is somewhat redundant in this case? Also,...
I can reproduce this bug. It seems torch 2.1 works fine, but 2.0.1 does not.
Hi Xin, Could we expect an ETA on this?