Ofey Chan
[src/ray/raylet/node_manager.cc](https://github.com/ray-project/ray/blob/e33edcb0b7d2f9762b173a34ba64b41079592978/src/ray/raylet/node_manager.cc#L896)

```cpp
for (const auto &worker : worker_pool_.GetAllRegisteredWorkers()) {
  if (worker->IsAvailableForScheduling()) {
    // Progress is being made in a task, don't warn.
    resource_deadlock_warned_ = 0;
    return;
  }
}
```

Maybe...
Oh no, I just set `git config pull.rebase true` and did something strange! - I'll roll it back quickly...
> Looks like lots of tests are failing

Maybe the comment `// No assigned task` is wrong, but the condition is correct? I discussed this possibility in #27727, [comment...
> cc @ofey404 can you change the func name and merge the latest master?

Sure.
> cpp tests failing!

I'll rebase onto the latest release and trigger CI again.
Okay, thank you for the help! I might be a beginner and I'm glad to check those links.
Hey everybody! I managed to support (limited) Tensor Parallelism, check it by running:

```bash
torchrun --standalone --nnodes 1 --nproc_per_node 4 main_pretrain.py --config ./config/pretrain_1d_tp2.py
```

I tune the model inside `models_mae_tp.py`....
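For readers unfamiliar with the setup, here is a rough sketch of what a config like `./config/pretrain_1d_tp2.py` might contain. It assumes Colossal-AI's legacy Python config format, where a module-level `parallel` dict describes the layout. The values are guesses based on the file name and the `torchrun` command, not the actual file contents, and the names `TENSOR_PARALLEL_SIZE`, `WORLD_SIZE`, and `data_parallel_size` are purely illustrative.

```python
# Hypothetical sketch of a 1D tensor-parallel config (NOT the actual file).
# Assumption: Colossal-AI's legacy config format, which reads a module-level
# `parallel` dict to decide how processes are grouped.

TENSOR_PARALLEL_SIZE = 2  # guessed from the "tp2" suffix in the file name
WORLD_SIZE = 4            # matches --nproc_per_node 4 in the torchrun command

parallel = dict(
    tensor=dict(size=TENSOR_PARALLEL_SIZE, mode="1d"),
)

# Processes not consumed by tensor parallelism act as data-parallel replicas.
data_parallel_size = WORLD_SIZE // TENSOR_PARALLEL_SIZE

if __name__ == "__main__":
    print(f"tensor parallel size: {TENSOR_PARALLEL_SIZE}")
    print(f"data parallel size: {data_parallel_size}")
```

Under these assumptions, the four processes would split into two tensor-parallel groups of two ranks each, with each group holding one sharded copy of the model.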
Add save & load model functionality, with `colossalai.utils.checkpointing`.
> Hi @ofey404, thank you for your contribution! Would you please provide train logs in different parallelism settings?

Several epochs or a full run? A full 800 epochs run might...
[ImageNet100 on Kaggle](https://www.kaggle.com/ambityga/imagenet100) is 16 GB, while the ImageNet1000 I used is only 2 GB. CIFAR10 might be a good candidate for basic validation. The problem is that the original pretrain...