ColossalAI
Making large AI models cheaper, faster and more accessible
## 📌 Checklist before creating the PR - [x] I have created an issue for this PR for traceability - [ ] The title follows the standard format: `[doc/gemini/tensor/...]: A...
### Proposal https://github.com/hpcaitech/ColossalAI/blob/d7bf284706ef256c38d3aad53142b07cfc0fc10e/applications/Chat/coati/trainer/ppo.py#L210 https://github.com/hpcaitech/ColossalAI/blob/d7bf284706ef256c38d3aad53142b07cfc0fc10e/applications/Chat/coati/trainer/ppo.py#L213 This is probably a simple mistake; one of the two lines needs to be removed. ### Self-service - [X] I'd be willing to do some initial work on this proposal myself.
### 🐛 Describe the bug [https://github.com/hpcaitech/ColossalAI/blob/b0ce5a10326912961f0bc07cbbd250bab7b9c399/applications/Chat/coati/models/base/critic.py#L45-L50](https://github.com/hpcaitech/ColossalAI/blob/b0ce5a10326912961f0bc07cbbd250bab7b9c399/applications/Chat/coati/models/base/critic.py#L45-L50) We should use the generation part of the sequences to compute the value, so the indexing should be `[-num_actions:]` instead of `[:-num_actions]`, and use `action_mask`....
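The slicing difference the issue describes can be illustrated with a minimal sketch (plain Python stand-in for the per-token value tensor; names are illustrative, not the critic's actual code):

```python
# The critic should score the generated (action) tail of the sequence,
# i.e. the LAST num_actions positions, not everything before them.
seq_len = 10
num_actions = 4
values = list(range(seq_len))  # stand-in for per-token value predictions

wrong = values[:-num_actions]  # prompt part: positions 0..5
right = values[-num_actions:]  # generated part: positions 6..9

print(wrong)  # [0, 1, 2, 3, 4, 5]
print(right)  # [6, 7, 8, 9]
```

The same slice semantics apply to a PyTorch tensor indexed along the sequence dimension.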
## 📌 Checklist before creating the PR - [x] I have created an issue for this PR for traceability - [x] The title follows the standard format: `[doc/gemini/tensor/...]: A concise...
### 🐛 Describe the bug Missing definitions for `prompt_sampler` and `pretrain_sampler` in `examples/train_prompts.py` when `dist.get_world_size() == 1`. ### Environment _No response_
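A minimal sketch of the missing single-process fallback, assuming the script only constructs a distributed sampler when `world_size > 1` (function and variable names here are hypothetical, not the actual code in `train_prompts.py`):

```python
import random

def make_sampler_indices(dataset_len, world_size, rank, seed=42):
    """Return the index order for this rank.

    For world_size > 1 this mimics a DistributedSampler shard;
    for world_size == 1 it still returns a defined (shuffled) order
    instead of leaving the sampler undefined.
    """
    rng = random.Random(seed)
    indices = list(range(dataset_len))
    rng.shuffle(indices)
    if world_size > 1:
        return indices[rank::world_size]  # this rank's shard
    return indices  # single-process case: plain shuffled order
```

In the real fix one would define `prompt_sampler`/`pretrain_sampler` (e.g. as `None` or a `RandomSampler`) on the `world_size == 1` branch so later code can reference them unconditionally.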
### 🐛 Describe the bug ``` wandb: Synced 6 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s) wandb: Waiting for W&B process to finish... (success). wandb:...
## 📌 Checklist before creating the PR - [ ] I have created an issue for this PR for traceability - [ ] The title follows the standard format: `[doc/gemini/tensor/...]:...
I am trying to fine-tune the Llama 13B model using ColossalAI. However, memory usage is quite high, exceeding 270 GB, and directly causes an OOM error. Is there any way...
## 📌 Checklist before creating the PR - [ ] I have created an issue for this PR for traceability - [x] The title follows the standard format: `[doc/gemini/tensor/...]: A...
### 🐛 Describe the bug ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 514946) of binary: [E ProcessGroupNCCL.cpp:821] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=27962, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1807582 milliseconds...
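The `Timeout(ms)=1800000` in the log is NCCL's default 30-minute collective timeout. One common workaround (an assumption about the fix, not confirmed by the issue) is to pass a larger `timeout` to `torch.distributed.init_process_group`; the snippet below just verifies the unit conversion without requiring torch:

```python
from datetime import timedelta

# 1,800,000 ms from the watchdog log is exactly the 30-minute default.
# A hedged workaround sketch (PyTorch distributed API):
#   torch.distributed.init_process_group("nccl", timeout=timedelta(hours=2))
default_timeout = timedelta(milliseconds=1_800_000)
print(default_timeout)  # 0:30:00
```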