
Making large AI models cheaper, faster and more accessible

Results: 1072 ColossalAI issues

## 📌 Checklist before creating the PR

- [x] I have created an issue for this PR for traceability
- [ ] The title follows the standard format: `[doc/gemini/tensor/...]: A...

### Proposal

https://github.com/hpcaitech/ColossalAI/blob/d7bf284706ef256c38d3aad53142b07cfc0fc10e/applications/Chat/coati/trainer/ppo.py#L210
https://github.com/hpcaitech/ColossalAI/blob/d7bf284706ef256c38d3aad53142b07cfc0fc10e/applications/Chat/coati/trainer/ppo.py#L213

This is probably a simple mistake; one of the two referenced lines needs to be removed.

### Self-service

- [X] I'd be willing to do some initial work on this proposal myself.

enhancement

### 🐛 Describe the bug

https://github.com/hpcaitech/ColossalAI/blob/b0ce5a10326912961f0bc07cbbd250bab7b9c399/applications/Chat/coati/models/base/critic.py#L45-L50

We should use the generation part of the sequences to compute the value, so the indexing should be `[-num_actions:]` instead of `[:-num_actions]`, and `action_mask` should be applied....
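A minimal sketch of the indexing change described above, assuming a critic whose value head emits one value per token; the names `values`, `action_mask`, and `num_actions` mirror the linked snippet, but the tensor shapes and surrounding code are assumptions:

```python
import torch


def select_action_values(values: torch.Tensor,
                         action_mask: torch.Tensor,
                         num_actions: int) -> torch.Tensor:
    """Keep only the values for the generated (action) part of the sequence.

    values:      (batch, seq_len) per-token values from the value head (assumed shape)
    action_mask: (batch, num_actions) mask over the generated tokens (assumed shape)
    """
    # Take the trailing `num_actions` positions (the generation part),
    # not the leading prompt part that `[:-num_actions]` would select.
    action_values = values[:, -num_actions:]
    # Zero out padded positions within the generated part.
    return action_values * action_mask
```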

bug

## 📌 Checklist before creating the PR

- [x] I have created an issue for this PR for traceability
- [x] The title follows the standard format: `[doc/gemini/tensor/...]: A concise...

### 🐛 Describe the bug

Missing definition for `prompt_sampler` and `pretrain_sampler` in `examples/train_prompts.py` when `dist.get_world_size() == 1`.

### Environment

_No response_
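A hedged sketch of one way the missing single-process branch could be filled in, assuming the script only builds a `DistributedSampler` in the multi-process case; `build_sampler`, `prompt_dataset`, and `pretrain_dataset` are placeholders, not the actual names in `examples/train_prompts.py`:

```python
import torch.distributed as dist
from torch.utils.data import RandomSampler
from torch.utils.data.distributed import DistributedSampler


def build_sampler(dataset):
    # Multi-process training: shard the dataset across ranks.
    if dist.is_initialized() and dist.get_world_size() > 1:
        return DistributedSampler(dataset, shuffle=True)
    # Single-process fallback: the branch the issue reports as missing.
    return RandomSampler(dataset)


# Hypothetical usage; the dataset variables are placeholders.
# prompt_sampler = build_sampler(prompt_dataset)
# pretrain_sampler = build_sampler(pretrain_dataset)
```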

bug

### 🐛 Describe the bug

```
wandb: Synced 6 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
wandb: Waiting for W&B process to finish... (success).
wandb: ...
```

bug

## 📌 Checklist before creating the PR

- [ ] I have created an issue for this PR for traceability
- [ ] The title follows the standard format: `[doc/gemini/tensor/...]:...

I am trying to fine-tune the Llama 13B model using ColossalAI. However, the memory usage is quite high, exceeding 270B, which directly causes an OOM error. Is there any way...
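Not an answer from this thread, just a hedged sketch of the kind of memory-saving setup ColossalAI documents for large models: the Gemini plugin via the Booster API (parameter/optimizer sharding with CPU offloading) plus gradient checkpointing. The model path is a placeholder, and exact APIs such as the `launch_from_torch` config argument and `GeminiPlugin` options vary across ColossalAI releases, so treat the details as assumptions:

```python
import colossalai
from colossalai.booster import Booster
from colossalai.booster.plugin import GeminiPlugin
from colossalai.nn.optimizer import HybridAdam
from transformers import LlamaForCausalLM

# Older ColossalAI releases require a (possibly empty) config dict here.
colossalai.launch_from_torch(config={})

model = LlamaForCausalLM.from_pretrained("path/to/llama-13b")  # placeholder path
model.gradient_checkpointing_enable()  # trade extra compute for activation memory

# HybridAdam is the optimizer ColossalAI pairs with Gemini offloading.
optimizer = HybridAdam(model.parameters(), lr=2e-5)

# Gemini shards parameters/optimizer states across devices and can offload them
# to CPU memory; the available placement options depend on the installed version.
booster = Booster(plugin=GeminiPlugin())
model, optimizer, *_ = booster.boost(model, optimizer)
```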

## 📌 Checklist before creating the PR

- [ ] I have created an issue for this PR for traceability
- [x] The title follows the standard format: `[doc/gemini/tensor/...]: A...

### 🐛 Describe the bug

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 514946) of binary:
[E ProcessGroupNCCL.cpp:821] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=27962, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1807582 milliseconds...

bug