ColossalAI
Making large AI models cheaper, faster and more accessible
## 📌 Checklist before creating the PR - [x] I have created an issue for this PR for traceability - [ ] The title follows the standard format: `[doc/gemini/tensor/...]: A...
### Proposal https://github.com/hpcaitech/ColossalAI/blob/d7bf284706ef256c38d3aad53142b07cfc0fc10e/applications/Chat/coati/trainer/ppo.py#L210 https://github.com/hpcaitech/ColossalAI/blob/d7bf284706ef256c38d3aad53142b07cfc0fc10e/applications/Chat/coati/trainer/ppo.py#L213 This is probably a simple mistake; one of the two lines needs to be removed. ### Self-service - [X] I'd be willing to do some initial work on this proposal myself.
### 🐛 Describe the bug [https://github.com/hpcaitech/ColossalAI/blob/b0ce5a10326912961f0bc07cbbd250bab7b9c399/applications/Chat/coati/models/base/critic.py#L45-L50](https://github.com/hpcaitech/ColossalAI/blob/b0ce5a10326912961f0bc07cbbd250bab7b9c399/applications/Chat/coati/models/base/critic.py#L45-L50) We should use the generation part of the sequences to compute the value, so the indexing should be `[-num_actions:]` instead of `[:-num_actions]`, and use `action_mask`....
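The slicing difference the issue describes can be illustrated with a minimal sketch (plain Python stand-in for the per-token value tensor; names are illustrative, not the critic's actual code):

```python
# The critic should score the generated (action) tail of the sequence,
# i.e. the LAST num_actions positions, not everything before them.
seq_len = 10
num_actions = 4
values = list(range(seq_len))  # stand-in for per-token value predictions

wrong = values[:-num_actions]  # prompt part: positions 0..5
right = values[-num_actions:]  # generated part: positions 6..9

print(wrong)  # [0, 1, 2, 3, 4, 5]
print(right)  # [6, 7, 8, 9]
```

The same slice semantics apply to a PyTorch tensor indexed along the sequence dimension.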
## 📌 Checklist before creating the PR - [x] I have created an issue for this PR for traceability - [x] The title follows the standard format: `[doc/gemini/tensor/...]: A concise...
### 🐛 Describe the bug Missing definitions for `prompt_sampler` and `pretrain_sampler` in `examples/train_prompts.py` when `dist.get_world_size() == 1`. ### Environment _No response_
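A minimal sketch of the missing single-process fallback, assuming the script only constructs a distributed sampler when `world_size > 1` (function and variable names here are hypothetical, not the actual code in `train_prompts.py`):

```python
import random

def make_sampler_indices(dataset_len, world_size, rank, seed=42):
    """Return the index order for this rank.

    For world_size > 1 this mimics a DistributedSampler shard;
    for world_size == 1 it still returns a defined (shuffled) order
    instead of leaving the sampler undefined.
    """
    rng = random.Random(seed)
    indices = list(range(dataset_len))
    rng.shuffle(indices)
    if world_size > 1:
        return indices[rank::world_size]  # this rank's shard
    return indices  # single-process case: plain shuffled order
```

In the real fix one would define `prompt_sampler`/`pretrain_sampler` (e.g. as `None` or a `RandomSampler`) on the `world_size == 1` branch so later code can reference them unconditionally.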
### 🐛 Describe the bug ``` wandb: Synced 6 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s) wandb: Waiting for W&B process to finish... (success). wandb:...
## 📌 Checklist before creating the PR - [ ] I have created an issue for this PR for traceability - [ ] The title follows the standard format: `[doc/gemini/tensor/...]:...
I am trying to fine-tune the Llama 13B model using ColossalAI. However, memory usage is quite high, exceeding 270 GB, and directly causes an OOM error. Is there any way...
## 📌 Checklist before creating the PR - [ ] I have created an issue for this PR for traceability - [x] The title follows the standard format: `[doc/gemini/tensor/...]: A...
### 🐛 Describe the bug ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 514946) of binary: [E ProcessGroupNCCL.cpp:821] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=27962, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1807582 milliseconds...
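The `Timeout(ms)=1800000` in the log is NCCL's default 30-minute collective timeout. One common workaround (an assumption about the fix, not confirmed by the issue) is to pass a larger `timeout` to `torch.distributed.init_process_group`; the snippet below just verifies the unit conversion without requiring torch:

```python
from datetime import timedelta

# 1,800,000 ms from the watchdog log is exactly the 30-minute default.
# A hedged workaround sketch (PyTorch distributed API):
#   torch.distributed.init_process_group("nccl", timeout=timedelta(hours=2))
default_timeout = timedelta(milliseconds=1_800_000)
print(default_timeout)  # 0:30:00
```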