ColossalAI issues

how to run it with 1080ti/P40, namely CC is 6.1

1

### Describe the feature How to run it on a 1080 Ti/P40, i.e. with compute capability (CC) 6.1

SeekPoint

enhancement

[BUG]: --master_addr

2

### 🐛 Describe the bug When I use the command "colossalai run --nproc_per_node 1 --master_addr GPU001 --master_port 29505 --host GPU001 main.py", it does not work, but the command "colossalai run --nproc_per_node...

bingokunkun

bug

[coati] How to get prompt_path and pretrain_dataset?

9

Hi, I want to reproduce the training process but do not have the two datasets. Do you plan to open-source them? Thanks. https://github.com/hpcaitech/ColossalAI/blob/638a07a7f9b504e6c9781e9aa2a9b6c5e9dc49ed/applications/Chat/examples/train_prompts.py#L208-L209

gongel

[BUG]: ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 3 (pid: 812917) of binary

4

### 🐛 Describe the bug ``` colossalai run --nproc_per_node=4 train_sft.py \ > --pretrain "/data/chenhao/train/ColossalAI/to/llama-7b-hf/" \ > --model 'llama' \ > --strategy colossalai_zero2 \ > --log_interval 10 \ > --save_path "/data/chenhao/train/ColossalAI/Coati-7B"...

twwch

bug

[BUG]: tensornvme installation is incomplete in official docker images

### 🐛 Describe the bug ### Description The official Docker images run the [TensorNVMe](https://github.com/hpcaitech/TensorNVMe) install commands; however, at runtime, executing `cd TensorNVMe && tensornvme check` (or running the training demos...

MEllis-github

bug

[BUG]: I encountered a bug when importing "coati" while running "sh train_sft.sh" in "ColossalAI/applications/Chat/examples"

2

### 🐛 Describe the bug ColossalAI/applications/Chat/examples$ sh train_sft.sh WARNING:torch.distributed.run: ***************************************** Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further...

JchenC361

bug

[BUG]: torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 6 (pid: 1268782) of binary: /usr/bin/python3

20

### 🐛 Describe the bug GPU: 8*A6000 CUDA Version: 11.7 Python Version: 3.8.10 colossalai Version: 0.2.8 when I train PPO by ``` torchrun --standalone --nproc_per_node=8 train_prompts.py \ --pretrain "decapoda-research/llama-7b-hf" \...

ifromeast

bug

How can I access my database and get the data statistics I want?

### Describe the feature First of all, thank you so much for sharing your project! At present, I have a requirement, which is as follows: First, I have a database,...

tensorflowt

enhancement

[booster] gemini plugin support shard checkpoint

## 📌 Checklist before creating the PR - [ yes ] I have created an issue for this PR for traceability - [ yes ] The title follows the standard...

flybird11111

Run Build and Test

API

[BUG]: ddp training in diffusion

1

### 🐛 Describe the bug How can I use DDP training in diffusion? I saw train_ddp.yaml, but it is no different from train_colossalai.yaml. How do I set the...

zhangvia

bug
