ColossalAI

Making large AI models cheaper, faster and more accessible
## 📌 Checklist before creating the PR

- [x] I have created an issue for this PR for traceability
- [ ] The title follows the standard format: `[doc/gemini/tensor/...]: A...`
[BUG]: ZeroOptimizer in pipeline gets stuck when only several layers have parameters to be optimized
### 🐛 Describe the bug

I am using this configuration as an example:

```python
plugin = HybridParallelPlugin(
    tp_size=2,
    pp_size=2,
    zero_stage=1,
    microbatch_size=1,
    num_microbatches=None,
    enable_jit_fused=False,
    enable_fused_normalization=True,
    enable_flash_attention=True,
    precision=mixed_precision,
    initial_scale=1,
)
```

The parameters needed...
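For context, here is a minimal sketch of how a `HybridParallelPlugin` like this is typically attached through the Booster API. The tiny Llama config, the Adam optimizer, and the launch call are placeholders for illustration; the script assumes it is started with `colossalai run` so that the distributed environment (4 ranks for tp_size=2 × pp_size=2) already exists.

```python
import torch
import colossalai
from colossalai.booster import Booster
from colossalai.booster.plugin import HybridParallelPlugin
from transformers import LlamaConfig, LlamaForCausalLM

# Assumes launch via `colossalai run --nproc_per_node 4 this_script.py`;
# some older releases use colossalai.launch_from_torch(config={}).
colossalai.launch_from_torch()

# Tiny randomly initialized Llama as a stand-in for the real model.
model = LlamaForCausalLM(LlamaConfig(num_hidden_layers=2, hidden_size=128,
                                     intermediate_size=256, num_attention_heads=4))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

plugin = HybridParallelPlugin(
    tp_size=2, pp_size=2, zero_stage=1,
    microbatch_size=1, precision="fp16", initial_scale=1,
)
booster = Booster(plugin=plugin)
model, optimizer, _, _, _ = booster.boost(model, optimizer)
# With pp_size > 1, a training step then goes through
# booster.execute_pipeline(...) rather than a plain forward/backward.
```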
### 🐛 Describe the bug

```
raise RuntimeError(
RuntimeError: Failed to replace input_layernorm of type LlamaRMSNorm with FusedRMSNorm with the exception: Please install apex from source (https://github.com/NVIDIA/apex) to use the fused...
```
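If building apex is not an option, one possible workaround (an assumption based on the flag shown in the report above, not a confirmed fix) is to disable the fused-normalization swap so shardformer leaves `LlamaRMSNorm` in place:

```python
from colossalai.booster.plugin import HybridParallelPlugin

# Assumption: with enable_fused_normalization=False, shardformer should not
# try to replace LlamaRMSNorm with the apex-backed FusedRMSNorm, trading
# some kernel-fusion speed for not needing apex. Must run inside a job
# started with `colossalai run`, since the plugin sets up process groups.
plugin = HybridParallelPlugin(
    tp_size=2,
    pp_size=2,
    enable_fused_normalization=False,
)
```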
### 🐛 Describe the bug

I run my server with this:

```
python3 ./ColossalAI/applications/Chat/inference/server.py /home/ubuntu/modelpath/llama-7b/llama-7b/ --quant 8bit --http_host 0.0.0.0 --http_port 8080
```

Then I call the API with this:

```python
import requests
import...
```
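For reference, a hypothetical client call against such a server; the `/generate` route and the JSON fields below are assumptions for illustration, not taken from `server.py` — check that file for the actual API:

```python
import requests

# Hypothetical endpoint and payload: "/generate", "prompt", and
# "max_new_tokens" are assumptions; server.py defines the real schema.
resp = requests.post(
    "http://0.0.0.0:8080/generate",
    json={"prompt": "Hello, who are you?", "max_new_tokens": 64},
    timeout=120,
)
print(resp.status_code, resp.text)
```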
### 🐛 Describe the bug

```
RuntimeError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:29601 (errno: 98...
```
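errno 98 is EADDRINUSE: another process is already bound to the rendezvous port (29601 here). A minimal stdlib probe to confirm this before relaunching on a free port (e.g. via a different `--master_port`), assuming a single-host check:

```python
import socket

# errno 98 = EADDRINUSE. If this connect succeeds, something is already
# listening on the rendezvous port; relaunch with a different master port.
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    in_use = s.connect_ex(("127.0.0.1", 29601)) == 0
print("port 29601 in use:", in_use)
```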
### 🐛 Describe the bug

When I use the booster API and the Gemini plugin to train PIDM, this error happens:

```python
File "train.py", line 167, in train
    booster.backward(loss, optimizer)
File...
```
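For context, a minimal sketch of where `booster.backward` sits in a Gemini training step; the toy model, data, and launch call are placeholders, and the script assumes it is started under `colossalai run`:

```python
import torch
import colossalai
from colossalai.booster import Booster
from colossalai.booster.plugin import GeminiPlugin
from colossalai.nn.optimizer import HybridAdam

# Older releases use colossalai.launch_from_torch(config={}).
colossalai.launch_from_torch()

model = torch.nn.Sequential(torch.nn.Linear(32, 32), torch.nn.ReLU(),
                            torch.nn.Linear(32, 1))  # stand-in for PIDM
optimizer = HybridAdam(model.parameters(), lr=1e-4)

booster = Booster(plugin=GeminiPlugin(precision="fp16", initial_scale=1))
model, optimizer, _, _, _ = booster.boost(model, optimizer)

x = torch.randn(8, 32, device="cuda")
loss = model(x).pow(2).mean()
booster.backward(loss, optimizer)  # the plugin owns backward; not loss.backward()
optimizer.step()
optimizer.zero_grad()
```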
### 🐛 Describe the bug

I run:

```
colossalai run --nproc_per_node 8 finetune.py \
    --plugin "gemini_auto" \
    --dataset "/home/pdl/xlz/ColossalAI/data" \
    --model_path "/home/pdl/xlz/pretrain_weights/Colossal-LLaMA-2-7b-base" \
    --task_name "qaAll_final.jsonl" \
    --save_dir "./output" \
    --flash_attention \
    ...
```
### Discussed in https://github.com/hpcaitech/ColossalAI/discussions/5027

Originally posted by **jiejie1993** November 8, 2023

During multi-node, multi-GPU training, NCCL timeouts occur. torch's `--max-restarts` can restart the training, but how do I automatically load the latest saved model? Using `--load-checkpoint` requires every node to have the saved model, yet during training the model is only saved on the master node, and copying it to all nodes by hand makes automatic restarts impossible. Is there a way to automatically restart interrupted training and resume from the most recently saved model?
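One pattern that addresses this (a sketch under the assumption that checkpoints are written to storage visible to every node, such as NFS; the directory and `epoch-*` naming scheme below are made up for illustration) is to have each restart look for the newest checkpoint before training:

```python
import glob
import os

# Assumption: CKPT_DIR lives on shared storage (e.g. NFS) so every node
# sees the checkpoints saved by the master node.
CKPT_DIR = "/shared/checkpoints"

def latest_checkpoint(ckpt_dir: str) -> str | None:
    """Return the most recently modified checkpoint, or None if absent."""
    candidates = glob.glob(os.path.join(ckpt_dir, "epoch-*"))
    return max(candidates, key=os.path.getmtime) if candidates else None

ckpt = latest_checkpoint(CKPT_DIR)
if ckpt is not None:
    # With the Booster API the actual restore would look like:
    # booster.load_model(model, ckpt)
    print(f"resuming from {ckpt}")
else:
    print("no checkpoint found, starting from scratch")
```

Combined with `torchrun --max-restarts`, every relaunched worker then resumes from the same shared checkpoint without manual copying.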
### 📚 The doc issue

I want to replace Adam with SGD in [Colossal-LLaMA-2](https://github.com/hpcaitech/ColossalAI/tree/main/applications/Colossal-LLaMA-2) because I don't have enough GPUs, but I do have time to adjust hyper-parameters. Are there any examples...
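Not documentation, but a minimal sketch of the swap itself, assuming the training script constructs its optimizer in one place (the released script uses HybridAdam); the learning rate and momentum here are illustrative, not tuned:

```python
import torch
from torch import nn

model = nn.Linear(16, 16)  # stand-in for the real model

# Before (roughly what the Colossal-LLaMA-2 script does):
# from colossalai.nn.optimizer import HybridAdam
# optimizer = HybridAdam(model.parameters(), lr=2e-5)

# After: SGD keeps a single momentum buffer per parameter instead of
# Adam's two moment buffers, so optimizer-state memory roughly halves.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
```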
### Describe the feature

I found that both examples truncate text longer than max_length, so we have to segment long text into shorter pieces ourselves. For examples/language/llama2, the code...
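As a stopgap, a sketch of the kind of segmentation one currently has to do by hand, assuming a HuggingFace-style tokenizer; `max_length` and the overlap `stride` are illustrative choices:

```python
from typing import Iterator, List

def chunk_token_ids(ids: List[int], max_length: int, stride: int = 0) -> Iterator[List[int]]:
    """Yield windows of at most max_length tokens, overlapping by stride."""
    assert 0 <= stride < max_length
    step = max_length - stride
    for start in range(0, len(ids), step):
        yield ids[start:start + max_length]
        if start + max_length >= len(ids):
            break

# Usage with a hypothetical tokenizer:
# ids = tokenizer(text, add_special_tokens=False)["input_ids"]
# samples = list(chunk_token_ids(ids, max_length=4096, stride=128))
```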