ColossalAI

Making large AI models cheaper, faster and more accessible
## 📌 Checklist before creating the PR

- [x] I have created an issue for this PR for traceability
- [ ] The title follows the standard format: `[doc/gemini/tensor/...]: A...`
[BUG]: ZeroOptimizer in pipeline gets stuck when only several layers have parameters to be optimized
### 🐛 Describe the bug

I am using this configuration as an example:

```python
plugin = HybridParallelPlugin(
    tp_size=2,
    pp_size=2,
    zero_stage=1,
    microbatch_size=1,
    num_microbatches=None,
    enable_jit_fused=False,
    enable_fused_normalization=True,
    enable_flash_attention=True,
    precision=mixed_precision,
    initial_scale=1,
)
```

The parameters needed...
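For context, here is a minimal sketch of how a `HybridParallelPlugin` like this is typically attached through the Booster API. The tiny Llama config, the Adam optimizer, and the launch call are placeholders for illustration; the script assumes it is started with `colossalai run` so that the distributed environment (4 ranks for tp_size=2 × pp_size=2) already exists.

```python
import torch
import colossalai
from colossalai.booster import Booster
from colossalai.booster.plugin import HybridParallelPlugin
from transformers import LlamaConfig, LlamaForCausalLM

# Assumes launch via `colossalai run --nproc_per_node 4 this_script.py`;
# some older releases use colossalai.launch_from_torch(config={}).
colossalai.launch_from_torch()

# Tiny randomly initialized Llama as a stand-in for the real model.
model = LlamaForCausalLM(LlamaConfig(num_hidden_layers=2, hidden_size=128,
                                     intermediate_size=256, num_attention_heads=4))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

plugin = HybridParallelPlugin(
    tp_size=2, pp_size=2, zero_stage=1,
    microbatch_size=1, precision="fp16", initial_scale=1,
)
booster = Booster(plugin=plugin)
model, optimizer, _, _, _ = booster.boost(model, optimizer)
# With pp_size > 1, a training step then goes through
# booster.execute_pipeline(...) rather than a plain forward/backward.
```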
### 🐛 Describe the bug

```
raise RuntimeError(
RuntimeError: Failed to replace input_layernorm of type LlamaRMSNorm with FusedRMSNorm with the exception: Please install apex from source (https://github.com/NVIDIA/apex) to use the fused...
```
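If building apex is not an option, one possible workaround (an assumption based on the flag shown in the report above, not a confirmed fix) is to disable the fused-normalization swap so shardformer leaves `LlamaRMSNorm` in place:

```python
from colossalai.booster.plugin import HybridParallelPlugin

# Assumption: with enable_fused_normalization=False, shardformer should not
# try to replace LlamaRMSNorm with the apex-backed FusedRMSNorm, trading
# some kernel-fusion speed for not needing apex. Must run inside a job
# started with `colossalai run`, since the plugin sets up process groups.
plugin = HybridParallelPlugin(
    tp_size=2,
    pp_size=2,
    enable_fused_normalization=False,
)
```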
### 🐛 Describe the bug

I run my server with this:

```
python3 ./ColossalAI/applications/Chat/inference/server.py /home/ubuntu/modelpath/llama-7b/llama-7b/ --quant 8bit --http_host 0.0.0.0 --http_port 8080
```

Then I call the API with this:

```python
import requests
import...
```
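For reference, a hypothetical client call against such a server; the `/generate` route and the JSON fields below are assumptions for illustration, not taken from `server.py` — check that file for the actual API:

```python
import requests

# Hypothetical endpoint and payload: "/generate", "prompt", and
# "max_new_tokens" are assumptions; server.py defines the real schema.
resp = requests.post(
    "http://0.0.0.0:8080/generate",
    json={"prompt": "Hello, who are you?", "max_new_tokens": 64},
    timeout=120,
)
print(resp.status_code, resp.text)
```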
### 🐛 Describe the bug

```
RuntimeError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:29601 (errno: 98...
```
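errno 98 is EADDRINUSE: another process is already bound to the rendezvous port (29601 here). A minimal stdlib probe to confirm this before relaunching on a free port (e.g. via a different `--master_port`), assuming a single-host check:

```python
import socket

# errno 98 = EADDRINUSE. If this connect succeeds, something is already
# listening on the rendezvous port; relaunch with a different master port.
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    in_use = s.connect_ex(("127.0.0.1", 29601)) == 0
print("port 29601 in use:", in_use)
```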
### 🐛 Describe the bug

When I use the booster API and the Gemini plugin to train PIDM, this error happens:

```python
File "train.py", line 167, in train
    booster.backward(loss, optimizer)
File...
```
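For context, a minimal sketch of where `booster.backward` sits in a Gemini training step; the toy model, data, and launch call are placeholders, and the script assumes it is started under `colossalai run`:

```python
import torch
import colossalai
from colossalai.booster import Booster
from colossalai.booster.plugin import GeminiPlugin
from colossalai.nn.optimizer import HybridAdam

# Older releases use colossalai.launch_from_torch(config={}).
colossalai.launch_from_torch()

model = torch.nn.Sequential(torch.nn.Linear(32, 32), torch.nn.ReLU(),
                            torch.nn.Linear(32, 1))  # stand-in for PIDM
optimizer = HybridAdam(model.parameters(), lr=1e-4)

booster = Booster(plugin=GeminiPlugin(precision="fp16", initial_scale=1))
model, optimizer, _, _, _ = booster.boost(model, optimizer)

x = torch.randn(8, 32, device="cuda")
loss = model(x).pow(2).mean()
booster.backward(loss, optimizer)  # the plugin owns backward; not loss.backward()
optimizer.step()
optimizer.zero_grad()
```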
### 🐛 Describe the bug

I run:

```
colossalai run --nproc_per_node 8 finetune.py \
    --plugin "gemini_auto" \
    --dataset "/home/pdl/xlz/ColossalAI/data" \
    --model_path "/home/pdl/xlz/pretrain_weights/Colossal-LLaMA-2-7b-base" \
    --task_name "qaAll_final.jsonl" \
    --save_dir "./output" \
    --flash_attention \
    ...
```
### Discussed in https://github.com/hpcaitech/ColossalAI/discussions/5027

Originally posted by **jiejie1993** November 8, 2023

During multi-node, multi-GPU training, NCCL timeouts occur. torch's `--max-restarts` can restart the training, but how do I automatically load the latest saved model? Using `--load-checkpoint` requires every node to have the saved model, yet during training the model is only saved on the master node, and copying it to all nodes by hand makes automatic restarts impossible. Is there a way to automatically restart interrupted training and resume from the most recently saved model?
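One pattern that addresses this (a sketch under the assumption that checkpoints are written to storage visible to every node, such as NFS; the directory and `epoch-*` naming scheme below are made up for illustration) is to have each restart look for the newest checkpoint before training:

```python
import glob
import os

# Assumption: CKPT_DIR lives on shared storage (e.g. NFS) so every node
# sees the checkpoints saved by the master node.
CKPT_DIR = "/shared/checkpoints"

def latest_checkpoint(ckpt_dir: str) -> str | None:
    """Return the most recently modified checkpoint, or None if absent."""
    candidates = glob.glob(os.path.join(ckpt_dir, "epoch-*"))
    return max(candidates, key=os.path.getmtime) if candidates else None

ckpt = latest_checkpoint(CKPT_DIR)
if ckpt is not None:
    # With the Booster API the actual restore would look like:
    # booster.load_model(model, ckpt)
    print(f"resuming from {ckpt}")
else:
    print("no checkpoint found, starting from scratch")
```

Combined with `torchrun --max-restarts`, every relaunched worker then resumes from the same shared checkpoint without manual copying.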
### 📚 The doc issue

I want to replace Adam with SGD in [Colossal-LLaMA-2](https://github.com/hpcaitech/ColossalAI/tree/main/applications/Colossal-LLaMA-2) because I don't have enough GPUs, but I do have time to adjust hyper-parameters. Are there any examples...
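Not documentation, but a minimal sketch of the swap itself, assuming the training script constructs its optimizer in one place (the released script uses HybridAdam); the learning rate and momentum here are illustrative, not tuned:

```python
import torch
from torch import nn

model = nn.Linear(16, 16)  # stand-in for the real model

# Before (roughly what the Colossal-LLaMA-2 script does):
# from colossalai.nn.optimizer import HybridAdam
# optimizer = HybridAdam(model.parameters(), lr=2e-5)

# After: SGD keeps a single momentum buffer per parameter instead of
# Adam's two moment buffers, so optimizer-state memory roughly halves.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
```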
### Describe the feature

I found that both examples truncate text longer than max_length, so we have to segment long text into shorter pieces ourselves. For examples/language/llama2, the code...
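As a stopgap, a sketch of the kind of segmentation one currently has to do by hand, assuming a HuggingFace-style tokenizer; `max_length` and the overlap `stride` are illustrative choices:

```python
from typing import Iterator, List

def chunk_token_ids(ids: List[int], max_length: int, stride: int = 0) -> Iterator[List[int]]:
    """Yield windows of at most max_length tokens, overlapping by stride."""
    assert 0 <= stride < max_length
    step = max_length - stride
    for start in range(0, len(ids), step):
        yield ids[start:start + max_length]
        if start + max_length >= len(ids):
            break

# Usage with a hypothetical tokenizer:
# ids = tokenizer(text, add_special_tokens=False)["input_ids"]
# samples = list(chunk_token_ids(ids, max_length=4096, stride=128))
```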