ColossalAI
Making large AI models cheaper, faster and more accessible
### 🐛 Describe the bug When I set the lora_rank in examples/train_sft.sh to 8, the following error occurs: Traceback (most recent call last): File "/home/chaojiewang/NeurIPS_2023/Chatgpt/coati/train_sft.py", line 185, in train(args)...
### 🐛 Describe the bug I adapted the [example](https://github.com/hpcaitech/ColossalAI/tree/main/examples/images/dreambooth) by replacing `export MODEL_NAME="CompVis/stable-diffusion-v1-4"` with `export MODEL_NAME="stabilityai/stable-diffusion-2"`, then ran the script and got the following error. ``` RuntimeError: false INTERNAL ASSERT FAILED...
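A minimal sketch of the adaptation described above, assuming the launch pattern of the linked dreambooth example: the script name `train_dreambooth_colossalai.py`, the data directories, the prompt, and the flag names are assumptions based on that example (which mirrors the diffusers dreambooth script) and may differ across versions.

```bash
# Sketch only: swap the base model in the example's environment setup.
export MODEL_NAME="stabilityai/stable-diffusion-2"   # was "CompVis/stable-diffusion-v1-4"
export INSTANCE_DIR="./instance_images"              # hypothetical local training images
export OUTPUT_DIR="./sd2-dreambooth-output"          # hypothetical output directory

# Flag names follow the dreambooth example; stable-diffusion-2 is trained at 768px.
torchrun --nproc_per_node 1 train_dreambooth_colossalai.py \
  --pretrained_model_name_or_path="$MODEL_NAME" \
  --instance_data_dir="$INSTANCE_DIR" \
  --output_dir="$OUTPUT_DIR" \
  --instance_prompt="a photo of sks dog" \
  --resolution=768
```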
### 📚 The doc issue Are there any examples of running on multiple nodes?
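No multi-node walkthrough appears in this listing; as a hedged sketch (the IP address, the hostfile, and the training script name `train.py` are placeholders), a job can be spread across two nodes either with plain `torchrun` or with the `colossalai run` launcher:

```bash
# Option 1: standard torchrun, started once on each node with a matching --node_rank.
torchrun --nnodes=2 --node_rank=0 \
  --master_addr=10.0.0.1 --master_port=29500 \
  --nproc_per_node=8 train.py        # repeat on node 1 with --node_rank=1

# Option 2: the colossalai launcher, started once, reading the node list from a hostfile.
colossalai run --nproc_per_node 8 --hostfile ./hostfile --master_port 29500 train.py
```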
### 🐛 Describe the bug Code: torchrun --standalone --nproc_per_node=1 train_reward_model.py --dataset Dahoas/rm-static --subset ../../../datasets/Dahoas_rm-static --max_len 512 --model gpt2 --pretrain ../../../gpt2/gpt2-small --lora_rank 0 --max_epochs 1 --batch_size 1 --loss_fn log_sig --test...
### 🐛 Describe the bug (ColossalAI-Chat) tt@visiondev-SYS-4029GP-TRT:/data3/samba_css/chatgpt/ColossalAI/applications/Chat/examples$ colossalai check -i /home/tt/anaconda3/envs/ColossalAI-Chat/lib/python3.10/site-packages/torch/library.py:130: UserWarning: Overriding a previously registered kernel for the same operator and the same dispatch key operator: aten::eye.m_out(int n, int...
### 🐛 Describe the bug LlamaRM is not a Hugging Face transformers module but a LoraModule, while LlamaRM.model is a Hugging Face transformers model. So LlamaRM has no method `resize_token_embeddings`, but LlamaRM.model does....
#### GPU: 40G A100 × 8 I want to train the 7B LLaMA model on 40G A100 GPUs, but it reports that there is not enough GPU memory. The training command is: `torchrun --standalone...
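As a hedged sketch of one way to fit a 7B model on 40G cards: the script name, paths, and flag names below follow the example commands quoted elsewhere in these reports and are assumptions about `train_sft.py`'s interface that may differ across versions; sharding optimizer state with ZeRO-2 and enabling LoRA typically lowers per-GPU memory.

```bash
# Sketch only: paths are placeholders; flags follow the example commands above.
torchrun --standalone --nproc_per_node=8 train_sft.py \
  --model llama \
  --pretrain /path/to/llama-7b \
  --strategy colossalai_zero2 \
  --lora_rank 8 \
  --batch_size 1 \
  --max_epochs 1
```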
### 🐛 Describe the bug When I use the "colossalai_zero" strategy to train the RM model, it spends a lot of time loading the optimizer. I am very...
### 🐛 Describe the bug When running the Stage 3 code https://github.com/hpcaitech/ColossalAI/blob/main/applications/Chat/examples/train_prompts.py with LLaMA, this bug is encountered at line 137: `tokenizer = prepare_llama_tokenizer_and_embedding(tokenizer, actor)`. The details of this bug:...
### 🐛 Describe the bug When training GPT2-S on a single card on Colab with `!torchrun --standalone --nproc_per_node 1 benchmark_gpt_dummy.py --model s --strategy colossalai_gemini_cpu --experience_batch_size 1 --train_batch_size 1`, I hit a bug...