Hongxin Liu
Hi, please rebase onto the main branch, as we've fixed some bugs.
> @Fazziekey @FrankLeeeee Same OOM issue. The setup is an A100 40GB, 1 GPU running the llama7B model, batch=1, max_seq_len=512, colossalai_zero2 with placement_policy='cuda'; using torch.cuda.memory_allocated() to analyze memory usage, in SFTTrainer self.optimizer =...
I've added a "colossalai_zero2_cpu" strategy for this script. I tested it on 4x A100 40GB and it works.
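For context, the new strategy is essentially ZeRO-2 with optimizer states placed on CPU instead of GPU. A minimal sketch of how it might be wired up in the script (the exact class and argument names here are assumptions and may differ from the actual code):

```python
# Sketch only: assumes coati's ColossalAIStrategy exposes `stage` and
# `placement_policy`; check train_sft.py for the exact API.
from coati.trainer.strategies import ColossalAIStrategy

def build_strategy(name: str):
    if name == 'colossalai_zero2':
        # ZeRO-2 with optimizer states kept on GPU
        return ColossalAIStrategy(stage=2, placement_policy='cuda')
    elif name == 'colossalai_zero2_cpu':
        # ZeRO-2 with optimizer states offloaded to CPU to reduce GPU memory
        return ColossalAIStrategy(stage=2, placement_policy='cpu')
    raise ValueError(f'Unknown strategy: {name}')
```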
> Can you update the test files in `tests/test_fx/test_tracer/test_torchaudio_model`?

Done.
How many nodes did you use?
Hi, it only offloads optimizer states to disk (see the sketch below). It seems your error is not related to the optimizer.
Can you provide more info?
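For reference on the disk offload mentioned above, here is a minimal sketch of offloading optimizer states to disk with ColossalAI's HybridAdam; the `nvme_offload_fraction` / `nvme_offload_dir` arguments are my assumption of the relevant knobs, so please check the version you have installed:

```python
# Sketch only: keep optimizer states on disk (NVMe) instead of GPU/CPU memory.
# Assumes HybridAdam supports nvme_offload_fraction / nvme_offload_dir and that
# the optional `tensornvme` package is installed.
import torch
from colossalai.nn.optimizer import HybridAdam

model = torch.nn.Linear(1024, 1024).cuda()
optimizer = HybridAdam(
    model.parameters(),
    lr=1e-3,
    nvme_offload_fraction=1.0,     # fraction of optimizer states placed on disk
    nvme_offload_dir='./offload',  # directory used for the offloaded states
)
```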
What is your Python environment?
How did you install colossalai? There is no `op_builder/` folder in my `site-packages/` folder. 
@Fazziekey Could you answer this question?