puppet101 comments

Results: 25 comments by puppet101

Strange mesh in testing

@dldaisy can you share your testing code for in-the-wild images?

Some data are missing in the Mirrored-Human dataset

First, 'mirrored-human.json' lists 270 video URLs, but 'mirrored-human-base.zip' contains only 204 clips. Moreover, just using the name like 'raw_***', I cannot...

Training the model failed

The batch size is 1 and all other parameters are unchanged. Can you train any model correctly using this code? The training seems to go wrong suddenly. When I...

Can I fine-tune GPT-Neo-XT-Chat-Base-20B with 8 A100s?

Can I fine-tune the model on 8x V100 32GB GPUs with a smaller batch size?

Error when training yi-34b with zero3_offload + sequence parallelism

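Hello, thanks for the reply. I tried 8k with sp2 on my side, but I still hit the same problem. Could you share your runtime environment? My current configuration file is: [yi_34b_200k_full_alpaca_zh_32k_sp8.log](https://github.com/InternLM/xtuner/files/15056004/yi_34b_200k_full_alpaca_zh_32k_sp8.log) My runtime environment is: deepspeed 0.14.1, transformers 4.40.0, xtuner 0.1.18.dev0, torch 2.0.0+cu118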
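
A version list like the one in this comment can be gathered programmatically rather than typed by hand. The following is a minimal sketch using only the Python standard library; the helper name `collect_versions` is hypothetical and is not part of xtuner or deepspeed.

```python
from importlib.metadata import PackageNotFoundError, version


def collect_versions(packages):
    """Map each package name to its installed version, or 'not installed'."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # The package is missing from the current environment.
            found[name] = "not installed"
    return found


if __name__ == "__main__":
    # The four packages named in the comment above.
    report = collect_versions(["deepspeed", "transformers", "xtuner", "torch"])
    for name, ver in report.items():
        print(f"{name} {ver}")
```

Pasting this output into an issue makes it easier to compare two environments line by line.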

Error when training yi-34b with zero3_offload + sequence parallelism

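Hello, I have identified the problem. Previously, no matter how I changed the sequence parallel settings, I got the same error. After I downgraded deepspeed from 0.14.0 to 0.12.3, the error went away. Thanks for the patient answers! I also have another question: training now runs, but the number of training steps looks wrong. My settings are: sequence_parallel_size=8 batch_size = 1 accumulative_counts = 8 max_epochs = 3. With the alpaca_ch dataset, the total number of training steps is only 32. That does not seem right: the alpaca-data-gpt4-chinese dataset has more than 50,000 samples, so with 3 epochs the total should not be just 32 steps. Could you please take a look? Thanks!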
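
The expectation in this comment can be checked with a rough calculation. The sketch below rests on several assumptions that the comment does not state: 8 GPUs in total, exactly 50,000 samples (the comment only says "more than 50,000"), one sample per sequence with no packing of multiple samples into a long sequence, and all GPUs in a sequence-parallel group sharing the same samples. The function name `expected_optimizer_steps` is hypothetical and is not an xtuner API.

```python
import math


def expected_optimizer_steps(num_samples, world_size, sequence_parallel_size,
                             batch_size, accumulative_counts, max_epochs):
    """Estimate optimizer steps under the assumptions stated above."""
    if world_size % sequence_parallel_size != 0:
        raise ValueError("world_size must be divisible by sequence_parallel_size")
    # GPUs inside one sequence-parallel group split a single sequence,
    # so only world_size / sequence_parallel_size groups consume distinct data.
    data_parallel_size = world_size // sequence_parallel_size
    samples_per_step = batch_size * data_parallel_size * accumulative_counts
    steps_per_epoch = math.ceil(num_samples / samples_per_step)
    return steps_per_epoch * max_epochs


# Assumed values: 50,000 samples and 8 GPUs; the rest come from the comment.
print(expected_optimizer_steps(50_000, 8, 8, 1, 8, 3))  # 18750
```

Under these assumptions the estimate is several orders of magnitude above 32, which is consistent with the concern raised in the comment. If the configuration packs many short samples into each long sequence, the effective sample count would be much smaller than 50,000; the comment does not say whether that is the case.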