Perry Li comments

Results: 24 comments by Perry Li

PyTorch 1.8 doesn't have the function "get_autocast_gpu_dtype()"

> Has anyone solved it? I encounter the same error with torch-1.9.0.

Updating to torch-1.11 solves this problem. torch-1.8 and torch-1.9.0 do not have this function.
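A feature check can make this kind of mismatch fail early with a clear message instead of an AttributeError deep inside training. The following is only a sketch: `has_autocast_gpu_dtype` is a hypothetical helper name, and the two `SimpleNamespace` objects are stand-ins so the example runs without PyTorch installed.

```python
from types import SimpleNamespace


def has_autocast_gpu_dtype(torch_module) -> bool:
    """Return True if the given module exposes a callable get_autocast_gpu_dtype."""
    return callable(getattr(torch_module, "get_autocast_gpu_dtype", None))


# Hypothetical stand-ins for an older and a newer torch build.
old_torch = SimpleNamespace(__version__="1.8.0")
new_torch = SimpleNamespace(
    __version__="1.11.0",
    get_autocast_gpu_dtype=lambda: "float16",
)

for mod in (old_torch, new_torch):
    status = "available" if has_autocast_gpu_dtype(mod) else "missing, upgrade torch"
    print(mod.__version__, status)
```

In real code you would pass the imported module itself, i.e. `has_autocast_gpu_dtype(torch)`.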

Training log prints epoch values of 0.5, 1.0, 1.5, 2.0

If logging is done by step, the current equivalent epoch count is computed as n_step / len(dataloader).
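The arithmetic can be sketched as follows. The function name `equivalent_epoch` and the value of 100 steps per epoch are assumptions chosen for illustration; `steps_per_epoch` stands in for `len(dataloader)`.

```python
def equivalent_epoch(n_step: int, steps_per_epoch: int) -> float:
    """Fractional epoch reached after n_step optimizer steps."""
    if steps_per_epoch <= 0:
        raise ValueError("steps_per_epoch must be positive")
    return n_step / steps_per_epoch


steps_per_epoch = 100  # assumed value standing in for len(dataloader)
logging_steps = 50     # assumed: log every half epoch

logged = [
    equivalent_epoch(step, steps_per_epoch)
    for step in range(logging_steps, 2 * steps_per_epoch + 1, logging_steps)
]
print(logged)
```

With a logging interval of half an epoch, the logged values are fractional, which matches the pattern in the title.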

[BUG/Help] <title> During ChatGLM training there is a long H2D copy after every step. What is this memcpy doing?

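It has nothing to do with the model itself; it is most likely an operation the training framework performs automatically. For example, with DeepSpeed's offload strategy enabled, optimizer states are offloaded to CPU memory during training, so each parameter update has to move data between GPU memory and CPU memory, which shows up as copies. See [DeepSpeed之ZeRO系列：将显存优化进行到底](https://zhuanlan.zhihu.com/p/513571706) (in Chinese). Different optimization strategies partition and offload differently, so the details depend on your specific training configuration.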
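To check whether a given run offloads optimizer state, you can inspect the DeepSpeed config. The dictionary below is an assumed example, not the configuration from the issue; `optimizer_offload_device` is a hypothetical helper that simply reads the `zero_optimization.offload_optimizer.device` key.

```python
import json

# Assumed example config: ZeRO stage 2 with optimizer state offloaded to CPU.
ds_config = {
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    }
}


def optimizer_offload_device(config: dict):
    """Return the configured optimizer offload device, or None if unset."""
    zero = config.get("zero_optimization", {})
    return zero.get("offload_optimizer", {}).get("device")


print(json.dumps(ds_config, indent=2))
print("optimizer offload device:", optimizer_offload_device(ds_config))
```

If this returns `"cpu"`, host/device copies around the optimizer step are expected.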

Default process group has not been initialized, please make sure to call init_process_group

Launch it in distributed mode with torchrun.

Default process group has not been initialized, please make sure to call init_process_group

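> > Launch it in distributed mode with torchrun.
>
> Thanks for the reply! Could you explain the steps in more detail? Do I need to adapt main.py for distributed training first, then run sh train.sh, and then launch it with torchrun? Thanks!

https://github.com/THUDM/ChatGLM-6B/blob/main/ptuning/trainer.py#L1532

The official default code shows that DDP is only enabled when `local_rank != -1` in the training arguments. One of your settings is probably affecting that training argument. If you don't need multi-GPU training, try setting the command-line argument `--local_rank -1` manually.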
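The condition described above can be illustrated with a simplified sketch. This is not the code from trainer.py; `parse_local_rank` and `uses_ddp` are hypothetical names, and the parser models only the `--local_rank` argument.

```python
import argparse


def parse_local_rank(argv) -> int:
    """Read --local_rank from a list of command-line arguments (default -1)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=-1)
    args, _unknown = parser.parse_known_args(argv)
    return args.local_rank


def uses_ddp(local_rank: int) -> bool:
    """Mirror the described rule: DDP is enabled only when local_rank != -1."""
    return local_rank != -1


for argv in (["--local_rank", "-1"], ["--local_rank", "0"], []):
    rank = parse_local_rank(argv)
    print(argv, "-> local_rank =", rank, "| DDP:", uses_ddp(rank))
```

Passing `--local_rank -1` explicitly keeps the single-process path even if something else in the launch setup would otherwise set a rank.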

[Feature] ChatGLM dropout

### Is your feature request related to a problem? Please describe. Hello, author I am working on ChatGLM full-parameter fine-tuning, but I found that there is no code for Dropout...

Where is stream_chat defined?

Look in the Hugging Face model repository, line 1293: https://huggingface.co/THUDM/chatglm-6b/blob/1d240ba371910e9282298d4592532d7f0f3e9f3e/modeling_chatglm.py#L1293

[Help] <Full-parameter fine-tuning with deepspeed runs out of memory>

It looks more like the disk has run out of space.

[Help] <Full-parameter fine-tuning with deepspeed runs out of memory>

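> > It looks more like the disk has run out of space.
>
> Have you run it successfully? How many GPUs, how much memory, and how much disk space did you use?

8*A800; a little over 100 GB of memory was enough at runtime. Disk usage is small: one checkpoint plus the other parameters comes to less than 10 GB.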
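A quick way to rule out the disk-space explanation is to check free space on the output directory before training. This is a minimal sketch using only the standard library; the helper names and the 10 GiB default (taken from the checkpoint size mentioned above) are assumptions.

```python
import shutil


def free_gib(path: str = ".") -> float:
    """Free space, in GiB, on the filesystem that contains path."""
    return shutil.disk_usage(path).free / 2**30


def has_room_for_checkpoint(path: str = ".", needed_gib: float = 10.0) -> bool:
    """True if at least needed_gib GiB are free on the filesystem of path."""
    return free_gib(path) >= needed_gib


print(f"free space: {free_gib('.'):.1f} GiB")
print("room for one ~10 GiB checkpoint:", has_room_for_checkpoint("."))
```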

[Help] <Full-parameter fine-tuning with deepspeed runs out of memory>

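> > > > It looks more like the disk has run out of space.
> > >
> > > Have you run it successfully? How many GPUs, how much memory, and how much disk space did you use?
> >
> > 8*A800; a little over 100 GB of memory was enough at runtime. Disk usage is small: one checkpoint plus the other parameters comes to less than 10 GB.
>
> ![image](https://user-images.githubusercontent.com/45615979/242775623-f0b8f620-6a87-4b2b-a7da-cf9edb217a45.png)
>
> I have 5 RTX3090s, 304GB of memory, and a 1.9TB disk, ...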