Lin Chenjian
I also need to extract p.grad for subsequent calculations. Is there any way to get p.grad correctly? I have read the above code but still don't know how to do...
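For reference, here is a minimal sketch of one way to capture gradients in plain PyTorch, using `Tensor.register_hook` so the values are copied at the moment they are computed. It assumes an ordinary `nn.Module` with no ZeRO wrapper; `capture_grads` is a hypothetical helper name, not part of Open-Sora or ColossalAI. Whether the same hooks give the values you need under a sharded optimizer is not confirmed here, since such wrappers may reduce or move gradients after the hook fires.

```python
import torch
import torch.nn as nn


def capture_grads(model):
    """Register hooks that copy each trainable parameter's gradient
    as soon as autograd computes it (hypothetical helper)."""
    captured = {}
    handles = []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue  # frozen parameters never receive a gradient

        def hook(grad, name=name):
            # Store a detached copy; returning None leaves the gradient unchanged.
            captured[name] = grad.detach().clone()

        handles.append(p.register_hook(hook))
    return captured, handles


torch.manual_seed(0)
model = nn.Linear(4, 2)
captured, handles = capture_grads(model)

loss = model(torch.randn(8, 4)).pow(2).mean()
loss.backward()

# Remove the hooks once they are no longer needed.
for h in handles:
    h.remove()

for name, g in captured.items():
    print(name, tuple(g.shape))
```

With a single backward pass and no wrapper, the captured tensors match `p.grad`; the copies remain available even if something later clears `p.grad`.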
> @xbyym You can insert `print(batch)` below this line, https://github.com/hpcaitech/Open-Sora/blob/main/scripts/train.py#L264, and take a look.

I ran into the same problem. In most iterations, `print(batch)` shows no data at all, and only a very few epochs train normally. Why is that?
I have run into the same problem.
> I have run into the same problem.

I have solved this problem. You can change `bucket_config` to use a smaller batch size, and then it works.
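The suggestion above can be sketched as a small helper that scales down every batch size in a bucket configuration. This assumes each entry has the shape `{resolution: {num_frames: (prob, batch_size)}}`, where `prob` may itself be a tuple; `shrink_batch_sizes` and the sample values are illustrative only, not part of Open-Sora.

```python
def shrink_batch_sizes(bucket_config, factor=0.5, min_batch_size=1):
    """Return a copy of bucket_config with every batch size scaled by `factor`.

    Assumes entries look like {resolution: {num_frames: (prob, batch_size)}}.
    Entries whose batch size is None are left untouched.
    """
    smaller = {}
    for resolution, frames in bucket_config.items():
        smaller[resolution] = {}
        for num_frames, (prob, batch_size) in frames.items():
            if batch_size is not None:
                batch_size = max(min_batch_size, int(batch_size * factor))
            smaller[resolution][num_frames] = (prob, batch_size)
    return smaller


# Illustrative values only; use the entries from your own config file.
example = {
    "144p": {1: (1.0, 475), 51: (1.0, 2), 102: ((1.0, 0.33), 2)},
}
smaller = shrink_batch_sizes(example, factor=0.5)
print(smaller)
```

The helper returns a new dictionary rather than editing the config in place, so the original values stay available for comparison.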
You can find it in /Open-Sora/configs/opensora-v1-2/train/stage1.py
```
bucket_config = {  # 12s/it
    "144p": {1: (1.0, 475), 51: (1.0, 2), 102: ((1.0, 0.33), 2), 204: ((1.0, 0.1), 13), 408: ((1.0, 0.1),...
```
> > buck_config > > pleas > > > > I have meet the same question > > > > > > i have solve this problem.you can change buck_config...
Sorry to bother you, could you please describe it in more detail? Because I am using the 0.3.6 version of colossalai, I put the following code in the corresponding position...
> Please share a minimal script to reproduce the error. Your code is wrong, as `_run_reduction` reduces grads for all bucketed parameters. As far as I can tell, non-trainable params...
> You can get the grads this way, described in the issue you mentioned [hpcaitech/Open-Sora#283 (comment)](https://github.com/hpcaitech/Open-Sora/issues/283#issuecomment-2185800300)

I have read the above code before, but it did not involve zero_optimizer in...
> Does your training code involve an optimizer? That's what you're looking for.

Sorry to bother you again, I will refine my question. The following is a minimal reproduction of...