zero-bubble-pipeline-parallelism

[QUESTION] 1f1b is faster than zero-v

kuangdao opened this issue 8 months ago · 0 comments

I tested LLaMA 2 13B on A800 GPUs with pipeline parallelism = 4, micro-batch-size = 1, and global-batch-size = 64. The 1F1B log (plain 1F1B, no virtual pipeline):

    iteration 1/ 500000 | consumed samples: 64 | elapsed time per iteration (ms): 23376.6 | learning rate: 4.687E-08 | global batch size: 64 | lm loss: 1.123916E+01 | loss scale: 1.0 | grad norm: 121.332 | number of skipped iterations: 0 | number of nan iterations: 0 |
    iteration 2/ 500000 | consumed samples: 128 | elapsed time per iteration (ms): 15149.3 | learning rate: 9.375E-08 | global batch size: 64 | lm loss: 1.138808E+01 | loss scale: 1.0 | grad norm: 15.865 | number of skipped iterations: 0 | number of nan iterations: 0 |
    iteration 3/ 500000 | consumed samples: 192 | elapsed time per iteration (ms): 15153.5 | learning rate: 1.406E-07 | global batch size: 64 | lm loss: 1.138511E+01 | loss scale: 1.0 | grad norm: 15.744 | number of skipped iterations: 0 | number of nan iterations: 0 |
    iteration 4/ 500000 | consumed samples: 256 | elapsed time per iteration (ms): 15154.5 | learning rate: 1.875E-07 | global batch size: 64 | lm loss: 1.131369E+01 | loss scale: 1.0 | grad norm: 62.191 | number of skipped iterations: 0 | number of nan iterations: 0 |

The zero-v log:

    iteration 1/ 500000 | consumed samples: 64 | elapsed time per iteration (ms): 23561.4 | learning rate: 4.687E-08 | global batch size: 64 | lm loss: 1.037349E+01 | loss scale: 1.0 | grad norm: 2.278 | number of skipped iterations: 0 | number of nan iterations: 0 |
    iteration 2/ 500000 | consumed samples: 128 | elapsed time per iteration (ms): 15432.7 | learning rate: 9.375E-08 | global batch size: 64 | lm loss: 1.037349E+01 | loss scale: 1.0 | grad norm: 0.453 | number of skipped iterations: 0 | number of nan iterations: 0 |
    iteration 3/ 500000 | consumed samples: 192 | elapsed time per iteration (ms): 16140.2 | learning rate: 1.406E-07 | global batch size: 64 | lm loss: 1.037348E+01 | loss scale: 1.0 | grad norm: 0.442 | number of skipped iterations: 0 | number of nan iterations: 0 |
    iteration 4/ 500000 | consumed samples: 256 | elapsed time per iteration (ms): 16202.1 | learning rate: 1.875E-07 | global batch size: 64 | lm loss: 1.037344E+01 | loss scale: 1.0 | grad norm: 1.198 | number of skipped iterations: 0 | number of nan iterations: 0 |
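
Averaging the steady-state iterations (2-4, skipping the warm-up at iteration 1) puts zero-v roughly 5% behind 1F1B in this run; a quick shell check using the elapsed times from the two logs above, rounded to whole milliseconds:

    # steady-state average (iterations 2-4) from the two logs, in ms
    echo "1f1b:   $(( (15149 + 15154 + 15155) / 3 )) ms"   # ~15152 ms
    echo "zero-v: $(( (15433 + 16140 + 16202) / 3 )) ms"   # ~15925 ms, ~5% slower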

For zero-v I add these flags:

    --zero-bubble-v-schedule \
    --allow-padding-num-layers \
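
For reference, a minimal sketch of how these two flags might sit in a full Megatron-style launch command for the setup described above (PP = 4, micro-batch 1, global batch 64); `$DISTRIBUTED_ARGS`, `$OTHER_ARGS`, and the script name are placeholders, not the exact command used:

    # hypothetical launch fragment; only the last two flags are specific to
    # zero-bubble-pipeline-parallelism, the rest restate the reported setup
    torchrun $DISTRIBUTED_ARGS pretrain_gpt.py \
        --pipeline-model-parallel-size 4 \
        --micro-batch-size 1 \
        --global-batch-size 64 \
        --zero-bubble-v-schedule \
        --allow-padding-num-layers \
        $OTHER_ARGS   # model/data/optimizer arguments omitted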

kuangdao · Jun 13 '24 07:06