Organise step logic
The idea is to show:
- A progress bar with the actual total step count
- The same steps being logged and reported on the progress bar
- A training step counted as an optimizer step, to be consistent with other libraries like transformers.Trainer and Axolotl (see the sketch after this list)
- Some QPS and grad norm metrics
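To make the intent concrete, here is a minimal, self-contained sketch of the step-counting logic on a toy model. This is my own illustration, not torchtune's recipe code: the dummy data, model, and all variable names are placeholders; only the counting pattern matters.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from tqdm import tqdm

# Dummy data/model so the sketch runs end to end; none of this is recipe code.
dataset = TensorDataset(torch.randn(100, 8), torch.randn(100, 1))
dataloader = DataLoader(dataset, batch_size=1)
model = torch.nn.Linear(8, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

epochs = 2
gradient_accumulation_steps = 5
max_steps_per_epoch = 10
log_every_n_steps = 5

# Total shown on the progress bar = optimizer steps per epoch, not batches.
steps_per_epoch = min(len(dataloader) // gradient_accumulation_steps, max_steps_per_epoch)

total_training_steps = 0
for epoch in range(epochs):
    pbar = tqdm(total=steps_per_epoch)
    for idx, (x, y) in enumerate(dataloader):
        loss = torch.nn.functional.mse_loss(model(x), y) / gradient_accumulation_steps
        loss.backward()
        # Count a training step only when the optimizer actually steps,
        # matching the transformers.Trainer / Axolotl convention.
        if (idx + 1) % gradient_accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
            total_training_steps += 1
            pbar.update(1)
            if total_training_steps % log_every_n_steps == 0:
                # NOTE: this logs only the last micro-batch's scaled loss;
                # the running-loss discussion later in the thread refines this.
                print(f"Step {total_training_steps} | loss:{loss.item()}")
            if (idx + 1) // gradient_accumulation_steps == steps_per_epoch:
                break
    pbar.close()
```

The key point is that the progress bar update and the metric logging only fire when the optimizer actually steps, so the bar's total, the logged step numbers, and other trainers' notion of a "step" all line up.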
Are these the types of QPS you were expecting, @msaroufim?
Running this command:
tune run full_finetune_single_device --config my_dev/tiny_llama.yaml \
log_every_n_steps=5 \
max_steps_per_epoch=10 \
epochs=2
you get a progress bar with only 10 steps:
INFO:torchtune.utils.logging:Dataset and Sampler are initialized.
1|10|Loss: 10.452272415161133: 100%|█████████████████████████████████████████████████████████████████| 10/10 [00:03<00:00, 3.65it/s]
INFO:torchtune.utils.logging:Model checkpoint of size 0.08 GB saved to my_dev/tiny_llama/checkpoints/torchtune_model_0.pt
INFO:torchtune.utils.logging:Recipe checkpoint of size 0.16 GB saved to my_dev/tiny_llama/checkpoints/recipe_state.pt
1|10|Loss: 10.452272415161133: 100%|█████████████████████████████████████████████████████████████████| 10/10 [00:03<00:00, 2.85it/s]
  0%|                                                                             | 0/10 [00:00<?, ?it/s]
INFO:torchtune.utils.logging:Model checkpoint of size 0.08 GB saved to my_dev/tiny_llama/checkpoints/torchtune_model_1.pt
2|20|Loss: 10.449804306030273: 100%|█████████████████████████████████████████████████████████████████| 10/10 [00:05<00:00, 1.78it/s]
and you get only 2 logs per epoch:
Step 0 | loss:10.565682411193848 lr:2e-05 gpu_resources:0 tokens_per_second:456.04617469323597 iterations_per_second:1.6523412126566521 grad_norm:16.26054121046421
Step 5 | loss:10.421597480773926 lr:2e-05 gpu_resources:0 tokens_per_second:620.2041065388681 iterations_per_second:3.828420410733754 grad_norm:23.697011790378774
Step 10 | loss:10.490856170654297 lr:2e-05 gpu_resources:0 tokens_per_second:376.36752451194593 iterations_per_second:1.344169730399807 grad_norm:15.191259666319432
Step 15 | loss:10.376498222351074 lr:2e-05 gpu_resources:0 tokens_per_second:1275.976294664742 iterations_per_second:3.2385185143775175 grad_norm:12.678521182631085
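For context on the tokens_per_second numbers above, here is a rough sketch of how such a throughput metric can be computed; this is my own illustration, and the recipe's actual bookkeeping may differ. The idea is just to count the tokens seen since the last log and divide by the elapsed wall-clock time.

```python
import time

class ThroughputTracker:
    """Tracks tokens/sec between logged steps (illustrative sketch, not recipe code)."""

    def __init__(self) -> None:
        self.num_tokens = 0
        self.last_log_time = time.perf_counter()

    def update(self, batch_num_tokens: int) -> None:
        # Accumulate tokens seen since the last log (ideally non-padding tokens only).
        self.num_tokens += batch_num_tokens

    def log(self, step: int) -> None:
        # Tokens per second over the window since the previous log, then reset.
        now = time.perf_counter()
        tokens_per_second = self.num_tokens / (now - self.last_log_time)
        print(f"Step {step} | tokens_per_second:{tokens_per_second}")
        self.num_tokens, self.last_log_time = 0, now
```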
Gradient accumulation doesn't actually change any of the behaviour shown above; it only changes the total length of the steps, since each optimizer step now covers several batches.
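A quick worked example of that, under my assumption that each optimizer step consumes gradient_accumulation_steps batches:

```python
# My own arithmetic, not recipe output: with the settings below, each epoch still
# shows 10 optimizer steps on the progress bar and in the logs, but it now consumes
# 10 * 5 = 50 batches, so each step simply takes longer.
gradient_accumulation_steps = 5
max_steps_per_epoch = 10
batches_per_epoch = max_steps_per_epoch * gradient_accumulation_steps
print(batches_per_epoch)  # 50
```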
Confirmed that this works as expected both with optimizer_in_bwd=False and gradient accumulation on, and with optimizer_in_bwd=True and gradient accumulation off:
tune run full_finetune_single_device --config llama2/7B_full_low_memory optimizer_in_bwd=False gradient_accumulation_steps=5 max_steps_per_epoch=10
tune run full_finetune_single_device --config llama2/7B_full_low_memory
Hello guys, what else do we need here? Can I add these changes to the other recipes?
Sorry for the delay in responding. For the sake of not writing the same reply in 20 places, I'll just link to my comment here. Happy to let you take over that PR or absorb the changes into this one; either is fine with me.
I added the changes you suggested for computing the right number of tokens. I am not sure the GPU metrics are something you want to log very often, as they are expensive to compute. Do you know why the grad accum test doesn't pass?
> I am not sure the GPU metrics are something you want to log very often, as they are expensive to compute.
We could consider adding a flag like log_memory_stats to the config to enable/disable this. What do you think?
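Something along these lines, perhaps. This is only a sketch of the idea: log_memory_stats is the hypothetical flag name from the comment above, and the torch.cuda calls are standard PyTorch queries standing in for whatever stats the recipe actually collects.

```python
import torch

def maybe_log_memory_stats(step: int, log_memory_stats: bool) -> None:
    """Log GPU memory stats only when the opt-in flag is set (illustrative sketch)."""
    if not (log_memory_stats and torch.cuda.is_available()):
        return
    # Standard torch.cuda queries; still worth gating behind a flag if the
    # logging cadence is high or more expensive stats get added later.
    stats = {
        "gpu_mem_allocated_gb": torch.cuda.memory_allocated() / 1e9,
        "gpu_mem_reserved_gb": torch.cuda.max_memory_reserved() / 1e9,
    }
    print(f"Step {step} | " + " ".join(f"{k}:{v}" for k, v in stats.items()))
```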
> Do you know why the grad accum test doesn't pass?
It has to do with how we infer the loss for the test based on the logs (see here). Previously, whenever self.total_training_steps % self._log_every_n_steps == 0, we would log on every iteration of that step, so we would actually get a list of loss values (hence the np.mean in L239 of the code pointer above). With your change we only logged on the final iteration of the step, so the logged value would just be the loss of the last iteration, not the average over all the iterations. This is why I added the running loss instead: it accumulates the losses over each iteration and then logs the mean for the step.
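A rough sketch of that running-loss pattern, reusing the toy setup from the earlier sketch in this thread; again this is my own illustration, not the recipe's exact code.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy data/model so the sketch runs on its own.
dataset = TensorDataset(torch.randn(100, 8), torch.randn(100, 1))
dataloader = DataLoader(dataset, batch_size=1)
model = torch.nn.Linear(8, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

gradient_accumulation_steps, log_every_n_steps = 5, 5
running_loss, total_training_steps = 0.0, 0

for idx, (x, y) in enumerate(dataloader):
    loss = torch.nn.functional.mse_loss(model(x), y) / gradient_accumulation_steps
    running_loss += loss.item()  # accumulate each micro-batch's (scaled) loss
    loss.backward()
    if (idx + 1) % gradient_accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
        total_training_steps += 1
        if total_training_steps % log_every_n_steps == 0:
            # The logged value is the mean over the step's iterations,
            # not just the last iteration's loss.
            print(f"Step {total_training_steps} | loss:{running_loss}")
        running_loss = 0.0
```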
Closing this as it was addressed in #831.