Organise step logic
The idea is to show:
- A progress bar with the actual total step count
- The same steps being logged and reported on the progress bar
- A training step counted as an optimizer step, to be consistent with other libraries like transformers.Trainer and Axolotl (see the sketch after this list)
- Some QPS and grad norm metrics
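To make the intent concrete, here is a minimal, self-contained sketch of the step-counting logic on a toy model. This is my own illustration, not torchtune's recipe code: the dummy data, model, and all variable names are placeholders; only the counting pattern matters.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from tqdm import tqdm

# Dummy data/model so the sketch runs end to end; none of this is recipe code.
dataset = TensorDataset(torch.randn(100, 8), torch.randn(100, 1))
dataloader = DataLoader(dataset, batch_size=1)
model = torch.nn.Linear(8, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

epochs = 2
gradient_accumulation_steps = 5
max_steps_per_epoch = 10
log_every_n_steps = 5

# Total shown on the progress bar = optimizer steps per epoch, not batches.
steps_per_epoch = min(len(dataloader) // gradient_accumulation_steps, max_steps_per_epoch)

total_training_steps = 0
for epoch in range(epochs):
    pbar = tqdm(total=steps_per_epoch)
    for idx, (x, y) in enumerate(dataloader):
        loss = torch.nn.functional.mse_loss(model(x), y) / gradient_accumulation_steps
        loss.backward()
        # Count a training step only when the optimizer actually steps,
        # matching the transformers.Trainer / Axolotl convention.
        if (idx + 1) % gradient_accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
            total_training_steps += 1
            pbar.update(1)
            if total_training_steps % log_every_n_steps == 0:
                # NOTE: this logs only the last micro-batch's scaled loss;
                # the running-loss discussion later in the thread refines this.
                print(f"Step {total_training_steps} | loss:{loss.item()}")
            if (idx + 1) // gradient_accumulation_steps == steps_per_epoch:
                break
    pbar.close()
```

The key point is that the progress bar update and the metric logging only fire when the optimizer actually steps, so the bar's total, the logged step numbers, and other trainers' notion of a "step" all line up.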
Are these the types of QPS you were expecting, @msaroufim?
Running this command:
tune run full_finetune_single_device --config my_dev/tiny_llama.yaml \
log_every_n_steps=5 \
max_steps_per_epoch=10 \
epochs=2
you get a progress bar with only 10 steps:
INFO:torchtune.utils.logging:Dataset and Sampler are initialized.
1|10|Loss: 10.452272415161133: 100%|█████████████████████████████████████████████████████████████████| 10/10 [00:03<00:00, 3.65it/s]
INFO:torchtune.utils.logging:Model checkpoint of size 0.08 GB saved to my_dev/tiny_llama/checkpoints/torchtune_model_0.pt
INFO:torchtune.utils.logging:Recipe checkpoint of size 0.16 GB saved to my_dev/tiny_llama/checkpoints/recipe_state.pt
1|10|Loss: 10.452272415161133: 100%|█████████████████████████████████████████████████████████████████| 10/10 [00:03<00:00, 2.85it/s]
  0%|                                                                             | 0/10 [00:00<?, ?it/s]
INFO:torchtune.utils.logging:Model checkpoint of size 0.08 GB saved to my_dev/tiny_llama/checkpoints/torchtune_model_1.pt
2|20|Loss: 10.449804306030273: 100%|█████████████████████████████████████████████████████████████████| 10/10 [00:05<00:00, 1.78it/s]
and you get only 2 logs per epoch:
Step 0 | loss:10.565682411193848 lr:2e-05 gpu_resources:0 tokens_per_second:456.04617469323597 iterations_per_second:1.6523412126566521 grad_norm:16.26054121046421
Step 5 | loss:10.421597480773926 lr:2e-05 gpu_resources:0 tokens_per_second:620.2041065388681 iterations_per_second:3.828420410733754 grad_norm:23.697011790378774
Step 10 | loss:10.490856170654297 lr:2e-05 gpu_resources:0 tokens_per_second:376.36752451194593 iterations_per_second:1.344169730399807 grad_norm:15.191259666319432
Step 15 | loss:10.376498222351074 lr:2e-05 gpu_resources:0 tokens_per_second:1275.976294664742 iterations_per_second:3.2385185143775175 grad_norm:12.678521182631085
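For context on the tokens_per_second numbers above, here is a rough sketch of how such a throughput metric can be computed; this is my own illustration, and the recipe's actual bookkeeping may differ. The idea is just to count the tokens seen since the last log and divide by the elapsed wall-clock time.

```python
import time

class ThroughputTracker:
    """Tracks tokens/sec between logged steps (illustrative sketch, not recipe code)."""

    def __init__(self) -> None:
        self.num_tokens = 0
        self.last_log_time = time.perf_counter()

    def update(self, batch_num_tokens: int) -> None:
        # Accumulate tokens seen since the last log (ideally non-padding tokens only).
        self.num_tokens += batch_num_tokens

    def log(self, step: int) -> None:
        # Tokens per second over the window since the previous log, then reset.
        now = time.perf_counter()
        tokens_per_second = self.num_tokens / (now - self.last_log_time)
        print(f"Step {step} | tokens_per_second:{tokens_per_second}")
        self.num_tokens, self.last_log_time = 0, now
```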
Gradient accumulation doesn't actually change any of the behaviour shown above; it only changes the total length of the steps, since each optimizer step now covers several batches.
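A quick worked example of that, under my assumption that each optimizer step consumes gradient_accumulation_steps batches:

```python
# My own arithmetic, not recipe output: with the settings below, each epoch still
# shows 10 optimizer steps on the progress bar and in the logs, but it now consumes
# 10 * 5 = 50 batches, so each step simply takes longer.
gradient_accumulation_steps = 5
max_steps_per_epoch = 10
batches_per_epoch = max_steps_per_epoch * gradient_accumulation_steps
print(batches_per_epoch)  # 50
```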
Confirmed that this works as expected both with optimizer_in_bwd=False and gradient accumulation on, and with optimizer_in_bwd=True and gradient accumulation off:
tune run full_finetune_single_device --config llama2/7B_full_low_memory optimizer_in_bwd=False gradient_accumulation_steps=5 max_steps_per_epoch=10
tune run full_finetune_single_device --config llama2/7B_full_low_memory
Hello guys, what else do we need here? Can I add these changes to the other recipes?
Sorry for the delay in responding. For the sake of not writing the same reply in 20 places, I'll just link to my comment here. Happy to let you take over that PR or absorb the changes into this one; either is fine with me.
I added the changes you suggested for computing the right number of tokens. I am not sure the GPU metrics are something you want to log very often, as they are expensive to compute. Do you know why the grad accum test doesn't pass?
> I am not sure the GPU metrics are something you want to log very often, as they are expensive to compute.
We could consider adding a flag like log_memory_stats to the config to enable/disable this. What do you think?
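Something along these lines, perhaps. This is only a sketch of the idea: log_memory_stats is the hypothetical flag name from the comment above, and the torch.cuda calls are standard PyTorch queries standing in for whatever stats the recipe actually collects.

```python
import torch

def maybe_log_memory_stats(step: int, log_memory_stats: bool) -> None:
    """Log GPU memory stats only when the opt-in flag is set (illustrative sketch)."""
    if not (log_memory_stats and torch.cuda.is_available()):
        return
    # Standard torch.cuda queries; still worth gating behind a flag if the
    # logging cadence is high or more expensive stats get added later.
    stats = {
        "gpu_mem_allocated_gb": torch.cuda.memory_allocated() / 1e9,
        "gpu_mem_reserved_gb": torch.cuda.max_memory_reserved() / 1e9,
    }
    print(f"Step {step} | " + " ".join(f"{k}:{v}" for k, v in stats.items()))
```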
> Do you know why the grad accum test doesn't pass?
It has to do with how we infer the loss for the test based on the logs (see here). Previously, whenever self.total_training_steps % self._log_every_n_steps == 0, we would log on every iteration of that step, so we would actually get a list of loss values (hence the np.mean in L239 of the code pointer above). With your change we only logged on the final iteration of the step, so the logged value would just be the loss of the last iteration, not the average over all the iterations. This is why I added the running loss instead: it accumulates the losses over each iteration and then logs the mean for the step.
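A rough sketch of that running-loss pattern, reusing the toy setup from the earlier sketch in this thread; again this is my own illustration, not the recipe's exact code.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy data/model so the sketch runs on its own.
dataset = TensorDataset(torch.randn(100, 8), torch.randn(100, 1))
dataloader = DataLoader(dataset, batch_size=1)
model = torch.nn.Linear(8, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

gradient_accumulation_steps, log_every_n_steps = 5, 5
running_loss, total_training_steps = 0.0, 0

for idx, (x, y) in enumerate(dataloader):
    loss = torch.nn.functional.mse_loss(model(x), y) / gradient_accumulation_steps
    running_loss += loss.item()  # accumulate each micro-batch's (scaled) loss
    loss.backward()
    if (idx + 1) % gradient_accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
        total_training_steps += 1
        if total_training_steps % log_every_n_steps == 0:
            # The logged value is the mean over the step's iterations,
            # not just the last iteration's loss.
            print(f"Step {total_training_steps} | loss:{running_loss}")
        running_loss = 0.0
```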
Closing this as it was addressed in #831.