
Organise steps logic

Open tcapelle opened this issue 1 year ago • 8 comments

The idea is to show:

  • A progress bar with the actual total step count
  • The same step count logged and reported on the progress bar
  • A training step counted as an optimizer step, to be consistent with other libraries like transformers.Trainer and Axolotl (see the sketch after this list)
  • QPS and grad norm metrics
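For concreteness, a minimal sketch of what I mean by counting a training step as an optimizer step (illustrative names only, not torchtune's actual recipe code): the progress bar total and the step counter both advance once per optimizer update, not once per micro-batch.

from tqdm import tqdm

def train_epoch(dataloader, model, optimizer, loss_fn, grad_accum_steps, max_steps_per_epoch=None):
    # total optimizer steps this epoch -- this is what the progress bar should show
    total_steps = len(dataloader) // grad_accum_steps
    if max_steps_per_epoch is not None:
        total_steps = min(total_steps, max_steps_per_epoch)

    pbar = tqdm(total=total_steps)
    step = 0
    for i, batch in enumerate(dataloader):
        loss = loss_fn(model(batch["tokens"]), batch["labels"]) / grad_accum_steps
        loss.backward()
        if (i + 1) % grad_accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
            step += 1  # one "training step" == one optimizer step
            pbar.update(1)
            if step >= total_steps:
                break
    pbar.close()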

tcapelle avatar Apr 15 '24 16:04 tcapelle

:link: Helpful Links

:test_tube: See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/730

Note: Links to docs will display an error until the docs builds have been completed.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot[bot] avatar Apr 15 '24 16:04 pytorch-bot[bot]

Are these the kinds of QPS metrics you were expecting, @msaroufim?

tcapelle avatar Apr 15 '24 19:04 tcapelle

Running this command:

tune run full_finetune_single_device --config my_dev/tiny_llama.yaml \
  log_every_n_steps=5 \
  max_steps_per_epoch=10 \
  epochs=2

you get a progress bar with only 10 steps:

INFO:torchtune.utils.logging:Dataset and Sampler are initialized.
1|10|Loss: 10.452272415161133: 100%|█████████████████████████████████████████████████████████████████| 10/10 [00:03<00:00,  3.65it/s]
INFO:torchtune.utils.logging:Model checkpoint of size 0.08 GB saved to my_dev/tiny_llama/checkpoints/torchtune_model_0.pt
INFO:torchtune.utils.logging:Recipe checkpoint of size 0.16 GB saved to my_dev/tiny_llama/checkpoints/recipe_state.pt
1|10|Loss: 10.452272415161133: 100%|█████████████████████████████████████████████████████████████████| 10/10 [00:03<00:00,  2.85it/s]
  0%|                                                                                                         | 0/10 [00:00<?, ?it/s]
INFO:torchtune.utils.logging:Model checkpoint of size 0.08 GB saved to my_dev/tiny_llama/checkpoints/torchtune_model_1.pt
2|20|Loss: 10.449804306030273: 100%|█████████████████████████████████████████████████████████████████| 10/10 [00:05<00:00,  1.78it/s]

and you get only 2 logs per epoch:

Step 0 | loss:10.565682411193848 lr:2e-05 gpu_resources:0 tokens_per_second:456.04617469323597 iterations_per_second:1.6523412126566521 grad_norm:16.26054121046421 
Step 5 | loss:10.421597480773926 lr:2e-05 gpu_resources:0 tokens_per_second:620.2041065388681 iterations_per_second:3.828420410733754 grad_norm:23.697011790378774 
Step 10 | loss:10.490856170654297 lr:2e-05 gpu_resources:0 tokens_per_second:376.36752451194593 iterations_per_second:1.344169730399807 grad_norm:15.191259666319432 
Step 15 | loss:10.376498222351074 lr:2e-05 gpu_resources:0 tokens_per_second:1275.976294664742 iterations_per_second:3.2385185143775175 grad_norm:12.678521182631085 

Gradient accumulation steps don't actually change any of this; they only change the total number of steps. A toy illustration of the logging cadence is below.
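With 10 optimizer steps per epoch and log_every_n_steps=5, a counter that only advances on optimizer steps fires the logger twice per epoch (steps 0 and 5, then 10 and 15), no matter what gradient_accumulation_steps is. Plain Python, nothing torchtune-specific:

log_every_n_steps = 5
steps_per_epoch = 10  # optimizer steps, i.e. batches // gradient_accumulation_steps

for epoch in range(2):
    for local_step in range(steps_per_epoch):
        global_step = epoch * steps_per_epoch + local_step
        if global_step % log_every_n_steps == 0:
            print(f"would log metrics at step {global_step}")  # prints 0, 5, 10, 15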

tcapelle avatar Apr 17 '24 11:04 tcapelle

Confirmed that this works as expected with optimizer_in_bwd=True and gradient accumulation off, and optimizer_in_bwd=False and gradient accumulation on:

tune run full_finetune_single_device --config llama2/7B_full_low_memory optimizer_in_bwd=False gradient_accumulation_steps=5 max_steps_per_epoch=10

tune run full_finetune_single_device --config llama2/7B_full_low_memory

RdoubleA avatar Apr 17 '24 21:04 RdoubleA

Hello guys, what else do we need here? Can I add these changes to the other recipes?

tcapelle avatar Apr 22 '24 09:04 tcapelle

Hello guys, what else do we need here? Can I add these changes to the other recipes?

Sorry for the delay in responding. For the sake of not writing the same reply in 20 places, I will just link to my comment here. Happy to let you take over that PR or absorb the changes into this one; either is fine with me.

ebsmothers avatar Apr 22 '24 14:04 ebsmothers

I added the changes you made for running and computing the right number of tokens. I am not sure the GPU metrics are something you want to log very often, as they are expensive to compute. Do you know why the grad accum test doesn't pass?

tcapelle avatar Apr 22 '24 15:04 tcapelle

I am not sure the GPU metrics are something you want to log very often, as they are expensive to compute.

We could consider adding a flag like log_memory_stats to the config to enable/disable this. What do you think?
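Something along these lines (the flag name log_memory_stats is a proposal here, not an existing config key, and gather_metrics is just an illustrative helper): only pay for the CUDA memory queries when the user opts in.

import torch

def gather_metrics(loss, lr, log_memory_stats: bool = False):
    metrics = {"loss": loss, "lr": lr}
    # skip the expensive GPU memory queries unless explicitly enabled
    if log_memory_stats and torch.cuda.is_available():
        metrics["peak_memory_alloc_gb"] = torch.cuda.max_memory_allocated() / 1e9
        metrics["peak_memory_reserved_gb"] = torch.cuda.max_memory_reserved() / 1e9
    return metrics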

Do you know why the grad accum test doesn't pass?

It has to do with how we infer the loss for the test based on the logs (see here). Previously, whenever self.total_training_steps % self._log_every_n_steps == 0, we would log on every iteration of that step, so we would actually get a list of loss values (hence the np.mean in L239 of the above code pointer). With your change we only logged on the final iteration of the step, so the logged value would just be the loss on the last iteration, not the average over all the iterations. This is why I added running loss instead, to accumulate losses over each iteration and then log the mean for the step.
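Roughly, the pattern is the following (a sketch with illustrative names, not the recipe's exact code): accumulate the per-iteration losses within a step and log their mean once per optimizer step, so the test can still recover a per-step average.

def train_with_running_loss(dataloader, model, optimizer, loss_fn, logger,
                            grad_accum_steps, log_every_n_steps):
    running_loss, step = 0.0, 0
    for i, batch in enumerate(dataloader):
        loss = loss_fn(model(batch["tokens"]), batch["labels"]) / grad_accum_steps
        running_loss += loss.item()  # scaled losses sum to the mean micro-batch loss
        loss.backward()
        if (i + 1) % grad_accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
            step += 1
            if step % log_every_n_steps == 0:
                logger.log("loss", running_loss, step)  # mean loss for this step
            running_loss = 0.0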

ebsmothers avatar Apr 22 '24 19:04 ebsmothers

Closing this as it was addressed in #831.

RdoubleA avatar Apr 25 '24 06:04 RdoubleA