Megatron-LM
[QUESTION] Calculations regarding calculate_per_token_loss parameter
Regarding lines 231-233 of megatron/core/pipeline_parallel/schedules.py, I have two questions:
- Why do we divide by num_tokens when the condition is "if not config.calculate_per_token_loss"? From the name, I would expect the per-token division to happen when the flag is true, not false.
- What is the purpose of dividing by num_microbatches if it is a constant? And if the division is important, why do we not also divide by num_microbatches outside the condition, so that it applies when config.calculate_per_token_loss is true as well?
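To make the question concrete, here is a minimal sketch of the scaling as I understand it. It is not the actual Megatron-LM code: the helper `scale_microbatch_loss` and the toy numbers are hypothetical, and only the branch structure follows what I describe above.

```python
def scale_microbatch_loss(loss_sum, num_tokens, num_microbatches,
                          calculate_per_token_loss):
    """Hypothetical helper mirroring the branch described in this question."""
    if not calculate_per_token_loss:
        # Flag is False: average over this microbatch's tokens,
        # then over the number of microbatches.
        loss_sum = loss_sum / num_tokens
        loss_sum = loss_sum / num_microbatches
    # Flag is True: the summed loss is returned unscaled at this point.
    return loss_sum


# Two toy microbatches: (summed loss, token count).
microbatches = [(12.0, 4), (6.0, 1)]
n = len(microbatches)

flag_off = sum(scale_microbatch_loss(l, t, n, False) for l, t in microbatches)
flag_on = sum(scale_microbatch_loss(l, t, n, True) for l, t in microbatches)

print(flag_off)  # 12/4/2 + 6/1/2 = 4.5
print(flag_on)   # 12 + 6 = 18.0
```

With the flag off, each microbatch contributes equally regardless of its token count; with the flag on, neither division is applied here, which is the part I am trying to understand.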