OLMo
Activation logging
Adds logging of activations for all modules.
Updates (@epwalsh):
For each module, we log the activation L2 norm, average, absolute min, and absolute max. These are reduced over all ranks. Note that, the way I have it implemented, the L2 norm is reduced by averaging over ranks. I thought that made the most sense because otherwise the scale of the metric would depend on the world size.
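To make the reduction concrete, here is a minimal pure-Python sketch, not the code in this PR. The helper names `local_activation_stats` and `reduce_over_ranks` are hypothetical, each "rank" is simulated as a plain list of floats, and the choice to reduce the average by averaging, the absolute min by `min`, and the absolute max by `max` is an assumption; only the averaging of the L2 norm is described above.

```python
import math
from typing import Dict, List


def local_activation_stats(activations: List[float]) -> Dict[str, float]:
    """Statistics for one module's flattened output on a single rank (hypothetical helper)."""
    abs_vals = [abs(x) for x in activations]
    return {
        "l2_norm": math.sqrt(sum(x * x for x in activations)),
        "mean": sum(activations) / len(activations),
        "abs_min": min(abs_vals),
        "abs_max": max(abs_vals),
    }


def reduce_over_ranks(per_rank: List[Dict[str, float]]) -> Dict[str, float]:
    """Combine per-rank statistics into one value per metric (hypothetical helper).

    The L2 norm is averaged over ranks, as described above. The reductions
    used for the other three metrics are assumptions made for this sketch.
    """
    world_size = len(per_rank)
    return {
        "l2_norm": sum(s["l2_norm"] for s in per_rank) / world_size,
        "mean": sum(s["mean"] for s in per_rank) / world_size,  # assumes equal-sized shards
        "abs_min": min(s["abs_min"] for s in per_rank),
        "abs_max": max(s["abs_max"] for s in per_rank),
    }


# Simulate identical activations on every rank for two different world sizes.
one_rank = local_activation_stats([3.0, -4.0])
two_ranks = reduce_over_ranks([one_rank] * 2)
four_ranks = reduce_over_ranks([one_rank] * 4)
```

With identical activations on every rank, the averaged L2 norm is the same for `two_ranks` and `four_ranks`. A norm taken over the concatenated activations would instead grow with the square root of the world size, which is the scale dependence the averaging avoids.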
For the small test model I ran, there is only a small hit to throughput. For larger models, we can increase the logging interval if it slows training down too much.