Metrics reporting
I wonder if we want to add a way of collecting metrics (timing, memory, etc.) to RETURNN. This could be helpful to get insight into why your training/forward is not performing as you would expect. We could report things like precise timings for the various parts of the train step, or dataset buffer fill levels. A rough sketch of what I mean is below.
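Just to make the idea concrete, here is a minimal sketch of what such a collector could look like. All names here (`MetricsCollector`, `timed`, the metric labels) are made up, nothing like this exists in RETURNN yet:

```python
import time
from contextlib import contextmanager


class MetricsCollector:
    """Hypothetical sketch: accumulate named wall-clock timings per train step."""

    def __init__(self):
        self.timings = {}  # name -> accumulated seconds

    @contextmanager
    def timed(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.timings[name] = self.timings.get(name, 0.0) + (time.perf_counter() - start)


# Hypothetical usage inside a train step:
metrics = MetricsCollector()
with metrics.timed("data_load"):
    time.sleep(0.01)  # stand-in for: batch = next(data_iter)
with metrics.timed("forward"):
    time.sleep(0.02)  # stand-in for: loss = model(batch)
print(metrics.timings)  # e.g. {"data_load": 0.01..., "forward": 0.02...}
```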
I wonder whether the people at the chair would care about such a feature at all. I'd expect the trainings there to be more predictable than those at AppTek, perhaps.
We do that already. Timings are collected for various things; e.g., computation time is measured, so you can see whether the dataset is a bottleneck.
For (CPU) memory, there is `watch_memory`, which is very useful. I always use that.
We also have options like `torch_log_memory_usage`.
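For example, enabling those in the config, assuming both are plain boolean config flags:

```python
# Inside the RETURNN config (a Python file).
watch_memory = True            # periodically report (CPU) memory usage
torch_log_memory_usage = True  # log memory usage with the PyTorch backend
```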
So, what more do you want? I think you should make your suggestion here more specific. Let's discuss some very specific things you want to measure.

And does it make sense to really add this to the main training loop code, which potentially adds some overhead and makes the code more complicated? If this is for debugging, profiling, or so, I think some dedicated code to do just that maybe makes more sense. For example, when I want to profile some new model code, I use the Torch profiler and just add that in parts of my model code, or maybe write some custom dummy train loop.
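To illustrate, this is roughly how one can wrap a specific part of the model code with the Torch profiler (`model.encoder` and the label are just placeholders):

```python
from torch.profiler import profile, record_function, ProfilerActivity


def debug_step(model, batch):
    # Profile only this part of the code.
    # Add ProfilerActivity.CUDA to the list when running on GPU.
    with profile(activities=[ProfilerActivity.CPU]) as prof:
        # Mark a region of interest so it shows up as a named entry in the report.
        with record_function("encoder_forward"):  # arbitrary label
            out = model.encoder(batch)  # hypothetical model attribute
    # Print the most expensive ops, aggregated over the profiled region.
    print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
    return out
```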