Dagger.jl icon indicating copy to clipboard operation
Dagger.jl copied to clipboard

Logging is slow

Open m-fila opened this issue 5 months ago • 2 comments

In a single multithreaded process, logging with MultiEventLog brings massive overheads. For 8 threads and 40 graphs ~800 vertices each we observe ~7x application slow down and ~26x increase of lock conflicts. With the overhead this big logging isn't very useful as it considerably alters the application (way more time is spent on logging than on running the actual application)

Without logging

Pipeline (throughput 1.00 events/s): 40.097314 seconds (62.55 M allocations: 3.252 GiB, 3.20% gc time, 128376 lock conflicts, 0.47% compilation time)

With logging

Dagger.enable_logging!(tasknames = true,
                       taskfuncnames = true,
                       taskdeps = true,
                       taskargs = true,
                       taskargmoves = true,
                       taskresult = true,
                       taskuidtotid = true,
                       tasktochunk = true)
Pipeline (throughput 0.14 events/s): 290.967248 seconds (233.25 M allocations: 19.686 GiB, 4.86% gc time, 3346075 lock conflicts, 0.00% compilation time)

m-fila avatar Jun 27 '25 11:06 m-fila

Can you provide an MWE? Also, please try with metrics=false and let me know if this helps significantly - I probably need to just disable that one by default, since it makes OS calls that can be expensive.

jpsamaroo avatar Jun 27 '25 20:06 jpsamaroo

Thanks, indeed metrics=false seems to help significantly. Now it's down to 40% slow down, which is still a bit high

Pipeline (throughput 0.71 events/s): 56.636791 seconds (106.55 M allocations: 8.984 GiB, 7.98% gc time, 1045838 lock conflicts, 0.00% compilation time)

I'll try to return with a MWE later

m-fila avatar Jun 27 '25 21:06 m-fila

#626 has disabled metrics and enabled all_task_deps by default.

Still, it would be worth investigating some improvements to logging; in particular, doing some work to improve type stability, and trying to avoid the context-global log lock, would probably help significantly.

jpsamaroo avatar Jul 03 '25 16:07 jpsamaroo

For timeline profiling (:plots_gantt, :chrome_trace) we switched to NVTX.jl, the overheads seems to be <1%. But still Dagger logging is very helpful for other things as it contains a lot information that won't be accessible otherwise like dependencies between scheduled tasks

m-fila avatar Jul 19 '25 09:07 m-fila