
Time.monotonic makes it impossible to line up distributed traces

Open bobhansen opened this issue 1 year ago • 1 comments

We're using viztracer for lightweight tracing when training PyTorch models. We run in a datacenter where all of the clocks are synchronized to within a few milliseconds. Since viztracer uses only the monotonic clock during tracing (absolutely the correct choice), traces from different machines have wildly different timestamps. Since we can't force the traces to start at the same moment, the --align_combine feature only gets them to within seconds of each other (some improvement!), but I think we can do better.

It would be great to have an option (or a change to the default) to calculate the offset between the system time and the monotonic time during trace save, and shift every timestamp by that difference. That would project the monotonic time into global time (+/- the error of the system clock) and let us compare traces that have been combined.
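A rough sketch of the offset calculation, in plain Python. It assumes the tracer's timestamps share an epoch with `time.monotonic_ns()`, which may not hold for viztracer's internal clock on every platform; the function names are hypothetical and not part of viztracer.

```python
import time


def monotonic_to_wall_offset_ns(samples: int = 5) -> int:
    """Estimate (wall clock - monotonic clock) in nanoseconds.

    The wall clock is read between two monotonic reads, and the sample
    with the narrowest bracket is kept, which limits the error caused by
    the process being descheduled between reads.
    """
    best_offset = 0
    best_width = None
    for _ in range(samples):
        before = time.monotonic_ns()
        wall = time.time_ns()
        after = time.monotonic_ns()
        width = after - before
        if best_width is None or width < best_width:
            best_width = width
            # Pair the wall reading with the midpoint of the bracket.
            best_offset = wall - (before + after) // 2
    return best_offset


def to_wall_us(monotonic_ts_us: float, offset_ns: int) -> float:
    """Project a monotonic timestamp (microseconds) onto the wall clock."""
    return monotonic_ts_us + offset_ns / 1000.0
```

Taking the offset once per process, at save time, keeps the hot path on the monotonic clock. The projected timestamps are only as accurate as the machine's wall clock synchronization.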

If it's something you're interested in, I can look into making a PR.

bobhansen, Aug 08 '23 14:08

What are you looking for to solve this issue? There are a couple of ways to do this.

  1. Post-run edit. This would be the most straightforward way to solve the problem, and you probably do not even need any changes in viztracer. Loop through the events and apply whatever offset you want.
  2. Add an option to pass in an offset, 0 by default, and apply it when saving the trace. This is not too bad, but there will be C code involved, and the trace-saving part is ... hmm, not the cleanest code to follow.
  3. Do it at run time: add the offset when getting the timestamp. This would probably be the easiest since getts is already a function, but I don't want to do this because it hurts performance.
  4. An even more interesting way: add the system clock (or its offset to the monotonic clock) to the metadata, and let the --combine command resolve it. Similar to --align_combine, but with a known offset.
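To make options 1 and 4 concrete, here is a minimal sketch that works on already-loaded trace dictionaries. It assumes the Chrome Trace Event layout (a top-level `traceEvents` list whose timed events carry `ts` in microseconds). The `metadata` / `wall_offset_us` keys are hypothetical: viztracer does not write them today, so treat this as an illustration of the idea rather than existing behavior.

```python
import copy


def shift_events(trace: dict, offset_us: float) -> dict:
    """Option 1: return a copy of the trace with every 'ts' shifted."""
    shifted = copy.deepcopy(trace)
    for event in shifted.get("traceEvents", []):
        # Metadata events (ph == "M") may have no timestamp; skip those.
        if "ts" in event:
            event["ts"] += offset_us
    return shifted


def combine_with_offsets(traces: list, offset_key: str = "wall_offset_us") -> dict:
    """Option 4: merge traces, shifting each by the offset it recorded.

    The offset is read from a hypothetical trace["metadata"][offset_key]
    entry and defaults to 0 when the entry is missing.
    """
    combined = {"traceEvents": []}
    for trace in traces:
        offset_us = trace.get("metadata", {}).get(offset_key, 0)
        shifted = shift_events(trace, offset_us)
        combined["traceEvents"].extend(shifted["traceEvents"])
    return combined
```

The sketch leaves pid/tid values untouched, so traces from different machines that happen to reuse the same process IDs would still need to be disambiguated separately.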

gaogaotiantian, Aug 08 '23 23:08