Drop tracy from CI benchmarks?

Open bjacob opened this issue 11 months ago • 19 comments

Our benchmarks on CI run twice, without and with Tracy. This is a substantial cost and latency hit (not quite 2x, since the Tracy runs use 4 repetitions while the non-Tracy runs use 10). The motivation was that we could download and inspect traces from regressions caught on CI, but does anyone do that in practice? We don't inspect CI benchmark regressions every day, and when we do, we usually just reproduce the regression locally, which is often necessary anyway to resolve it.

@pzread @hanhanW @benvanik @ScottTodd

bjacob avatar Mar 20 '24 19:03 bjacob

Could run an experiment with and without tracing to quantify the cost. For Linux the code to change appears to be https://github.com/openxla/iree/blob/767a6112abffebd896ceb2f0107c0d603c2a338a/build_tools/benchmarks/run_benchmarks_on_linux.py#L87-L110
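For illustration only, here is a minimal sketch of what gating the traced pass behind a flag could look like. All names here (`--skip-traced-benchmarks`, the pass functions) are hypothetical, not from the actual script:

```python
import argparse
import sys


def run_untraced_pass() -> str:
    # Placeholder for the plain timing pass (10 repetitions in CI).
    return "untraced"


def run_traced_pass() -> str:
    # Placeholder for the Tracy-instrumented pass (4 repetitions in CI).
    return "traced"


def run_benchmarks(skip_traced: bool) -> list:
    """Run the benchmark passes, optionally dropping the Tracy one."""
    passes = [run_untraced_pass()]
    if not skip_traced:
        passes.append(run_traced_pass())
    return passes


def parse_args(argv) -> argparse.Namespace:
    parser = argparse.ArgumentParser()
    # Hypothetical flag; the real script's interface may differ.
    parser.add_argument("--skip-traced-benchmarks", action="store_true")
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_args(sys.argv[1:])
    print(run_benchmarks(args.skip_traced_benchmarks))
```

The point of a switch like this is that the expensive traced pass can be turned off per-invocation without deleting any of the tracing code paths.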

ScottTodd avatar Mar 20 '24 19:03 ScottTodd

Another thing we could do would be to set --iree-hal-executable-debug-level=3 to embed source files, since that works now (thanks to https://github.com/openxla/iree/pull/16757). Traces would be more useful with that.

But yeah - not sure if enough people are monitoring the current CI benchmark infra to justify the ongoing costs, and dropping trace collection would likely save quite a bit of time.

ScottTodd avatar Mar 20 '24 19:03 ScottTodd

I'm okay with dropping it. But IMO it is useful in cases where you don't have access to the devices (e.g. Pixel phones, Nvidia cards).

Recently I was debugging the regressions in https://github.com/openxla/iree/pull/16731 and I was able to download the baseline and PR Tracy files to quickly spot the slowest dispatches and find the problems (in this case the regressions were large, so they were easy to spot).

pzread avatar Mar 20 '24 20:03 pzread

the cost should only be in execution time as the same vmfbs are used - I thought we weren't bounded by execution time?

having them nightly/on releases would still be useful, if we have a nightly runner - when you need them you need them, and finding out you need to compare with something older and need to generate them can waste a day

benvanik avatar Mar 20 '24 20:03 benvanik

and yeah, we should probably always have debug info on the bots - it has zero runtime overhead but does introduce a text printing step to compilation - probably worth timing.

benvanik avatar Mar 20 '24 20:03 benvanik

Watch the logs while the benchmarks run and you'll see we are definitely waiting on execution :-) While the vmfbs are shared, there is also the overhead of transferring and storing the traces. Great data point, @pzread; good to hear you actually had a recent use case.

bjacob avatar Mar 20 '24 20:03 bjacob

The other thing we can do is make it opt-in for particular benchmarks. The major benefit to me is looking for improvements/regressions over time that show up in end-to-end performance, and I don't care about having it on every device or every configuration. Not having to check out an ancient branch, get things going, get it on the hardware we measure on, and replicate everything perfectly is important. But one run of each model architecture on each machine is worth it. We've got bigger fish to fry in our process than removing some of the tiny amount of information we actually have.

benvanik avatar Mar 20 '24 20:03 benvanik

Ran the experiment at https://github.com/openxla/iree/pull/16857... found that wholesale dropping all the related code would be a -345 lines of code shrink, and found the following timings from comparing that PR with another PR I ran today against the same main commit (https://github.com/openxla/iree/pull/16855):

| CI job | With tracy capture (#16855) | Without tracy capture (#16857) | Speedup |
| --- | --- | --- | --- |
| run_benchmarks (c2-standard-60, linux, x86_64, 0, 2) | 22 min 2 s | 11 min 46 s | 1.87x |
| run_benchmarks (c2-standard-60, linux, x86_64, 1, 2) | 21 min 1 s | 13 min 36 s | 1.54x |
| run_benchmarks (pixel-6-pro, android, armv8.2-a, 0, 1) | 21 min 12 s | 11 min 5 s | 1.92x |
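As a sanity check on the speedup column, the ratios can be recomputed from the wall times above (values transcribed from the table; small rounding differences against the reported figures are expected):

```python
# (job, with-Tracy seconds, without-Tracy seconds, reported speedup)
rows = [
    ("c2-standard-60 shard 0", 22 * 60 + 2, 11 * 60 + 46, 1.87),
    ("c2-standard-60 shard 1", 21 * 60 + 1, 13 * 60 + 36, 1.54),
    ("pixel-6-pro",            21 * 60 + 12, 11 * 60 + 5, 1.92),
]

for name, with_tracy, without_tracy, reported in rows:
    speedup = with_tracy / without_tracy
    # Recomputed ratios match the table to within rounding.
    assert abs(speedup - reported) < 0.01, (name, speedup)
    print(f"{name}: {speedup:.2f}x")
```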

bjacob avatar Mar 20 '24 21:03 bjacob

What @pzread said: it is useful when you don't have access to the devices. I'd use it to debug Pixel regressions when needed, which mostly happens when I'm on LLVM integrate rotation. I think it is okay to drop it because we can reassign those regressions to ARM and Google folks.

hanhanW avatar Mar 20 '24 21:03 hanhanW

yeah I don't think we care about the pixel phones anymore - I am mostly concerned with losing the ability to look at what fusion decisions/whole architecture decisions we are changing over time. If we can do that on one bot for one configuration (with the option to always add more) and only for the major architectures that'd be enough for me. I don't like the idea of losing coverage entirely, though. We have a really bad story around visibility into memory consumption, latency, and scheduling and the traces are literally all we have.

benvanik avatar Mar 20 '24 21:03 benvanik

What if we only disabled Tracy captures on PR runs?

bjacob avatar Mar 21 '24 14:03 bjacob

> What if we only disabled Tracy captures on PR runs?

That seems reasonable to me. If someone wants to see them on presubmit they could even just edit the code that makes them postsubmit only (if it's a switch localized to 1-2 files).

ScottTodd avatar Mar 21 '24 16:03 ScottTodd

@pzread how do you feel about disabling Tracy on PR runs, or making it an extra opt-in?

ScottTodd avatar Mar 26 '24 16:03 ScottTodd

opt-in on PR via a label and then continuously run on main SGTM
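A minimal sketch of what that gating could look like, assuming the decision is made from GitHub Actions environment variables (`GITHUB_EVENT_NAME` is a real Actions variable; the `PR_LABELS` variable and the `benchmarks:tracy` label name are illustrative, not from the repo):

```python
import os


def should_capture_tracy() -> bool:
    """Capture Tracy traces on postsubmit runs, and on PR runs only
    when an opt-in label is present (label name is hypothetical)."""
    event = os.environ.get("GITHUB_EVENT_NAME", "")
    if event != "pull_request":
        # Postsubmit / nightly runs on main keep continuous coverage.
        return True
    # On PRs, require an explicit opt-in via a label, assumed to be
    # passed in by the workflow as a comma-separated env var.
    labels = os.environ.get("PR_LABELS", "").split(",")
    return "benchmarks:tracy" in labels
```

The workflow would then skip the traced benchmark pass whenever this returns false, so PR runs stay fast by default while main retains traces.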

benvanik avatar Mar 26 '24 16:03 benvanik

Make it optional and off by default on PR runs SGTM

pzread avatar Mar 26 '24 16:03 pzread

It's been a few months with no action taken here. Is anyone using the traces from the benchmark CI? At this point I'd vote to remove Tracy from that path entirely, both to save on infrastructure costs and to make it easier to update the version of Tracy used in the project (we currently need to upload some builds of Tracy to GCS, as in https://github.com/iree-org/iree/pull/15807).

ScottTodd avatar May 13 '24 15:05 ScottTodd

I do occasionally, but am fine with telling people that if they want me to investigate a performance issue they need to get me a trace so 🤷

benvanik avatar May 13 '24 15:05 benvanik

(all for having easier tracy bumps!)

benvanik avatar May 13 '24 15:05 benvanik

Maybe as a first step disable on PR runs as discussed above https://github.com/iree-org/iree/issues/16856#issuecomment-2012396348. Then after a month of that, ask whether to drop altogether.

bjacob avatar May 13 '24 15:05 bjacob

> Maybe as a first step disable on PR runs as discussed above #16856 (comment). Then after a month of that, ask whether to drop altogether.

Unless that benchmarking has a dedicated owner committed to maintaining it, I'd rather cut down the costs ASAP (both maintenance complexity and $$$).

ScottTodd avatar May 13 '24 18:05 ScottTodd

> Ran the experiment at #16857... found that wholesale dropping all the related code would be a -345 lines of code shrink, and found the following timings from comparing that PR with another PR I ran today against the same main commit (#16855):
>
> | CI job | With tracy capture (#16855) | Without tracy capture (#16857) | Speedup |
> | --- | --- | --- | --- |
> | run_benchmarks (c2-standard-60, linux, x86_64, 0, 2) | 22 min 2 s | 11 min 46 s | 1.87x |
> | run_benchmarks (c2-standard-60, linux, x86_64, 1, 2) | 21 min 1 s | 13 min 36 s | 1.54x |
> | run_benchmarks (pixel-6-pro, android, armv8.2-a, 0, 1) | 21 min 12 s | 11 min 5 s | 1.92x |

I'm also noticing that the Android times doubled at some point. pixel-6-pro benchmarks now take around 40 minutes:

  • https://github.com/iree-org/iree/actions/runs/9057839745/job/24883176846
  • https://github.com/iree-org/iree/actions/runs/9066484065/job/24911823046

ScottTodd avatar May 13 '24 21:05 ScottTodd

Oh, I was looking at android-cpu + android-gpu (the default config for postsubmit), while that table was generated on a pull request with android-cpu only. The timings make sense then; no time doubling. Phew.

ScottTodd avatar May 13 '24 21:05 ScottTodd