xla
xla copied to clipboard
[XLA:GPU] Optimize command buffer update by only operating on commands that have pointer change
Do you have any benchmarks for this change? I recall I was running synthetic benchmarks and graph update is so cheap that I decided it's not worth bothering, certainly not worth leaking implemntation details through so many abstraction levels. If it's indeed a significant performance improvement, then "something better" should be baked into se::CommandBuffer
itself
Do you have any benchmarks for this change? I recall I was running synthetic benchmarks and graph update is so cheap that I decided it's not worth bothering, certainly not worth leaking implemntation details through so many abstraction levels. If it's indeed a significant performance improvement, then "something better" should be baked into
se::CommandBuffer
itself
I am still make this fix working, it has some broken. I feel like update cost is worth to triage on some larger models. I will test it how selectively update's benefits, and if so, we can found maybe better implementation.
Sounds good! If it's indeed expensive let's find a solultion.
FYI I remembered that I have microbenchmarks here: https://github.com/openxla/xla/blob/main/xla/stream_executor/gpu/gpu_command_buffer_test.cc#L1274-L1356 - you can add something more representative there, although it's part of SE so it can be only kernel launches (can't depend on XLA)
@shawnwang18 Please let us know when this PR is ready for review. Thanks!