reduce the overhead incurred in FetchTensorData
When training a model with multiple workers, two consecutive ExecuteComputation calls are not completely pipelined. We reported this issue here: https://github.com/pytorch/xla/issues/3434 . It discusses two methods, RunPostOrder and FetchTensorData. Most of RunPostOrder was overlapped with the previous execution with the help of this PR: https://github.com/pytorch/xla/pull/3457 . However, FetchTensorData still remained a bottleneck.

The reason is that FetchTensorData runs after the previous execution completes and before the next execution starts, and it does two things: reset each tensor's IR value and create its xla_data. From our experiments, resetting the IR values and creating xla_data for all the tensors in the graph can be expensive as the graph grows in size, and in such cases it becomes a bottleneck. To avoid this, this PR creates a copy of the IR values and the xla_data for all tensors while the previous execution is still running, and then uses that copy of ir_values to build the graph for compilation of the next execution.
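A minimal sketch of the idea (the types and functions below are invented stand-ins, not the actual pytorch/xla code): it only shows how pre-creating the placeholder data while the previous execution is in flight takes the work out of the gap between the two ExecuteComputation calls.

```cpp
// Simplified illustration only: IrValue, XlaData, CreatePlaceholders and
// ExecuteComputation are hypothetical stand-ins for the real pytorch/xla types.
#include <chrono>
#include <future>
#include <iostream>
#include <thread>
#include <vector>

struct IrValue {};  // stand-in for a tensor's pending IR value
struct XlaData {};  // stand-in for the device-data placeholder

// Stand-in for the expensive part of FetchTensorData: one placeholder per
// live tensor, so the cost grows with the graph size.
std::vector<XlaData> CreatePlaceholders(const std::vector<IrValue>& ir_values) {
  std::this_thread::sleep_for(std::chrono::milliseconds(5));  // pretend cost
  return std::vector<XlaData>(ir_values.size());
}

// Stand-in for ExecuteComputation running on the device.
void ExecuteComputation() {
  std::this_thread::sleep_for(std::chrono::milliseconds(20));
}

int main() {
  std::vector<IrValue> ir_values(1000);

  // Before: placeholders are created between the two executions, so the
  // device sits idle while the host does the bookkeeping.
  ExecuteComputation();
  std::vector<XlaData> serial = CreatePlaceholders(ir_values);
  ExecuteComputation();

  // After (the idea in this PR): snapshot the IR values and create the
  // placeholders while the previous execution is in flight, then build the
  // next graph from that copy.
  std::vector<IrValue> ir_copy = ir_values;  // copy taken for the next step
  auto pending = std::async(std::launch::async, CreatePlaceholders,
                            std::cref(ir_copy));
  ExecuteComputation();                              // previous step on device
  std::vector<XlaData> overlapped = pending.get();   // usually already done
  ExecuteComputation();                              // next step starts immediately

  std::cout << serial.size() << " vs " << overlapped.size() << " placeholders\n";
  return 0;
}
```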
Out of curiosity, how much speedup does this PR bring?
This PR makes the already pretty complicated lazy execution code more confusing, so I want to understand how much speedup (and in what circumstances: always, small models, large models, etc.) we are talking about here.
This PR brings a 5-6% speedup on BERT Large. It should give a significant boost when the execution time is comparable to the model tracing and other overheads.
Thanks, I will try to review it today. If you have before/after profiles, that would be super helpful.
I didn't get a chance to profile it on GPU. Will try to get it this week!
I think I understand the intention of this PR.
FetchTensorData is called after the lock because it removes every tensor's IR value and replaces it with a placeholder. These placeholders are then updated after the device execution.
The problem here is that creating the placeholders can be expensive when the graph size is large. What this PR does is pre-create those XLAData objects so the expensive operation can be overlapped with the device execution.
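Roughly, the placeholder swap looks like the sketch below (invented stand-in types, not the actual XLATensor/XLAData implementation): each tensor drops its IR value and points at a not-yet-ready placeholder, and the real device buffer is attached once the execution that produces it completes. The loop over all live tensors is what shows up in the gap between two ExecuteComputation calls.

```cpp
// Illustrative stand-ins only; these are not the real pytorch/xla classes.
#include <memory>
#include <vector>

struct DeviceBuffer {};  // stand-in for a real device allocation

// Placeholder handed out before the execution that produces it completes.
struct XlaData {
  std::shared_ptr<DeviceBuffer> buffer;  // null until the execution finishes
};

struct Tensor {
  bool has_ir_value = true;           // pending IR graph node
  std::shared_ptr<XlaData> xla_data;  // device data (or placeholder)
};

// FetchTensorData-style step: drop the IR value and point each tensor at a
// fresh placeholder. With many live tensors this loop is the expensive part.
std::vector<std::shared_ptr<XlaData>> SwapInPlaceholders(
    std::vector<Tensor>& tensors) {
  std::vector<std::shared_ptr<XlaData>> placeholders;
  placeholders.reserve(tensors.size());
  for (Tensor& t : tensors) {
    auto placeholder = std::make_shared<XlaData>();
    t.has_ir_value = false;    // reset the IR value
    t.xla_data = placeholder;  // tensor now refers to not-yet-ready data
    placeholders.push_back(std::move(placeholder));
  }
  return placeholders;
}

// Called once the device execution completes: attach the real buffers.
void FulfillPlaceholders(std::vector<std::shared_ptr<XlaData>>& placeholders) {
  for (auto& p : placeholders) {
    p->buffer = std::make_shared<DeviceBuffer>();
  }
}

int main() {
  std::vector<Tensor> tensors(1000);
  auto placeholders = SwapInPlaceholders(tensors);  // expensive for big graphs
  FulfillPlaceholders(placeholders);                // after ExecuteComputation
  return 0;
}
```

Pre-creating these placeholders while the previous execution is still on the device takes that loop off the critical path.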
@aws-rhsoln Unless you have a strong preference, I will leave this PR out of the 1.13 release since that will be cut tomorrow. I will be out for a week, so I won't be able to review it next week.
Yeah should be fine, we can put this in the next release.
I tested this change using BERT Large on a GPU device and am seeing a 3% improvement in throughput.
Previously the gap between two ExecuteComputation calls was about 9 ms; it has now been reduced to roughly 2.5 ms.

This mostly LGTM. @alanwaketan, can you take another pass? This PR will help us shrink the gap between executions, but it changes a bunch of APIs. What's the implication for the LTC migration we have?
Changing method signatures will definitely affect the LTC migration plan. I wonder if this change can be applied to the upstream LTC as well? @aws-rhsoln
@alanwaketan yea, I think this change should also be upstreamed to LTC.
Can you point me to the LTC code base? I can take a look.
Here we are: https://github.com/pytorch/pytorch/blob/master/torch/csrc/lazy/core/lazy_graph_executor.h. It should be pretty similar.
Oh cool! Taking a look
Okay, looks like it is very similar to the one we have here. One question: do we need to first merge the LTC changes and then come back and merge this one, or are the two decoupled?
I guess we can first see if the same technique can be applied upstream, i.e., open a pull request there first. If the upstream PR works, it shouldn't matter which pull request gets merged first. We can review upstream LTC PRs too.
Here is the PR for lazy tensor core: https://github.com/pytorch/pytorch/pull/87119/files @alanwaketan
If this PR looks good, can we merge it? @JackCaoG
@alanwaketan does the upstream one LGTY? If the upstream PR aligns its API with this one, I can merge this PR to unblock the AWS team.
Reviewing it.
There are still some unresolved comments on the upstream PR; @aws-rhsoln, can you resolve those? Let me take a last pass over this PR, then I will merge it.
Thanks @aws-rhsoln !
Thanks for the PR. @JackCaoG, did we do any performance analysis on the Google side to verify the perf gains here? CC @chandrasekhard2
No, but I expect the gain to be similar; this is not really device-dependent.