Dynamic padder breaks on dynamic input
The following PR is being tested to improve dynamic shape functionality on simple models.
The dynamic padder crashes on the %p4.7 = f32[<=20,2]{1,0} parameter(4) op. IIUC, p4.7 represents y_pred. See the HLO dump and the TensorFlow error message below.
$ print("[RUNNING] _get_xla_tensors_hlo([y])\n", torch_xla._XLAC._get_xla_tensors_hlo([y_pred]))
[RUNNING] _get_xla_tensors_hlo([y])
HloModule IrToHlo.32, entry_computation_layout={(f32[1]{0},f32[1,10]{1,0},f32[10]{0},f32[10,2]{1,0},f32[<=20,2]{1,0})->(f32[<=20,1]{1,0})}
ENTRY %IrToHlo.32 (p0.1: f32[1], p1.2: f32[1,10], p2.4: f32[10], p3.5: f32[10,2], p4.7: f32[<=20,2]) -> (f32[<=20,1]) {
%constant.23 = f32[] constant(0.5)
%broadcast.29 = f32[<=20,1]{1,0} broadcast(f32[] %constant.23), dimensions={}
%broadcast.27 = f32[<=20,1]{1,0} broadcast(f32[] %constant.23), dimensions={}
%broadcast.24 = f32[<=20,1]{1,0} broadcast(f32[] %constant.23), dimensions={}
%p4.7 = f32[<=20,2]{1,0} parameter(4)
%p3.5 = f32[10,2]{1,0} parameter(3)
%transpose.6 = f32[2,10]{0,1} transpose(f32[10,2]{1,0} %p3.5), dimensions={1,0}
%dot.8 = f32[<=20,10]{1,0} dot(f32[<=20,2]{1,0} %p4.7, f32[2,10]{0,1} %transpose.6), lhs_contracting_dims={1}, rhs_contracting_dims={0}
%p2.4 = f32[10]{0} parameter(2)
%reshape.9 = f32[1,10]{1,0} reshape(f32[10]{0} %p2.4)
%broadcast.10 = f32[1,10]{1,0} broadcast(f32[1,10]{1,0} %reshape.9), dimensions={0,1}
%reshape.11 = f32[10]{0} reshape(f32[1,10]{1,0} %broadcast.10)
%broadcast.12 = f32[20,10]{1,0} broadcast(f32[10]{0} %reshape.11), dimensions={1}
%add.13 = f32[<=20,10]{1,0} add(f32[<=20,10]{1,0} %dot.8, f32[20,10]{1,0} %broadcast.12)
%constant.14 = f32[] constant(0)
%broadcast.15 = f32[<=20,10]{1,0} broadcast(f32[] %constant.14), dimensions={}
%maximum.16 = f32[<=20,10]{1,0} maximum(f32[<=20,10]{1,0} %add.13, f32[<=20,10]{1,0} %broadcast.15)
%p1.2 = f32[1,10]{1,0} parameter(1)
%transpose.3 = f32[10,1]{0,1} transpose(f32[1,10]{1,0} %p1.2), dimensions={1,0}
%dot.17 = f32[<=20,1]{1,0} dot(f32[<=20,10]{1,0} %maximum.16, f32[10,1]{0,1} %transpose.3), lhs_contracting_dims={1}, rhs_contracting_dims={0}
%p0.1 = f32[1]{0} parameter(0)
%reshape.18 = f32[1,1]{1,0} reshape(f32[1]{0} %p0.1)
%broadcast.19 = f32[1,1]{1,0} broadcast(f32[1,1]{1,0} %reshape.18), dimensions={0,1}
%reshape.20 = f32[1]{0} reshape(f32[1,1]{1,0} %broadcast.19)
%broadcast.21 = f32[20,1]{1,0} broadcast(f32[1]{0} %reshape.20), dimensions={1}
%add.22 = f32[<=20,1]{1,0} add(f32[<=20,1]{1,0} %dot.17, f32[20,1]{1,0} %broadcast.21)
%multiply.25 = f32[<=20,1]{1,0} multiply(f32[<=20,1]{1,0} %broadcast.24, f32[<=20,1]{1,0} %add.22)
%tanh.26 = f32[<=20,1]{1,0} tanh(f32[<=20,1]{1,0} %multiply.25)
%multiply.28 = f32[<=20,1]{1,0} multiply(f32[<=20,1]{1,0} %broadcast.27, f32[<=20,1]{1,0} %tanh.26)
%add.30 = f32[<=20,1]{1,0} add(f32[<=20,1]{1,0} %broadcast.29, f32[<=20,1]{1,0} %multiply.28)
ROOT %tuple.31 = (f32[<=20,1]{1,0}) tuple(f32[<=20,1]{1,0} %add.30)
}
2022-10-14 21:06:20.627148: E 3391276 tensorflow/compiler/xla/service/dynamic_padder.cc:1864] Inserting PadToStatic for instruction: %p4.7 = f32[<=20,2]{1,0} parameter(4)
2022-10-14 21:06:20.664005: F 3391557 tensorflow/tsl/platform/statusor.cc:33] Attempting to fetch value instead of handling error INVALID_ARGUMENT: Executable expected parameter 4 of size 160 but got buffer with incompatible size 168
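For reference, the HLO above corresponds roughly to the module below; this is a hedged reconstruction from the dump, not the actual test code, and the model and variable names are my own:

```python
# Hedged reconstruction of the graph in the HLO dump above (illustrative only).
# A 2 -> 10 -> 1 MLP with ReLU and sigmoid; sigmoid shows up in the dump as
# 0.5 + 0.5 * tanh(0.5 * x). In the failing test the leading dimension of the
# input is dynamic with an upper bound of 20 (f32[<=20,2]); how it becomes
# dynamic depends on the dynamic-shape support under test and is not reproduced
# here, so this sketch just uses a static [20, 2] input.
import torch
import torch.nn as nn
import torch_xla
import torch_xla.core.xla_model as xm

device = xm.xla_device()

model = nn.Sequential(
    nn.Linear(2, 10),   # p3.5 (weight, f32[10,2]) and p2.4 (bias, f32[10])
    nn.ReLU(),          # maximum.16
    nn.Linear(10, 1),   # p1.2 (weight, f32[1,10]) and p0.1 (bias, f32[1])
    nn.Sigmoid(),       # multiply.25 / tanh.26 / multiply.28 / add.30
).to(device)

x = torch.randn(20, 2, device=device)
y_pred = model(x)

# Dump the pending HLO for the output tensor, as in the command above.
print(torch_xla._XLAC._get_xla_tensors_hlo([y_pred]))
```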
CC @Krovatkin @JackCaoG @vanbasten23
I am following up with the TensorFlow/XLA team to find the root cause.
It is complaining that it expects a parameter of size 160 bytes (20 * 2 * 4 bytes), but the parameter passed to the graph is 168 bytes. Is this on TPU?
- That's right. My guess is that the size-20 dimension gets bumped to 21 to reach 168 (21 * 2 * 4 = 168). I wonder if the padder does this.
- Nope, running on CPU.
CC @blakehechtman
According to @blakehechtman, the expected number actually is 168: the wrong shape-size function is being called, one that ignores the dynamism information propagated downstream. This is either a PjRt issue, since that runtime has never had to deal with dynamism so far, or a CPU backend shortcoming.
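For what it's worth, both readings of the numbers add up to 168; the breakdown below is my own interpretation and is not confirmed anywhere in this thread:

```python
# Illustrative size arithmetic only; the source of the extra 8 bytes is an assumption.
f32 = 4                           # bytes per float32 element
data = 20 * 2 * f32               # 160: dense payload of f32[<=20,2] padded to its bound
meta = 2 * 4                      # 8: e.g. one int32 of size metadata per dimension (assumed)
bumped = 21 * 2 * f32             # 168: the "dimension bumped to 21" guess from above
print(data, data + meta, bumped)  # 160 168 168
```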
I will update this thread with the results of my TPU run.
CC @will-cromar
@miladm Are you testing with PJRT_DEVICE=CPU?
> Turns out the wrong shape size function is called by ignoring dynamism information propagated downstream.
Could this comment be related? It wasn't clear to me which shape function is the right one to use for the outputs of PjRtComputationClient::ExecuteComputation. There are two options on PjRtBuffer: on_device_shape and logical_on_device_shape. Here's the difference, according to the docs for logical_on_device_shape:
// Same as on_device_shape when the shape is static. When the shape is
// dynamic, it gathers the metadata from the device and returns a static shape
// representing the logical shape of the data. This approach is identical to
// how tensorflow and xrt setup the output buffer in the graph.
//
// Since this method actually acquires locks and communicate with the device,
// it does not have the const qualifier, similar to what ToLiteral does.
For TFRT (including the current CPU client and the even-more-experimental TFRT TPU client), we have used on_device_shape since afaab1f65eb85a723c66df34fba59d9ef8be87ec; SE still uses logical_on_device_shape. Maybe try rebasing past that commit and see if it solves your issue? If so, I can send a PR to use on_device_shape on both runtime types.
CC @wconstab
When running the model on TPU, I hit a failure that occurs before the code even reaches this line, which is odd. Here is the issue: https://github.com/pytorch/xla/issues/4108. I will revisit this issue afterwards.
Hi Milad (@miladm), is it OK to assign this issue to you?