
Unexpected memory usage by the view op

Open · yitongh opened this issue on Feb 23, 2024 · 2 comments

🐛 Bug

View operators result in unnecessary memory usage.

To Reproduce

When running the following code, the memory usage is 1024 * 4 * 3 bytes (three f32[1024]-sized buffers). This memory usage can be observed in the BFC allocator statistics or in the buffer assignment.

import gc
import torch
import torch_xla
import torch_xla.core.xla_model as xm
device = xm.xla_device()
x = torch.randn(1024, requires_grad=False).to(device)

a = x[0:1024]       # slice over the full tensor (a view op)
a = a.view(2, 512)  # reshape of that slice (also a view op)
gc.collect()
xm.mark_step()
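
For readers who want to reproduce the observation without digging through BFC allocator logs, here is a minimal sketch. It leans on the internal torch_xla._XLAC._get_xla_tensors_hlo helper (which prints the pending HLO for the given tensors) and on xm.get_memory_info; both exist in current torch_xla builds, but the exact keys returned by get_memory_info vary by runtime and version, so treat this as a diagnostic sketch rather than an exact recipe.

import torch
import torch_xla
import torch_xla.core.xla_model as xm

device = xm.xla_device()
x = torch.randn(1024, requires_grad=False).to(device)
a = x[0:1024].view(2, 512)

# Print the HLO that will be compiled for the pending tensors
# (internal API, subject to change between releases).
print(torch_xla._XLAC._get_xla_tensors_hlo([a]))

xm.mark_step()

# Device memory counters after execution; key names (e.g. bytes_used
# vs. kb_free) depend on the runtime backend and torch_xla version.
print(xm.get_memory_info(device))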

The HLO graph is:

ENTRY SyncTensorsGraph.5 {
  p0.1 = f32[1024]{0} parameter(0)
  slice.2 = f32[1024]{0} slice(p0.1), slice={[0:1024]}
  reshape.3 = f32[2,512]{1,0} reshape(slice.2)
  ROOT tuple.4 = (f32[1024]{0}, f32[1024]{0}, f32[2,512]{1,0}) tuple(p0.1, slice.2, reshape.3)
}

Expected behavior

slice and reshape should be implemented as view operators in XLA. In this case, slice and reshape should not be outputs of the graph, and even if they are outputs, they should not consume additional memory; they should reuse the memory of x.
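
For contrast (not part of the original report), this is the buffer-level behavior eager PyTorch already provides and that the issue expects XLA's buffer assignment to mirror: the slice and the reshape are views that reuse x's storage.

import torch

x = torch.randn(1024)
a = x[0:1024]       # slice: a view of x
b = a.view(2, 512)  # reshape of a contiguous view: still a view of x

# All three tensors share the same underlying memory, so no extra
# storage is allocated for a or b.
assert a.data_ptr() == x.data_ptr()
assert b.data_ptr() == x.data_ptr()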

Additional context

When running the following code:

import gc
import torch
import torch_xla
import torch_xla.core.xla_model as xm
device = xm.xla_device()
x = torch.randn(1024, requires_grad=False).to(device)

a = x[0:1024]
a = a.view(2, 512)
dummy = torch.zeros(1, dtype=a.dtype, device=a.device)
# Swap the XLA tensors behind a and x for dummy (internal API), so the
# original values should no longer be needed as graph outputs.
torch_xla._XLAC._replace_xla_tensor(a, dummy)
torch_xla._XLAC._replace_xla_tensor(x, dummy)

gc.collect()
xm.mark_step()

The resulting HLO is:

ENTRY SyncTensorsGraph.7 {
  p0.1 = f32[1024]{0} parameter(0)
  slice.2 = f32[1024]{0} slice(p0.1), slice={[0:1024]}
  constant.3 = f32[] constant(0)
  reshape.4 = f32[1]{0} reshape(constant.3)
  broadcast.5 = f32[1]{0} broadcast(reshape.4), dimensions={0}
  ROOT tuple.6 = (f32[1024]{0}, f32[1]{0}) tuple(slice.2, broadcast.5)
} // SyncTensorsGraph.7

This is also quite strange. The intermediate variable slice should not be an output of the graph.
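
A quick diagnostic that supports this point (an editorial sketch, not part of the original repro): scanning the garbage collector's tracked tensors right before mark_step shows that no Python-side tensor with the slice's shape survives, only x itself. Accessing .shape does not trigger execution on XLA tensors, so the check is cheap.

import gc
import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()
x = torch.randn(1024, requires_grad=False).to(device)
a = x[0:1024]
a = a.view(2, 512)  # the 1-D slice tensor is rebound and becomes garbage
gc.collect()

# Look for any live Python tensor that still has the slice's shape.
live_1d = [t for t in gc.get_objects()
           if isinstance(t, torch.Tensor) and tuple(t.shape) == (1024,)]
print(len(live_1d))  # expected: 1 (x itself); the slice object is gone

xm.mark_step()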

Environment

  • Reproducible on XLA backend [CPU/TPU/CUDA]: CUDA 12.1
  • torch_xla version: master 2bbc9a49408b7e3394fd81f222e421d9871f1aa0

yitongh · Feb 23, 2024

This is kind of expected. There is no way to represent "these HLOs share the same storage" in HLO semantics; we can only alias the input and output buffers. In cases where a view op changes the shape of the original tensor, it has to allocate a new buffer.
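
For reference, and as a hand-written illustration rather than an actual dump: the only aliasing HLO can express is a module-level input_output_alias annotation that ties a whole output buffer to a whole parameter buffer, roughly like the sketch below (the may-alias entry for output {0} is hypothetical).

HloModule SyncTensorsGraph.5, input_output_alias={ {0}: (0, {}, may-alias) }

ENTRY SyncTensorsGraph.5 {
  p0.1 = f32[1024]{0} parameter(0)
  slice.2 = f32[1024]{0} slice(p0.1), slice={[0:1024]}
  reshape.3 = f32[2,512]{1,0} reshape(slice.2)
  ROOT tuple.4 = (f32[1024]{0}, f32[1024]{0}, f32[2,512]{1,0}) tuple(p0.1, slice.2, reshape.3)
}

At most one output can reuse a given parameter's buffer, and the aliasing granularity is an entire buffer, so the other tuple elements still need their own allocations; that is the limitation described above for shape-changing views.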

JackCaoG · Feb 26, 2024

@JackCaoG Yes, I am aware that view operations cannot be expressed in HLO, but with the introduction of aliasing, these operators should use the same buffer during buffer assignment. As shown in the buffer assignment below, three buffers of size 4096 are used.

BufferAssignment:
allocation 0: size 4096, parameter 0, shape |f32[1024]| at ShapeIndex {}, maybe-live-out:
 value: <12 p0.1 @0> (size=4096,offset=0): f32[1024]{0}
allocation 1: size 4096, maybe-live-out:
 value: <16 copy_fusion{1} @0> (size=4096,offset=0): f32[1024]{0}
allocation 2: size 4096, maybe-live-out:
 value: <17 copy_fusion{2} @0> (size=4096,offset=0): f32[1024]{0}
allocation 3: size 32, output shape is |(f32[1024], f32[1024], f32[2,512], f32[1])|, maybe-live-out:
 value: <18 tuple.1{} @0> (size=32,offset=0): (f32[1024]{0}, f32[1024]{0}, f32[2,512]{1,0}, f32[1]{0})
allocation 4: size 4, constant:
 value: <13 constant @0> (size=4,offset=0): f32[1]{0}
allocation 5: size 4, maybe-live-out:
 value: <15 copy_fusion{0} @0> (size=4,offset=0): f32[1]{0}
allocation 6: size 24, preallocated-temp:
 value: <14 copy_fusion{} @0> (size=24,offset=0): (f32[1]{0}, f32[1024]{0}, f32[1024]{0})

Total bytes used: 12352 (12.1KiB)

Additionally, why would the intermediate temporary variable 'slice' be an output of the graph? On the Python side, 'slice' (x[0:1024]) should already have been destroyed by the time mark_step is called.

yitongh · Feb 27, 2024