
A simple loop that works with TF_EAGER but crashes with XLA

Open clarkdobson opened this issue 3 years ago • 6 comments

I'm running some simple timing tests comparing performance for Tensor<Float> vs [Float] and ran into some strange behavior. The basic code is below. With device = Device(kind: .CPU, ordinal: 0, backend: .TF_EAGER), everything runs as expected. Tensor results agree exactly with [Float] results, and the code prints |testArray - testTensor|_max = 0.0.

However, with device = Device(kind: .CPU, ordinal: 0, backend: .XLA) and with the parameters below, |testArray - testTensor|_max = 0.001953125. Also, memory usage is much greater. With nLoop >= 1024, the code simply crashes in the error check loop.

Two questions: (1) why does my code crash with XLA, and (2) why is the arithmetic different for XLA?

Thanks in advance!

```swift
//--------------------------------
let tSize = 1024
let nLoop = 512

let testIntArray: [Int] = Array(1...tSize)
let testFloatArray = testIntArray.map { Float($0) }
var testArray = testFloatArray

//let device = Device(kind: .CPU, ordinal: 0, backend: .TF_EAGER)
let device = Device(kind: .CPU, ordinal: 0, backend: .XLA)
var testTensor = Tensor<Float>(shape: [tSize], scalars: testFloatArray, on: device)

for _ in 0..<nLoop {
    testTensor = 0.9999 * testTensor
}

for _ in 0..<nLoop {
    for j in testArray.indices {
        testArray[j] = 0.9999 * testArray[j]
    }
}

var maxLinf: Float = 0.0
for j in testArray.indices {
    let absDiff = abs(testArray[j] - testTensor[j].scalar!)
    if absDiff > maxLinf {
        maxLinf = absDiff
    }
}
print("|testArray - testTensor|_max = ", maxLinf)
//--------------------------------
```

clarkdobson avatar Oct 12 '20 23:10 clarkdobson

A first possibility for the crash: are you running this on macOS? We know that the XLA ComputationClient crashes on that platform for certain model types; we haven't fully tracked down the cause yet.

Beyond that, when you're setting this up via X10, it will attempt to trace through the whole loop, unrolling it all into one long graph. At some point, that graph could become big enough to lead to an out-of-memory error during compilation. Normally, you'd want to cut the trace in such a way that you have smaller, reusable traces, such as by placing a LazyTensorBarrier() after the calculation inside your loop. However, there currently is some tracing and dispatch overhead, so with a tight loop you might not see the kind of performance you would with eager tensors or even simple arrays if the data size is small enough.
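Concretely, the barrier placement would look something like this (a sketch using the variable names from the snippet above):

```swift
// Sketch only: cut the X10 trace once per iteration so the compiler sees
// one small, reusable graph instead of nLoop unrolled multiplications.
for _ in 0..<nLoop {
    testTensor = 0.9999 * testTensor
    LazyTensorBarrier()   // materializes the pending operations here
}
```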

As for the arithmetic, I'm not sure about that. You could try with the above-described LazyTensorBarrier() inside the loop and see if that still happens.

BradLarson avatar Oct 13 '20 02:10 BradLarson

Yes, I'm running this on macOS 10.15.7, although that is not our ultimate target.

The loop tracing out-of-memory scenario sounds likely, given that memory usage grows quickly with loop size and the error does not occur until a certain loop size is reached. It is interesting that the actual crash does not happen until after the loop has executed. Just trying to access testTensor[0] after the loop initiates the crash.

Inserting LazyTensorBarrier() or LazyTensorBarrier(on: device, devices: []) inside the loop causes an exception "Thread 1: EXC_BAD_ACCESS (code=1, address=0x0)" at the same line. Placing the call outside the loop raises the same exception at the first attempt to access a tensor element, regardless of loop size.

The arithmetic issue seems to be independent of the looping. The code below prints 0.0 with the TF_EAGER device, but -3.5767556e-08 with XLA. That is on the order of roundoff error for Float, but one would still expect the same Float computation to produce the same result on both backends.

Thanks very much for your help!

```swift
let testFloat: Float = 1.3
let scalarTensor = Tensor<Float>(1.3, on: device)

print("Difference = ", 0.9999 * testFloat - 0.9999 * scalarTensor)
```
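For what it's worth, a higher-precision reference (a hypothetical diagnostic, not something I've run) might show which backend lands closer to the correctly rounded value:

```swift
// Hypothetical diagnostic: compare each backend's Float result against
// a Double-precision reference for the same computation.
let reference = Double(0.9999) * Double(testFloat)   // Double reference value
let eagerResult = 0.9999 * testFloat                 // host Float arithmetic
print("eager vs. Double reference:", Double(eagerResult) - reference)
```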

clarkdobson avatar Oct 13 '20 15:10 clarkdobson

The LazyTensorBarrier() crash sounds exactly like the problems we've seen before on macOS. The loop-related crash is most likely related to the size of the trace.

Regarding the timing of the loop crash, the tensor tracing for X10 is lazy and is only triggered at the point where a value is needed. That defers JIT compilation to the point where you try to access the result, which is why you see the crash then and not when the loop is enqueuing the tensor operations. This is a way of preserving a straightforward eager-like programming model while being able to dispatch graphs of operations to an optimizing compiler as they are needed.
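As an illustration of that lazy behavior (a sketch, not code from this thread):

```swift
// With the X10 (XLA) backend, tensor operations are recorded lazily.
var t = Tensor<Float>([1, 2, 3], on: device)
for _ in 0..<100 {
    t = 0.9999 * t        // only enqueued into the trace; nothing runs yet
}
let first = t[0].scalar!  // first value access triggers JIT compilation and execution
```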

I started a new Swift Colab notebook, imported TensorFlow, took your first block of code, and inserted a LazyTensorBarrier() inside the testTensor loop. That didn't crash on the Linux runtime there, and produced a value of |testArray - testTensor|_max = 0.0 for both eager and XLA devices.

Your second code snippet does produce the slightly different value for eager vs. XLA backends, however. I don't know where the tiny difference is creeping in, whether it's in the calculation or in the input/output paths. It's worth noting, to see if it appears elsewhere.

BradLarson avatar Oct 13 '20 15:10 BradLarson

Many thanks, your explanation of the JIT compilation makes total sense. I'll try further tests on another platform.

Our target application does involve a fairly tight loop and we did see performance benefits with XLA. We could certainly break up or partially unroll the loop and periodically call LazyTensorBarrier() and hopefully still get a performance gain without the memory problems. Will XLA in the future be able to deal with loops a little more gracefully, or is this a necessary by-product of the graph analysis?
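For reference, the chunking I have in mind would look roughly like this (`barrierEvery` is an invented tuning parameter):

```swift
let barrierEvery = 64   // hypothetical chunk size; tune for memory vs. overhead
for i in 0..<nLoop {
    testTensor = 0.9999 * testTensor
    if (i + 1) % barrierEvery == 0 {
        LazyTensorBarrier()   // cut the trace every barrierEvery iterations
    }
}
```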

The arithmetic thing is obviously not a show stopper, but does seem like unexpected behavior. I'd be interested to know the cause.

clarkdobson avatar Oct 13 '20 16:10 clarkdobson

When testing performance on tight loops, you might see differences between XLA on the CPU side and our current GPU implementation, due to some dispatching overhead in the latter. There are cases where I've seen XLA outperform eager mode on CPU, but not on GPU. Also, I'll point out that our latest nightlies improve performance in some cases (sometimes significantly) for the eager backend, so it's worth experimenting with various configurations for different accelerators and platforms using our latest toolchains. XLA is the only way to access TPUs, so you of course need to use that if targeting them.

For improving performance in tight loops, there's the possibility at some point of exposing the XLA HLO While operation in some form. This would place the loop within the trace itself, leading to very fast performance in a case like this.

In your application that requires a tight loop, what kind of functionality will you need? Do you need much of the capability of our higher-level APIs or do you just need basic parallel computation running on accelerators? If the latter, I can't promise anything but we're working on an experimental approach that may yield much faster results on CPU or GPU for tight loops of simple parallel calculations.

BradLarson avatar Oct 13 '20 17:10 BradLarson

Awesome, good info, I'll try the nightlies.

The loop in question only requires basic functionality, mostly arithmetic, some slicing, and dot products. We are trying to keep the code base as architecture-agnostic as possible, but would certainly want to take advantage of GPUs or TPUs when available. We'll keep our fingers crossed, sounds like you guys have good stuff cooking!

clarkdobson avatar Oct 13 '20 20:10 clarkdobson