Halide copy_to_host locks too much

Currently in device_interface.cpp we grab a big global lock in any of the device API calls. E.g. copy_to_device grabs a lock before it does anything to the halide_buffer_t. The device APIs themselves are responsible for being thread-safe, so this isn't to protect internal device API state - it's to protect the halide_buffer_t state.

I think the intent of this lock is to make device api calls on a buffer atomic with respect to other device api calls on that same buffer struct. E.g. if two threads are copying to device at the same time on the same buffer, you don't want to do two device mallocs and leak one of them due to a race. There may also be synchronization issues around setting dirty flags after doing the call that makes those flags correct, and having something on another thread happen in between.

However, no-op copies to device are also grabbing the global lock, so this pointlessly serializes all Halide pipelines in cases where all buffers are already all on device and not dirty on host. This has caused a performance issue in production.

We need to figure out a more performant way to make these device api operations atomic with respect to each other (if that is indeed the intent). We probably can't say "you're not allowed to modify a buffer_t on multiple threads at once, because more will go wrong than just this anyway", because at this point someone might be relying on these locks for correctness. We can probably move the locks inwards so that they only surround the code that actually modifies buffer state though. I think it's ok to check if the copy would be a no-op outside the lock, though we have to be very careful about this because it's double-checked locking. We need to ensure that the check outside the lock can't be confused by a partially-modified state the buffer_t is in on another thread inside the lock.

For copy_to_device the early-out check is buffer->device && !buffer->host_dirty(). So it touches two fields. Yay. I guess we need to consider all potential interleavings of reading these two fields and the three mutating operations done inside the lock (allocating, copying, setting host dirty to false), keeping in mind the compiler's freedom to move the set host dirty call within the locked section, and potential changes future maintainers make within the locked section (they will assume it's locked, so it's fine to have all sorts of weird intermediate states). In this case I think they all work out OK, but it sure is tricky to think about.

Another option would be to not have a global lock, and instead attach the lock to the buffer itself using a bit in the flags field as a cas-loop-based spinlock to protect buffer device state. Maybe that's more robust. In the case we have in practice there would be no contention on this lock.

Eager to hear thoughts from @zvookin in particular.

Mar 06 '25 19:03 abadams

For context, using the example of halide_copy_to_device, this is the lock that's been alluded: https://github.com/halide/Halide/blob/7356c6df9e2bfed9637183a2bd114ce8adba1de8/src/runtime/device_interface.cpp#L210 and these are the most "critical" places where buf needs to be inspected or altered: https://github.com/halide/Halide/blob/7356c6df9e2bfed9637183a2bd114ce8adba1de8/src/runtime/device_interface.cpp#L176 https://github.com/halide/Halide/blob/7356c6df9e2bfed9637183a2bd114ce8adba1de8/src/runtime/device_interface.cpp#L186 https://github.com/halide/Halide/blob/7356c6df9e2bfed9637183a2bd114ce8adba1de8/src/runtime/device_interface.cpp#L191 https://github.com/halide/Halide/blob/7356c6df9e2bfed9637183a2bd114ce8adba1de8/src/runtime/device_interface.cpp#L193

Mar 06 '25 22:03 slomp

One possible solution would be to have a small "associative-set" of mutexes (mutices?) and obtain the corresponding mutex by looking-up into the set using the buffer pointer as a key. This should at the very least reduce contention when the interface is called concurrently on different buffers.

Mar 06 '25 22:03 slomp

Can we duplicate the dirty bits into the top bits of the device field? The device field shouldn't ever go beyond 48 bits used, right? Plenty of bits there to make copy_to_host() only depend on one field to early-out, if we simply duplicate relevant state there. But that would be API breaking I guess... :shrug:

Mar 10 '25 22:03 mcourteaux

We may just be overthinking and being overly pessimistic about concurrency issues here. If we look at the actual user interface: https://github.com/halide/Halide/blob/7356c6df9e2bfed9637183a2bd114ce8adba1de8/src/runtime/HalideRuntime.h#L1601-L1603 there's no mutual exclusion there in set_flags. Unless the runtime itself is doing concurrent stuff in those buffer fields, all bets are off already anyway. And if the runtime has the potential to do concurrent harm, well, we should be able to fix it with some careful memory ordering.

Mar 10 '25 23:03 slomp

Notes from the discussion at the dev meeting:

A potential failure case demonstrating why we need this lock is if we can construct a situation where a pipeline does a device api operation (e.g. copy_to_device) on an input or output buffer inside a parallel loop.
A CAS loop on a bit in the flags field isn't great, because some of these operations are heavy-weight (e.g. copy_to_host blocks until the kernel is complete!) so a waiting thread might spin for a long time.
A CAS loop on a bit in the flags field that gives up after some number of attempts and sleeps on a global condition variable might be fine. It would be signaled in the places the lock is currently released. This approach feels redundant with a lot of the machinery in synchronization_common.h, but I'm not sure how to just leverage that directly.
A reader-writer lock might be nice, but is out of scope for now.
halide_buffer_t::set_flag might need to be a cas loop too, because users could call that directly and right now it's a read-modify-write.

Mar 14 '25 21:03 abadams