Ian Henriksen
They suggested using the per-thread default stream as a potential workaround. I initially didn't think that would work, but since we're already synchronizing the stream at the end of each...
Right, and I'd be somewhat surprised if HIP didn't support this eventually. It's actually a very natural idiom. OTOH, who knows. Sometimes upstream projects still do surprising things.
Yeah, I don't know to what extent HIP is mimicking CUDA vs. fixing its failings.
https://github.com/cupy/cupy/blob/596d1af53b5793d3d52994c8b493ff42be453a8d/cupy_backends/cuda/stream.pyx#L10 seems to imply that this is something we'd have to set _before_ importing cupy. Other than that, it seems like a reasonable short-term fix until we can get https://github.com/numba/numba/issues/6921...
Just having to import parla.gpu before cupy to fix this isn't perfect, but it's *way* better than having to shuttle around stream objects manually.
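For reference, a rough sketch of what that import-order trick could look like. It assumes the flag read at the linked line of `stream.pyx` is the `CUPY_CUDA_PER_THREAD_DEFAULT_STREAM` environment variable and that CuPy only checks it once, at import time. `enable_ptds` is a made-up helper name, not something that exists in parla or cupy.

```python
import os
import sys
import warnings

# Assumption: CuPy reads this variable once, when it is first imported.
PTDS_ENV_VAR = "CUPY_CUDA_PER_THREAD_DEFAULT_STREAM"


def enable_ptds():
    """Hypothetical helper: request per-thread default streams.

    Returns True if the variable was set before cupy was imported,
    False if cupy was already loaded (in which case it is likely too late).
    """
    already_imported = "cupy" in sys.modules
    if already_imported:
        warnings.warn(
            "cupy was imported before PTDS was requested; "
            "the setting will probably be ignored"
        )
    os.environ[PTDS_ENV_VAR] = "1"
    return not already_imported


# Would run at import time of a module like parla.gpu, before any `import cupy`.
set_in_time = enable_ptds()
```

Doing this at module import keeps the ordering requirement in one place: user code only has to import the wrapper module first, and the warning makes the failure mode visible if it doesn't.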
Actually it looks like they still don't support per-thread default streams. For some reason I misread their issue as a pull request. It looks like PTDS is the easier fix...
Relevant discussion in https://github.com/numba/numba/issues/5137. PTDS may be available upstream very soon!
I agree that the per-thread default stream idea isn't ideal in general without a way to set it. On the other hand, it'll be good enough for our use-case regardless...