YAKL
Potentially unneeded mutex locking affecting OpenMP performance
I have a relatively small test program that runs at a very respectable speed on the GPU, but is ~1000x slower on the OpenMP backend (not the expected behaviour).
From the very scientific method of stopping the program in a debugger lots of times, I've noticed that it keeps hitting the YAKL mutex. This is due to me slicing and passing non-owned Arrays around. These are performant on the GPU, but should be (essentially) mdspans on the CPU. Slices created via .slice incur this penalty, as do all of these temporary arrays when they go out of scope (the mutex is grabbed in deallocate whether or not the reference count is null).
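
For concreteness, here is a rough sketch of the kind of pattern I mean (`state`, `process_column`, and the exact slice arguments are illustrative, not taken from my real program):

```cpp
#include "YAKL.h"
using yakl::Array;
using yakl::COLON;

// Illustrative worker; stands in for whatever acts on one column.
void process_column(Array<double,1,yakl::memHost,yakl::styleC> const &col);

void step(Array<double,2,yakl::memHost,yakl::styleC> const &state, int ncol) {
  for (int j = 0; j < ncol; j++) {
    // Non-owned view of one row/column (slice syntax per the YAKL docs).
    // Cheap on the GPU, but on the OpenMP backend the temporary's
    // destructor still serializes on the global mutex.
    auto col = state.slice<1>(j, COLON);
    process_column(col);
  } // `col` goes out of scope here -> deallocate() -> mutex lock
}
```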
Is this something you'd be interested in adjusting? I'm happy to put some work in, but making slices non-reference-counted would obviously be a significant API change on your end (in terms of lifetimes). Alternatively, I can define my own slicing behaviour and get reasonable performance if you're happy to move the mutex acquisition in deallocate inside a check on the reference count.
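
Roughly the shape of the change I'm imagining in deallocate. This paraphrases the structure as I understand it from the debugger, not YAKL's actual source; `yakl_mtx` and `deallocate_data` stand in for whatever the real internals are called:

```cpp
// Non-owned Arrays (slices) carry a null reference count, so they can skip
// the lock entirely; only owned Arrays ever need to touch the mutex.
void deallocate() {
  if (this->refCount != nullptr) {
    // Grab the mutex only after the null check on the reference count.
    yakl::yakl_mtx.lock();             // name assumed for YAKL's global mutex
    (*this->refCount)--;
    if (*this->refCount == 0) {
      delete this->refCount;
      this->refCount = nullptr;
      this->deallocate_data();         // whatever YAKL's real free path is
    }
    yakl::yakl_mtx.unlock();
  }
  this->myData = nullptr;              // non-owned slices just drop the pointer
}
```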