parsec
Support for oversubscription broken
Describe the bug
We see the following warnings (followed by asserts in debug mode) when memory on the device is tight:
W@00000 GPU[hip(0)]: Write access to data copy 0x7fbe35bdbb10 [ref_count 1] with existing readers [1024] (possible anti-dependency, or concurrent accesses), please prevent that with CTL dependencies
The 1024 is suspicious and points us to #575. @therault and I found that the rollback of the CAS is wrong. The CAS is done on an element that we will abandon, and it is only there to block someone else from taking the element. There is no need to roll back the CAS.
Once we have released the LRU element, we go back to malloc_data. Now there is a pretty good chance that the zone_alloc succeeds. We still have PARSEC_CUDA_DATA_COPY_ATOMIC_SENTINEL as copy_readers_update, which will then be applied to the gpu_elem at the end.
I think it's safe to remove everything to do with copy_readers_update (i.e., the fetch-and-op and all places where we set it) as the readers field in the final gpu_elem does not need to be adjusted.
I don't think this analysis is correct.
- Nobody can take that element. This entire function runs in the context of the thread handling the current device (where the copy is located), so it is protected. What the CAS protects against is another thread trying to use the copy as the source of a device-to-device transfer (this is not about ownership).
- We do not abandon the copy. We detach it from the old master and then repurpose it for another data. Once this is done, the readers count should be 0 again.
- When we go back to malloc_data, the first thing we do is reset copy_readers_update to zero.
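
The disagreement above comes down to whether the pending readers update survives the jump back to malloc_data. Here is a minimal single-threaded C11 sketch of that question. It is not the actual PaRSEC code: copy_t, block_readers, reserve_with_retry and the reset_on_retry flag are hypothetical names, and the SENTINEL value of 1024 is an assumption chosen only to match the [1024] in the warning.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed value, chosen only to match the "[1024]" in the warning. */
#define SENTINEL 1024

/* Hypothetical stand-in for a device data copy; only the readers count is modeled. */
typedef struct { _Atomic int32_t readers; } copy_t;

/* CAS 0 -> SENTINEL: succeeds only when no reader currently uses the copy. */
static bool block_readers(copy_t *victim) {
    int32_t expected = 0;
    return atomic_compare_exchange_strong(&victim->readers, &expected, SENTINEL);
}

/* Two-pass model of the retry: the first pass fails to allocate and evicts an
 * LRU victim, the second pass allocates a fresh element. Returns the readers
 * count left on the fresh element. */
static int32_t reserve_with_retry(bool reset_on_retry) {
    copy_t victim, fresh;
    atomic_init(&victim.readers, 0);
    atomic_init(&fresh.readers, 0);

    int32_t pending = 0;                    /* plays the role of copy_readers_update */
    for (int pass = 0; pass < 2; pass++) {
        if (reset_on_retry) pending = 0;    /* reset at the top of every attempt */
        if (pass == 0) {
            /* Allocation fails: block new readers of the victim, then release it. */
            bool blocked = block_readers(&victim);
            assert(blocked);                /* the victim has no readers in this model */
            if (blocked) pending = SENTINEL;
            atomic_store(&victim.readers, 0);   /* victim detached and repurposed */
            continue;
        }
        /* Allocation succeeds: apply whatever update is still pending. */
        if (pending != 0) atomic_fetch_add(&fresh.readers, pending);
    }
    return atomic_load(&fresh.readers);
}
```

In this model, reserve_with_retry(false) leaves the fresh element with 1024 readers, which is the symptom in the report, while reserve_with_retry(true) leaves it at 0, which is the behavior the reply describes. Which of the two the real code follows has to be checked against the actual retry path.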