parsec
Support for oversubscription broken
Describe the bug
We see the following warnings (followed by asserts in debug mode) when memory on the device is tight:
W@00000 GPU[hip(0)]: Write access to data copy 0x7fbe35bdbb10 [ref_count 1] with existing readers [1024] (possible anti-dependency,
or concurrent accesses), please prevent that with CTL dependencies
The 1024 is suspicious and points us to #575. @therault and I found that the rollback of the CAS is wrong. The CAS is done on an element that we will abandon, and it is only there to block anyone else from taking that element. There is no need to roll back the CAS.
Once we have released the LRU element, we go back to malloc_data. Now there is a pretty good chance that the zone_alloc succeeds. We still have PARSEC_CUDA_DATA_COPY_ATOMIC_SENTINEL as copy_readers_update, which will then be applied to the gpu_elem at the end.
I think it's safe to remove everything to do with copy_readers_update (i.e., the fetch-and-op and all places where we set it) as the readers field in the final gpu_elem does not need to be adjusted.
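To make the sequence above easier to follow, here is a small standalone C model of it. It is only a sketch and not the PaRSEC code: `toy_copy_t`, `toy_claim_victim`, `toy_reserve`, and the `keep_pending_update` flag are hypothetical names, the value 1024 for the sentinel is an assumption taken from the `existing readers [1024]` warning, and the model assumes the pending update is added to the readers count of the final element.

```c
#include <assert.h>
#include <stdatomic.h>

/* Assumed value, inferred from the "existing readers [1024]" warning. */
#define TOY_SENTINEL 1024

/* Hypothetical stand-in for a device data copy; only the readers count. */
typedef struct {
    atomic_int readers;
} toy_copy_t;

/* Claim an LRU victim: CAS readers 0 -> sentinel so nobody else takes it. */
static int toy_claim_victim(toy_copy_t *victim)
{
    int expected = 0;
    return atomic_compare_exchange_strong(&victim->readers, &expected,
                                          TOY_SENTINEL);
}

/*
 * Model of the reserve loop. The first allocation attempt fails, a victim
 * is evicted, and the second attempt succeeds. If keep_pending_update is
 * non-zero, the sentinel is remembered in copy_readers_update and later
 * applied to the freshly allocated element (the behavior described in
 * this issue). Returns the readers count of the fresh element.
 */
static int toy_reserve(toy_copy_t *victim, toy_copy_t *fresh,
                       int keep_pending_update)
{
    int copy_readers_update = 0;
    int zone_has_room = 0;

    for (;;) {
        if (zone_has_room) {
            break;                      /* allocation succeeds */
        }
        int claimed = toy_claim_victim(victim);
        assert(claimed);                /* the model has no contention */
        if (keep_pending_update) {
            copy_readers_update = TOY_SENTINEL;
        }
        zone_has_room = 1;              /* victim released, memory is back */
    }

    /* The fetch-and-op at the end, applied to the final element. */
    atomic_fetch_add(&fresh->readers, copy_readers_update);
    return atomic_load(&fresh->readers);
}
```

In this model, calling `toy_reserve` with `keep_pending_update` set to 1 leaves the fresh element with 1024 readers, which matches the count in the warning. With the flag set to 0 (no `copy_readers_update` at all) the fresh element keeps 0 readers, and the abandoned victim simply stays blocked by the sentinel.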