cs431
[Question] epoch + 3 or epoch + 2?
Hi, I was studying the lecture on epoch-based garbage collection. The lecture proves that when an object is retired at epoch E, it is safe to free it at E + 3, because of the two "happens-before" relations.
I was also looking into the code of crossbeam-epoch. It seems that crossbeam-epoch switched to E + 3 and then reverted to E + 2:
- E + 3 https://github.com/crossbeam-rs/crossbeam/issues/238
- E + 3 https://github.com/crossbeam-rs/crossbeam/pull/416
- E + 2 https://github.com/crossbeam-rs/crossbeam/pull/517
I am not sure if this is the right place to ask, but I am confused about why crossbeam-epoch reverted to E + 2.
Looking carefully at this RFC https://github.com/crossbeam-rs/rfcs/blob/master/text/2017-07-23-relaxed-memory.md and the code of the PR, I think the essential difference is the removal of the SC fence in unlink/push_bag.
Hi, sorry for the late reply.
The essential difference between E+2 and E+3 is that the epoch consensus rule (concurrent epochs may differ by at most 1) doesn't hold in E+2.
Note that in pin, there can be some delay between loading the global epoch (loading the global epoch is essentially an optimization for checking all the other threads' local epochs) and storing the local epoch. During this interval, the global epoch can increase multiple times without taking the thread currently being pinned into account, resulting in 'local epoch < global epoch - 1'. Therefore, if retire tags the garbage with the local epoch, the garbage might be considered expired immediately. E+2 fixes this issue by tagging garbage with the global epoch (and an SC fence).
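To make that window concrete, here is a minimal single-threaded replay of the interleaving. It is only a sketch under stated assumptions, not crossbeam-epoch's actual code: `Collector`, `try_advance`, `is_expired`, `stale_pin_gap`, and the `UNPINNED` sentinel are all hypothetical names, and the advances performed by "other threads" are simply inlined between the two steps of pin.

```rust
use std::sync::atomic::{fence, AtomicUsize, Ordering::SeqCst};

/// Sentinel local epoch meaning "not pinned" (an assumption of this sketch).
const UNPINNED: usize = usize::MAX;

/// Hypothetical global state: just the global epoch counter.
struct Collector {
    global: AtomicUsize,
}

impl Collector {
    /// Advance the global epoch if every *pinned* participant has caught up.
    /// Unpinned participants are skipped, which is what opens the window.
    fn try_advance(&self, locals: &[&AtomicUsize]) -> bool {
        let g = self.global.load(SeqCst);
        for local in locals {
            let l = local.load(SeqCst);
            if l != UNPINNED && l != g {
                return false;
            }
        }
        self.global.compare_exchange(g, g + 1, SeqCst, SeqCst).is_ok()
    }
}

/// Garbage tagged with `tag` is reclaimable once the global epoch is at
/// least `distance` epochs ahead of the tag.
fn is_expired(tag: usize, global: usize, distance: usize) -> bool {
    global >= tag + distance
}

/// Replays the race: returns (local epoch, global epoch) right after pin.
fn stale_pin_gap() -> (usize, usize) {
    let collector = Collector { global: AtomicUsize::new(0) };
    let local = AtomicUsize::new(UNPINNED);

    // Step 1 of pin: load the global epoch.
    let observed = collector.global.load(SeqCst);

    // Meanwhile, other threads advance the epoch. This participant still
    // looks unpinned, so nothing holds the advances back.
    for _ in 0..3 {
        assert!(collector.try_advance(&[&local]));
    }

    // Step 2 of pin: publish the (now stale) epoch, followed by an SC fence.
    local.store(observed, SeqCst);
    fence(SeqCst);

    (local.load(SeqCst), collector.global.load(SeqCst))
}
```

In this replay the participant ends up pinned at epoch 0 while the global epoch is 3. With a distance of 2, `is_expired(0, 3, 2)` holds, so garbage tagged with the local epoch would be reclaimable right away, whereas garbage tagged with the global epoch (`is_expired(3, 3, 2)`) is not.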
On the other hand, pin in E+3 checks that the stored local epoch is not stale (note that this is quite similar to the validation loop in hazard pointers). This enforces the epoch consensus rule. So retire can tag the garbage with the local epoch instead of the global epoch, and no additional synchronization is needed.
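A validated pin along these lines might look like the sketch below. It is an illustration under assumptions, not the code from the PR: `pin_validated` is a hypothetical name, and the `interfere` callback is an artificial hook that runs between publishing the local epoch and re-reading the global one, so that the race can be replayed deterministically on a single thread.

```rust
use std::sync::atomic::{fence, AtomicUsize, Ordering::SeqCst};

/// Hypothetical validated pin: publish the local epoch, then re-read the
/// global epoch and retry until the published value is not stale.
/// Returns the epoch the participant ended up pinned at and the retry count.
fn pin_validated(
    global: &AtomicUsize,
    local: &AtomicUsize,
    mut interfere: impl FnMut(),
) -> (usize, usize) {
    let mut observed = global.load(SeqCst);
    let mut retries = 0;
    loop {
        // Publish the epoch we believe is current.
        local.store(observed, SeqCst);
        fence(SeqCst);

        // Test-only hook standing in for concurrent epoch advancement.
        interfere();

        // Validate: if the global epoch moved on, the published value is
        // stale, so publish the newer one and check again.
        let current = global.load(SeqCst);
        if current == observed {
            return (observed, retries);
        }
        observed = current;
        retries += 1;
    }
}
```

The loop only exits once the published local epoch equals the global epoch it just re-read. Nothing bounds the number of retries if other threads keep advancing the epoch, which is why such a pin is not wait-free.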
The advantage of E+3 is simplicity. As you can see from the slide, its correctness is very intuitive. The correctness proof for E+2, on the other hand, needs somewhat more involved reasoning, as described in Jeehoon's RFC. However, E+3's simplicity comes at the cost of making pin no longer wait-free due to the validation loop.
Then why revert E+3? It caused random segfaults in CI which IIRC weren't reproducible on our machines, and we couldn't figure out why.
Oh, actually it's reproducible on my laptop (Intel). ~~It seems it's not reproducible only on AMD machines.~~
https://github.com/tomtomjhj/crossbeam/commit/4522ab0db5bc43e106b6143b2933760cf20d6c6f seems to fix the issue.
@tomtomjhj would you please upstream the change?