sui
sui copied to clipboard
Shared object sequencing at beginning of epoch is broken (and owned->shared upgrades broken at all times)
The code at https://github.com/MystenLabs/sui/blob/main/crates/sui-core/src/authority/authority_store.rs#L1160 is incorrect:
None => self
// TODO: if we use an eventually consistent object store in the future,
// we must make this read strongly consistent somehow!
.get_latest_parent_entry(*id)?
.map(|(objref, _)| objref.1)
.unwrap_or_else(|| *initial_shared_version),
There are two things happening here, both of which are necessary, but they conflict with each other.
-
When a new validator joins the committee at the beginning of an epoch (or if a returning validator has wiped its sequence tables), it must read the latest version of the object (done here with
get_latest_parent_entry()) in order to determine what version to assign next. There are two problems with this: a. (minor) it assumes the object store allows consistent reads after the end of an epoch. b. (major) get_latest_parent_entry may be updated due to a cert being executed via checkpoint sync (handle_certificate_with_effects) before the first narwhal message is processed. -
When an owned object is upgraded to a shared object, we need to use the
initial_shared_version. But we can't tell that we are in the upgrade path rather than the beginning-of-new-epoch path, so theinitial_shared_versionwill never be consulted in the case of an upgrade - which was the entire purpose of havinginitial_shared_version
@tnowacki @amnn Can we reconsider owned->shared upgrades? Could we possibly ban them?
Potential fixes:
- To handle the case where a new validator joined in the epoch, and first got synced through a checkpoint before seeing the cert from narwhal: if the transaction has already been executed previously (through checkpoint sync path) when we are calling
lock_shared_objects, we could read out the effects, and get the next version of this shared object, and use that as the next scheduled version. This however depends on that object table update is atomic with effect table update. If they are not, then there could be potential data race where the object table (and hence parent sync table) is updated but effects table is not. - To handle the case where owned object got upgraded to shared: we should first load the object from the object store. If it's not there or if it's an owned object, then we use the init version.
I need to page the details back in but I believe (@gdanezis, help me out) the introduction of Lamport timestamps may reintroduce the need for the initial_shared_version check even in the absence of upgrades.
Correct. We still need initial_shared_version to handle the case of a newly created shared object.
Yep, just talked to @mystenmark about this offline (the case where it's needed is that a shared object's initial version won't always be 1 with the introduction of lamport timestamps), but this doesn't prevent us pursuing the idea of disabling owned -> shared upgrades, because this use of initial_shared_version is not going to be inconsistent with the value in parent sync (i.e. if there is a value in parent sync, it will be the initial shared version or greater).
So should we just do more or less the following?
let initial_shared_version = *initial_shared_version;
self
.get_latest_parent_entry(*id)?
// add a max here
.map(|(objref, _)| max(objref.1, initial_shared_version))
.unwrap_or(initial_shared_version),
If the object is already shared, the parent sync version will be >= the initial version. If it is not shared in the parent sync (or is otherwise is not present), it will be < the initial version, so we should take the initial version.
I think we don't need to read the object table then? As @lxfind suggested.
I still don't fully understand this cert/narwhal issue though. A quick chat might be helpful for me.
If the object is already shared, the parent sync version will be >= the initial version. If it is not shared in the parent sync (or is otherwise is not present), it will be < the initial version, so we should take the initial version.
Yeah that sounds correct. Good one.
So @tnowacki's proposal should be compatible with owned -> shared upgrades as well right?
Personally I do prefer we should ban ownd->shared upgrades for now, just to have less things to worry about. We could support it post-mainnet.
I don't have a strong opinion on keeping or banning upgrades, but it doesn't seem like it's used in practice and I haven't heard anyone championing it, so that's fine by me. But if this bug can be fixed with Todd's suggestion without banning it, I think it's worth decoupling that change. That was the motivation for my question.
Maybe we should put in the bug fix and ban it?
Quoting @gdanezis: "The issue is really mundane: we use (ObjectId, version) to do our concurrency control and ensure safety. Owned objects and shared objects use vastly different logics for this (fastpay vs fastpay+consensus). So when we change the type from owned to shared this creates all sorts of issues to ensure we hand over from one mechanism to another safely. The issue does not present itself for wrapped or child objects since they both live on the fastpay side of things in all cases. One day we may work out and have confidence we understand all the corner cases, and allow promotion from owned -> shared. But I would rather remove this complexity right now. So this is about being easier to restrict now, and relax this later rather than getting right immediately."
And does @mystenmark's problem (1) still exist even without upgrades? Do we have a fix for that? (sorry, bit lost on that one)
Should be fixed by #5873 and #5874
I don't think we ever finished fixing case 1.