zetcd
correctly account for ephemeral node expiration in parent znode stats
Spun off of #88.
zetcd uses the CVersion key's revision and version to compute the znode's Pzxid and CVersion respectively. When a child changes (e.g., creation, deletion), it touches the CVersion key to bump these values. Ephemeral keys, however, expire through etcd lease expiration, so nothing touches the parent's CVersion key when an expired ephemeral key is deleted, and the parent's Pzxid and CVersion do not reflect the deletion.
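To make the gap concrete, here is a minimal sketch of the derivation described above. The `keyMeta` and `parentStat` types and the `touch` and `statFromCVersion` helpers are hypothetical stand-ins, not zetcd's actual code, and the sketch ignores any constant adjustments the real implementation may apply to the revision or version.

```go
package main

import "fmt"

// keyMeta is a hypothetical stand-in for the etcd metadata of a parent's
// CVersion key (not an actual zetcd or etcd type).
type keyMeta struct {
	ModRevision int64 // revision of the last write to the key
	Version     int64 // number of writes since the key was created
}

// parentStat holds the two znode Stat fields derived from the CVersion key.
type parentStat struct {
	Pzxid    int64
	CVersion int32
}

// statFromCVersion derives Pzxid from the key's revision and CVersion from
// the key's version, as the issue describes (simplified).
func statFromCVersion(k keyMeta) parentStat {
	return parentStat{Pzxid: k.ModRevision, CVersion: int32(k.Version)}
}

// touch models the write made to the CVersion key on an explicit child
// creation or deletion at etcd revision rev.
func touch(k keyMeta, rev int64) keyMeta {
	return keyMeta{ModRevision: rev, Version: k.Version + 1}
}

func main() {
	k := keyMeta{ModRevision: 10, Version: 1}
	k = touch(k, 11) // an ephemeral child is created at revision 11

	// The child's lease expires at revision 12. etcd deletes the child key,
	// but nothing touches the CVersion key, so the stats stay at revision 11.
	s := statFromCVersion(k)
	fmt.Printf("Pzxid=%d CVersion=%d\n", s.Pzxid, s.CVersion) // Pzxid=11 CVersion=2
}
```

An explicit delete would have called `touch` a second time; lease expiry has no such hook, which is the behavior this issue is about.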
One possible solution involves extending etcd to associate a transaction with a lease (cf. https://github.com/coreos/etcd/issues/8842). Ideally, each ephemeral key would have a lease transaction that would touch its parent's CVersion key. This is probably expecting too much since it is too invasive on the etcd side; the txn logic would have to permit multiple updates to a key in the same revision and likely require deep mvcc changes. Alternatively, new "deleted ephemeral" keys could be created in the lease txn to mark tombstones for each expired key; the tombstones would then be used for reconciling the fields. Tombstones avoid multi-updates, but would need STM extensions for ranges (a feature request made a few times in the past, but only possible in 3.3+).
An approach with reconciliation but without lease txns: maintain a per-znode list of ephemeral children (elist), a per-ephemeral node key with a matching ephemeral owner (ekey), and a global revision offset key:
- When creating an ephemeral key, add name to elist and create ekey if key does not exist. Wait on reconciliation if already in the elist.
- When computing Stat, fetch the elist and compare with the child keys to detect expiry and wait for reconciliation.
- A reconciliation goroutine watches for ekey deletion events. For each set of deleted ekeys under the same znode, set `CVersion`'s count to the count-1, its zxid to the deletion event zxid and the current revision offset version, remove the keys from the elist, and touch the revision offset key. Notify waiters.
- The revision offset is subtracted from the current zxid to compensate for the extra revisions from reconciliation txns.
- Record the current revision offset in the mtime and ctime keys for computing `mzxid` and `czxid`. Compute via `etcdrev - offset`.
- Record a count and the current revision offset in `CVersion`.
- Compute `CVersion` by adding the stored count value to the key version.
- Compute `PZxid` by using the stored `CVersion` zxid if no changes since last expiry.
- Will need some way to handle losing the reconciliation watch due to compaction.
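The arithmetic in the list above can be sketched in a few pure functions. This is only an illustration under assumptions: the function names are hypothetical, the elist and child keys are modeled as plain string slices rather than etcd keys, and the reconciliation txn itself (including how the stored count is updated and how `PZxid` detects "no changes since last expiry") is left out.

```go
package main

import (
	"fmt"
	"sort"
)

// zxidFromRev maps an etcd revision to a zxid by subtracting the revision
// offset that was current when the value was recorded. This compensates for
// the extra revisions consumed by reconciliation txns.
func zxidFromRev(etcdRev, offset int64) int64 {
	return etcdRev - offset
}

// cversion adds the count stored in the CVersion key's value to the key's
// etcd version.
func cversion(storedCount, keyVersion int64) int32 {
	return int32(storedCount + keyVersion)
}

// expiredChildren returns the names that are listed in the elist but no
// longer have a live child key, i.e. ephemeral children whose lease expired
// and which still await reconciliation.
func expiredChildren(elist, liveChildren []string) []string {
	live := make(map[string]struct{}, len(liveChildren))
	for _, c := range liveChildren {
		live[c] = struct{}{}
	}
	var expired []string
	for _, name := range elist {
		if _, ok := live[name]; !ok {
			expired = append(expired, name)
		}
	}
	sort.Strings(expired) // deterministic order for callers and tests
	return expired
}

func main() {
	fmt.Println(zxidFromRev(120, 3)) // 117
	fmt.Println(cversion(2, 5))      // 7
	fmt.Println(expiredChildren(
		[]string{"a", "b", "c"}, // elist
		[]string{"a", "x"},      // child keys currently present
	)) // [b c]
}
```

When computing Stat, a non-empty result from `expiredChildren` would be the signal to wait for the reconciliation goroutine before returning the parent's fields.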