discussion: associate notification version with epoch
The notification mechanism is completely asynchronous for now, as it was initially introduced for catalog updates. However, we've found some cases where we must build an association between the notification version and the system epoch.
Batch query when scaling
We prefer to schedule the scan tasks to the corresponding workers (parallel units) according to the vnode mapping of the Materialize executors. Once barrier and checkpoint decoupling (#4966) is supported, this assumption must still hold. Besides, the in-memory state store e2e (#5322) also tests this.
It's possible for a batch query to be scheduled during scaling: we get an epoch after the scaling, while the vnode mapping from the notification manager has not been updated yet. Since uncheckpointed data is invisible to other compute nodes, we will fail to scan the data of some migrated/rebalanced partitions.
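To make the race concrete, here's a minimal, self-contained sketch of routing scan tasks by a vnode-to-worker mapping and what happens when the mapping is stale. All names here (`WorkerId`, `VNODE_COUNT`, `schedule_scan`) are hypothetical, not RisingWave's actual types.

```rust
use std::collections::HashMap;

// Hypothetical stand-ins for illustration only.
type WorkerId = u32;
const VNODE_COUNT: usize = 256;

/// Group the vnodes of a table by the worker that owns them, so each worker
/// gets one scan task covering exactly its partitions.
fn schedule_scan(vnode_mapping: &[WorkerId; VNODE_COUNT]) -> HashMap<WorkerId, Vec<usize>> {
    let mut tasks: HashMap<WorkerId, Vec<usize>> = HashMap::new();
    for (vnode, worker) in vnode_mapping.iter().enumerate() {
        tasks.entry(*worker).or_default().push(vnode);
    }
    tasks
}

fn main() {
    // Before scaling: all vnodes on worker 1. After scaling: half move to worker 2.
    let old_mapping = [1u32; VNODE_COUNT];
    let mut new_mapping = [1u32; VNODE_COUNT];
    for v in 0..VNODE_COUNT / 2 {
        new_mapping[v] = 2;
    }

    // If the frontend pins a post-scaling epoch but still holds `old_mapping`,
    // the vnodes that migrated to worker 2 are scanned on worker 1, where the
    // uncheckpointed data for those partitions is not visible.
    let stale_plan = schedule_scan(&old_mapping);
    let fresh_plan = schedule_scan(&new_mapping);
    println!("stale plan workers: {:?}", stale_plan.keys().collect::<Vec<_>>());
    println!("fresh plan workers: {:?}", fresh_plan.keys().collect::<Vec<_>>());
}
```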
Batch query when dropping
Similar to the above, we may scan a dropped table whose data is being cleaned up by the storage, due to the asynchronous propagation of the catalog deletion to the frontend. This may cause wrong results or even break some assumptions in batch executors.
Solution
As epoch maintains the sequential consistency of DDLs and configuration changes, we may associate a "minimal notification version" with each epoch. After the frontend pins an epoch for a batch task, it must wait for this version to be synced before scheduling.
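To illustrate the idea, here is a minimal sketch of that wait-before-schedule step. The names (`PinnedSnapshot`, `ObservedVersion`, `schedule_batch_query`) and the plain mutex/condvar machinery are assumptions for the example, not RisingWave's actual API:

```rust
use std::sync::{Condvar, Mutex};

/// What the frontend gets back when it pins an epoch for a batch query.
struct PinnedSnapshot {
    epoch: u64,
    /// The smallest notification version whose catalog / vnode-mapping updates
    /// must be visible when reading at `epoch`.
    min_notification_version: u64,
}

/// Tracks the notification version the frontend has applied locally so far.
struct ObservedVersion {
    version: Mutex<u64>,
    progressed: Condvar,
}

impl ObservedVersion {
    fn new() -> Self {
        Self { version: Mutex::new(0), progressed: Condvar::new() }
    }

    /// Called by the observer whenever a new notification is applied locally.
    fn advance_to(&self, new_version: u64) {
        let mut v = self.version.lock().unwrap();
        if new_version > *v {
            *v = new_version;
            self.progressed.notify_all();
        }
    }

    /// Block until the locally observed version reaches `target`.
    fn wait_for(&self, target: u64) {
        let mut v = self.version.lock().unwrap();
        while *v < target {
            v = self.progressed.wait(v).unwrap();
        }
    }
}

fn schedule_batch_query(snapshot: &PinnedSnapshot, observed: &ObservedVersion) {
    // Ensure catalog and vnode mapping are at least as new as the pinned epoch
    // requires, then schedule as usual.
    observed.wait_for(snapshot.min_notification_version);
    println!("scheduling batch tasks at epoch {}", snapshot.epoch);
}

fn main() {
    let observed = ObservedVersion::new();
    observed.advance_to(42);
    let snapshot = PinnedSnapshot { epoch: 100, min_notification_version: 42 };
    schedule_batch_query(&snapshot, &observed);
}
```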
cc @liurenjie1024 @yezizp2012 @zwang28 @st1page 🤩
Strong +1.
Btw, do we have a case that notification version increases without epoch increases? In other words, can we use epoch as the notification version?
Yes, for those changes not related to materialized views, like create/drop source, create/drop database, or even Hummock version updates?
I think we must find a way to atomically update & read the combination of the Hummock snapshot (epoch) and the vnode mapping. Currently, they're totally independent, both in the frontend and in the meta node when committing the barrier. However, any mismatch between the two will lead to problems (a rough sketch of one way to couple them follows the list):
- Old (current) snapshot, new mapping: we'll `try_wait_epoch` for a `Current` epoch on a wrong partition, which will fail on assertion. https://github.com/risingwavelabs/risingwave/blob/75468b4104ede915588173b9b0373489a5ceb5d2/src/storage/src/hummock/local_version/local_version_manager.rs#L226-L235
- New (current) snapshot, old mapping: some data will be invisible remotely, which leads to wrong results.
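A rough, hypothetical sketch (plain std types, not RisingWave's code) of one way to avoid the mismatch: publish the epoch and the vnode mapping together as a single immutable snapshot, so a reader can never observe a new epoch with an old mapping or vice versa.

```rust
use std::sync::{Arc, RwLock};

// Hypothetical stand-in for illustration only.
type WorkerId = u32;

/// The epoch and the vnode mapping that was in effect when the barrier for
/// that epoch was committed, published as one unit.
struct FrontendSnapshot {
    epoch: u64,
    vnode_mapping: Arc<Vec<WorkerId>>,
}

struct SnapshotCell {
    inner: RwLock<Arc<FrontendSnapshot>>,
}

impl SnapshotCell {
    fn new(initial: FrontendSnapshot) -> Self {
        Self { inner: RwLock::new(Arc::new(initial)) }
    }

    /// Writer (e.g. the notification observer) replaces the whole pair at once.
    fn publish(&self, snapshot: FrontendSnapshot) {
        *self.inner.write().unwrap() = Arc::new(snapshot);
    }

    /// Readers (e.g. batch scheduling) always get a matching epoch + mapping pair.
    fn load(&self) -> Arc<FrontendSnapshot> {
        self.inner.read().unwrap().clone()
    }
}

fn main() {
    let cell = SnapshotCell::new(FrontendSnapshot {
        epoch: 100,
        vnode_mapping: Arc::new(vec![1; 256]),
    });

    // Scaling: the new mapping and the post-scaling epoch are published together.
    cell.publish(FrontendSnapshot {
        epoch: 101,
        vnode_mapping: Arc::new(vec![1, 2].repeat(128)),
    });

    let s = cell.load();
    println!("epoch {} maps {} vnodes", s.epoch, s.vnode_mapping.len());
}
```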
So I guess some major refactoring might be necessary. 🤔 cc @hzxa21 @yezizp2012 @st1page
@fuyufjh Is this closed by #5999?
Nope. They're unrelated. I think this might be closed by #6250.
Can we close it with #7042? cc @zwang28
#7024 addresses the "Batch query when scaling" case.
I'm checking whether the "Batch query when dropping" case is correctly handled. Will open another tracking issue when required.