discussion: associate notification version with epoch
The notification mechanism is completely asynchronous for now, as it was initially introduced for catalog updates. However, we've found some cases where we must build an association between the notification version and the system epoch.
Batch query when scaling
We prefer to schedule the scan tasks to the corresponding workers (parallel units) according to the vnode mapping of the Materialize executors. Once barrier and checkpoint decoupling (#4966) is supported, this assumption must still hold. Besides, the in-memory state store e2e (#5322) also tests this.
It's possible for a batch query to be scheduled during scaling: we get an epoch after the scaling, while the vnode mapping from the notification manager has not been updated yet. Since uncheckpointed data is invisible to other compute nodes, we will fail to scan the data of some migrated/rebalanced partitions.
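To make the race concrete, here's a minimal, self-contained sketch of routing scan tasks by a vnode-to-worker mapping and what happens when the mapping is stale. All names here (`WorkerId`, `VNODE_COUNT`, `schedule_scan`) are hypothetical, not RisingWave's actual types.

```rust
use std::collections::HashMap;

// Hypothetical stand-ins for illustration only.
type WorkerId = u32;
const VNODE_COUNT: usize = 256;

/// Group the vnodes of a table by the worker that owns them, so each worker
/// gets one scan task covering exactly its partitions.
fn schedule_scan(vnode_mapping: &[WorkerId; VNODE_COUNT]) -> HashMap<WorkerId, Vec<usize>> {
    let mut tasks: HashMap<WorkerId, Vec<usize>> = HashMap::new();
    for (vnode, worker) in vnode_mapping.iter().enumerate() {
        tasks.entry(*worker).or_default().push(vnode);
    }
    tasks
}

fn main() {
    // Before scaling: all vnodes on worker 1. After scaling: half move to worker 2.
    let old_mapping = [1u32; VNODE_COUNT];
    let mut new_mapping = [1u32; VNODE_COUNT];
    for v in 0..VNODE_COUNT / 2 {
        new_mapping[v] = 2;
    }

    // If the frontend pins a post-scaling epoch but still holds `old_mapping`,
    // the vnodes that migrated to worker 2 are scanned on worker 1, where the
    // uncheckpointed data for those partitions is not visible.
    let stale_plan = schedule_scan(&old_mapping);
    let fresh_plan = schedule_scan(&new_mapping);
    println!("stale plan workers: {:?}", stale_plan.keys().collect::<Vec<_>>());
    println!("fresh plan workers: {:?}", fresh_plan.keys().collect::<Vec<_>>());
}
```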
Batch query when dropping
Similar to the above, we may scan a dropped table whose data is being cleaned up by the storage, due to the asynchronous propagation of the catalog deletion to the frontend. This may cause wrong results or even break some assumptions in batch executors.
Solution
As epoch maintains the sequential consistency of DDLs and configuration changes, we may associate a "minimal notification version" with each epoch. After the frontend pins an epoch for a batch task, it must wait for this version to be synced before scheduling.
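To illustrate the idea, here is a minimal sketch of that wait-before-schedule step. The names (`PinnedSnapshot`, `ObservedVersion`, `schedule_batch_query`) and the plain mutex/condvar machinery are assumptions for the example, not RisingWave's actual API:

```rust
use std::sync::{Condvar, Mutex};

/// What the frontend gets back when it pins an epoch for a batch query.
struct PinnedSnapshot {
    epoch: u64,
    /// The smallest notification version whose catalog / vnode-mapping updates
    /// must be visible when reading at `epoch`.
    min_notification_version: u64,
}

/// Tracks the notification version the frontend has applied locally so far.
struct ObservedVersion {
    version: Mutex<u64>,
    progressed: Condvar,
}

impl ObservedVersion {
    fn new() -> Self {
        Self { version: Mutex::new(0), progressed: Condvar::new() }
    }

    /// Called by the observer whenever a new notification is applied locally.
    fn advance_to(&self, new_version: u64) {
        let mut v = self.version.lock().unwrap();
        if new_version > *v {
            *v = new_version;
            self.progressed.notify_all();
        }
    }

    /// Block until the locally observed version reaches `target`.
    fn wait_for(&self, target: u64) {
        let mut v = self.version.lock().unwrap();
        while *v < target {
            v = self.progressed.wait(v).unwrap();
        }
    }
}

fn schedule_batch_query(snapshot: &PinnedSnapshot, observed: &ObservedVersion) {
    // Ensure catalog and vnode mapping are at least as new as the pinned epoch
    // requires, then schedule as usual.
    observed.wait_for(snapshot.min_notification_version);
    println!("scheduling batch tasks at epoch {}", snapshot.epoch);
}

fn main() {
    let observed = ObservedVersion::new();
    observed.advance_to(42);
    let snapshot = PinnedSnapshot { epoch: 100, min_notification_version: 42 };
    schedule_batch_query(&snapshot, &observed);
}
```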
cc @liurenjie1024 @yezizp2012 @zwang28 @st1page 🤩
Strong +1.
Btw, do we have a case that notification version increases without epoch increases? In other words, can we use epoch as the notification version?
Yes, for those changes not related to materialized views, like create/drop source, create/drop database, or even Hummock version updates?
I think we must find a way to atomically update & read the combination of the Hummock snapshot (epoch) and the vnode mapping. Currently, they're totally independent, both in the frontend and in the meta node when committing the barrier. However, any mismatch between the two will lead to problems (a rough sketch of one way to couple them follows the list):
- Old (current) snapshot, new mapping: we'll `try_wait_epoch` for a `Current` epoch on a wrong partition, which will fail on assertion. https://github.com/risingwavelabs/risingwave/blob/75468b4104ede915588173b9b0373489a5ceb5d2/src/storage/src/hummock/local_version/local_version_manager.rs#L226-L235
- New (current) snapshot, old mapping: some data will be invisible remotely, which leads to wrong results.
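A rough, hypothetical sketch (plain std types, not RisingWave's code) of one way to avoid the mismatch: publish the epoch and the vnode mapping together as a single immutable snapshot, so a reader can never observe a new epoch with an old mapping or vice versa.

```rust
use std::sync::{Arc, RwLock};

// Hypothetical stand-in for illustration only.
type WorkerId = u32;

/// The epoch and the vnode mapping that was in effect when the barrier for
/// that epoch was committed, published as one unit.
struct FrontendSnapshot {
    epoch: u64,
    vnode_mapping: Arc<Vec<WorkerId>>,
}

struct SnapshotCell {
    inner: RwLock<Arc<FrontendSnapshot>>,
}

impl SnapshotCell {
    fn new(initial: FrontendSnapshot) -> Self {
        Self { inner: RwLock::new(Arc::new(initial)) }
    }

    /// Writer (e.g. the notification observer) replaces the whole pair at once.
    fn publish(&self, snapshot: FrontendSnapshot) {
        *self.inner.write().unwrap() = Arc::new(snapshot);
    }

    /// Readers (e.g. batch scheduling) always get a matching epoch + mapping pair.
    fn load(&self) -> Arc<FrontendSnapshot> {
        self.inner.read().unwrap().clone()
    }
}

fn main() {
    let cell = SnapshotCell::new(FrontendSnapshot {
        epoch: 100,
        vnode_mapping: Arc::new(vec![1; 256]),
    });

    // Scaling: the new mapping and the post-scaling epoch are published together.
    cell.publish(FrontendSnapshot {
        epoch: 101,
        vnode_mapping: Arc::new(vec![1, 2].repeat(128)),
    });

    let s = cell.load();
    println!("epoch {} maps {} vnodes", s.epoch, s.vnode_mapping.len());
}
```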
So I guess some major refactoring might be necessary. 🤔 cc @hzxa21 @yezizp2012 @st1page
@fuyufjh Is this closed by #5999?
Nope. They're unrelated. I think this might be closed by #6250.
Can we close it with #7042? cc @zwang28
#7024 addresses the "Batch query when scaling" case.
I'm checking whether the "Batch query when dropping" case is correctly handled. Will open another tracking issue when required.