Add a Partition Store snapshot restore policy
Stacked PRs:
- -> #1999
- #1998
Test Results
5 files ±0   5 suites ±0   2m 32s :stopwatch: -20s
45 tests ±0   44 :white_check_mark: -1   1 :zzz: +1   0 :x: ±0
113 runs -1   111 :white_check_mark: -3   2 :zzz: +2   0 :x: ±0
Results for commit 5c6fc5fe. ± Comparison against base commit d2ca091c.
This pull request removes 2 and adds 2 tests. Note that renamed tests count towards both.
dev.restate.sdktesting.tests.RunRetry ‑ withExhaustedAttempts(Client)
dev.restate.sdktesting.tests.RunRetry ‑ withSuccess(Client)
dev.restate.sdktesting.tests.AwaitTimeout ‑ timeout(Client)
dev.restate.sdktesting.tests.RawHandler ‑ rawHandler(Client)
Testing

Start a restate-server with config:

With this change we introduce an option to restore a snapshot when the partition store is empty. We can test this by dropping the partition column family and restarting restate-server with restore enabled.
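As a rough illustration of what the new policy boils down to (a sketch with illustrative names, not the actual types from this PR), restore only takes effect when it is opted in and the local partition store is missing:

```rust
/// Illustrative stand-in for the new `snapshot-restore-policy` worker option;
/// the real configuration type in the restate codebase may differ.
#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
enum SnapshotRestorePolicy {
    /// Never restore automatically.
    #[default]
    None,
    /// "on-init": if the partition processor starts without a local partition
    /// store, restore the latest available snapshot before applying the log.
    OnInit,
}

/// Only an opted-in policy combined with an empty local store triggers a restore.
fn should_restore_on_startup(policy: SnapshotRestorePolicy, local_store_is_empty: bool) -> bool {
    policy == SnapshotRestorePolicy::OnInit && local_store_is_empty
}

fn main() {
    assert!(should_restore_on_startup(SnapshotRestorePolicy::OnInit, true));
    assert!(!should_restore_on_startup(SnapshotRestorePolicy::None, true));
    assert!(!should_restore_on_startup(SnapshotRestorePolicy::OnInit, false));
}
```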
Create snapshot:
> restatectl snapshots
Partition snapshots
Usage: restatectl snapshots [OPTIONS] <COMMAND>
Commands:
create-snapshot Create [aliases: create]
help Print this message or the help of the given subcommand(s)
Options:
-v, --verbose... Increase logging verbosity
-q, --quiet... Decrease logging verbosity
--table-style <TABLE_STYLE> Which table output style to use [default: compact] [possible values: compact, borders]
--time-format <TIME_FORMAT> [default: human] [possible values: human, iso8601, rfc2822]
-y, --yes Auto answer "yes" to confirmation prompts
--connect-timeout <CONNECT_TIMEOUT> Connection timeout for network calls, in milliseconds [default: 5000]
--request-timeout <REQUEST_TIMEOUT> Overall request timeout for network calls, in milliseconds [default: 13000]
--cluster-controller <CLUSTER_CONTROLLER> Cluster Controller host:port (e.g. http://localhost:5122/) [default: http://localhost:5122/]
-h, --help Print help (see more with '--help')
> restatectl snapshots create -p 1
Snapshot created: snap_12PclG04SN8eVSKYXCFgXx7
The server writes the snapshot on demand:
2024-09-26T07:31:49.261080Z INFO restate_admin::cluster_controller::service
Create snapshot command received
partition_id: PartitionId(1)
on rs:worker-0
2024-09-26T07:31:49.261133Z INFO restate_admin::cluster_controller::service
Asking node to snapshot partition
node_id: GenerationalNodeId(PlainNodeId(0), 3)
partition_id: PartitionId(1)
on rs:worker-0
2024-09-26T07:31:49.261330Z INFO restate_worker::partition_processor_manager
Received 'CreateSnapshotRequest { partition_id: PartitionId(1) }' from N0:3
on rs:worker-9
in restate_core::network::connection_manager::network-reactor
peer_node_id: N0:3
protocol_version: 1
task_id: 32
2024-09-26T07:31:49.264763Z INFO restate_worker::partition::snapshot_producer
Partition snapshot written
lsn: 3
metadata: "/Users/pavel/restate/test/n1/db-snapshots/1/snap_12PclG04SN8eVSKYXCFgXx7/metadata.json"
on rt:pp-1
Sample metadata file: snap_12PclG04SN8eVSKYXCFgXx7/metadata.json
{
  "version": "V1",
  "cluster_name": "snap-test",
  "partition_id": 1,
  "node_name": "n1",
  "created_at": "2024-09-26T07:31:49.264522000Z",
  "snapshot_id": "snap_12PclG04SN8eVSKYXCFgXx7",
  "key_range": {
    "start": 9223372036854775808,
    "end": 18446744073709551615
  },
  "min_applied_lsn": 3,
  "db_comparator_name": "leveldb.BytewiseComparator",
  "files": [
    {
      "column_family_name": "",
      "name": "/000030.sst",
      "directory": "/Users/pavel/restate/test/n1/db-snapshots/1/snap_12PclG04SN8eVSKYXCFgXx7",
      "size": 1267,
      "level": 0,
      "start_key": "64650000000000000001010453454c46",
      "end_key": "667300000000000000010000000000000002",
      "smallest_seqno": 11,
      "largest_seqno": 12,
      "num_entries": 0,
      "num_deletions": 0
    },
    {
      "column_family_name": "",
      "name": "/000029.sst",
      "directory": "/Users/pavel/restate/test/n1/db-snapshots/1/snap_12PclG04SN8eVSKYXCFgXx7",
      "size": 1142,
      "level": 6,
      "start_key": "64650000000000000001010453454c46",
      "end_key": "667300000000000000010000000000000002",
      "smallest_seqno": 0,
      "largest_seqno": 0,
      "num_entries": 0,
      "num_deletions": 0
    }
  ]
}
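For reference, a small serde model is enough to read this file; the sketch below assumes the serde and serde_json crates, with field names mirroring the sample above and types inferred from the sample values rather than taken from the restate source:

```rust
use serde::Deserialize;

// Field names mirror the sample metadata.json above; the numeric types are
// inferred from the sample values, not copied from the restate codebase.
#[derive(Debug, Deserialize)]
struct SnapshotMetadata {
    version: String,
    cluster_name: String,
    partition_id: u64,
    node_name: String,
    created_at: String,
    snapshot_id: String,
    key_range: KeyRange,
    min_applied_lsn: u64,
    db_comparator_name: String,
    files: Vec<SstFileInfo>,
}

#[derive(Debug, Deserialize)]
struct KeyRange {
    start: u64,
    end: u64,
}

#[derive(Debug, Deserialize)]
struct SstFileInfo {
    column_family_name: String,
    name: String,
    directory: String,
    size: u64,
    level: u32,
    start_key: String,
    end_key: String,
    smallest_seqno: u64,
    largest_seqno: u64,
    num_entries: u64,
    num_deletions: u64,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Path is an example; point it at any snapshot's metadata.json.
    let raw = std::fs::read_to_string("metadata.json")?;
    let meta: SnapshotMetadata = serde_json::from_str(&raw)?;
    println!(
        "{} covers partition {} up to LSN {} ({} SST files)",
        meta.snapshot_id,
        meta.partition_id,
        meta.min_applied_lsn,
        meta.files.len()
    );
    Ok(())
}
```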
> restatectl snapshots create -p 0
Optionally, we can also trim the log to prevent replay from Bifrost.
> restatectl logs trim -l 0 -t 1000
With Restate stopped, we drop the partition store:
> rocksdb_ldb drop_column_family --db=../test/n1/db data-0
Using this config:
[worker]
snapshot-restore-policy = "on-init"
When the Restate server comes back up, we can see that it successfully restores from the latest snapshot:
2024-09-27T15:39:27.704350Z INFO restate_partition_store::partition_store_manager
Restoring partition from snapshot
partition_id: PartitionId(0)
snapshot_id: snap_16mzxFw4Ve8MPbfVRKOwBON
lsn: Lsn(9636)
on rt:pp-0
2024-09-27T15:39:27.704415Z INFO restate_partition_store::partition_store_manager
Initializing partition store from snapshot
partition_id: PartitionId(0)
min_applied_lsn: Lsn(9636)
on rt:pp-0
2024-09-27T15:39:27.717951Z INFO restate_worker::partition
PartitionProcessor starting up.
on rt:pp-0
in restate_worker::partition::run
partition_id: 0
I'm not sure how this ties into the bigger picture for partition store recovery, so maybe we should hide the configuration option until we have an end-to-end design specced out. The primary unanswered question is who makes the decision and where the knowledge about the snapshot comes from: one option is the cluster controller passing this information down through the attachment plan; another is that it's self-decided, as you are proposing here.
I can see one fallback strategy which follows your proposal: if we don't have a local partition store, and we didn't get information about a snapshot to restore, then try to fetch one. But I guess we'll need to check the trim point of the log to figure out whether the snapshot we have is good enough before we commit to being a follower or leader.
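A minimal sketch of that fallback decision, using illustrative types and an explicit trim-point check rather than the actual restate crates:

```rust
// Sketch of the fallback discussed above: only consider a snapshot when there
// is no local partition store, and only trust it if the log has not been
// trimmed past the snapshot's applied LSN. All names here are illustrative.

#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct Lsn(u64);

#[derive(Debug)]
struct Snapshot {
    min_applied_lsn: Lsn,
}

#[derive(Debug)]
enum Bootstrap {
    /// A local partition store exists: open it and resume from its applied LSN.
    OpenLocal,
    /// Import the snapshot, then replay the log from its applied LSN onwards.
    RestoreFromSnapshot(Snapshot),
    /// No local state and no snapshot: start empty and replay the full log,
    /// which only works if nothing has been trimmed away yet.
    ReplayFromLogStart,
}

fn decide_bootstrap(
    has_local_store: bool,
    latest_snapshot: Option<Snapshot>,
    log_trim_point: Lsn,
) -> Result<Bootstrap, String> {
    if has_local_store {
        return Ok(Bootstrap::OpenLocal);
    }
    match latest_snapshot {
        // Good enough only if every record after the snapshot is still in the
        // log, i.e. the trim point does not lie beyond the snapshot's LSN.
        Some(snapshot) if log_trim_point <= snapshot.min_applied_lsn => {
            Ok(Bootstrap::RestoreFromSnapshot(snapshot))
        }
        Some(_) => Err("latest snapshot is older than the log trim point".into()),
        None if log_trim_point == Lsn(0) => Ok(Bootstrap::ReplayFromLogStart),
        None => Err("no snapshot available and the log has been trimmed".into()),
    }
}

fn main() {
    // No local store, a snapshot at LSN 9636, and a trim point of 1000:
    // the snapshot is usable, so the processor can restore and then catch up.
    let decision =
        decide_bootstrap(false, Some(Snapshot { min_applied_lsn: Lsn(9636) }), Lsn(1000));
    println!("{decision:?}");
}
```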
In chatting with @tillrohrmann this morning, we figured that it's probably better to park this PR for now until we have a better idea about how the bootstrap process will fit in with the cluster control plane overall. This was useful to demo that restoring partition stores works, but it's likely not the long-term experience we want.
Closing this for now, will reopen with a clearer picture of how the CP will manage these once we get back to worker bootstrap.