Add a Partition Store snapshot restore policy
Stacked PRs:
- -> #1999
- #1998
Test Results
5 files ±0   5 suites ±0   2m 32s :stopwatch: -20s
45 tests ±0   44 :white_check_mark: -1   1 :zzz: +1   0 :x: ±0
113 runs -1   111 :white_check_mark: -3   2 :zzz: +2   0 :x: ±0
Results for commit 5c6fc5fe. ± Comparison against base commit d2ca091c.
This pull request removes 2 and adds 2 tests. Note that renamed tests count towards both.
dev.restate.sdktesting.tests.RunRetry ‑ withExhaustedAttempts(Client)
dev.restate.sdktesting.tests.RunRetry ‑ withSuccess(Client)
dev.restate.sdktesting.tests.AwaitTimeout ‑ timeout(Client)
dev.restate.sdktesting.tests.RawHandler ‑ rawHandler(Client)
Testing

Start a restate-server with config:

With this change we introduce an option to restore a snapshot when the partition store is empty. We can test this by dropping the partition column family and restarting restate-server with restore enabled.
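As a rough illustration of what the new policy boils down to (a sketch with illustrative names, not the actual types from this PR), restore only takes effect when it is opted in and the local partition store is missing:

```rust
/// Illustrative stand-in for the new `snapshot-restore-policy` worker option;
/// the real configuration type in the restate codebase may differ.
#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
enum SnapshotRestorePolicy {
    /// Never restore automatically.
    #[default]
    None,
    /// "on-init": if the partition processor starts without a local partition
    /// store, restore the latest available snapshot before applying the log.
    OnInit,
}

/// Only an opted-in policy combined with an empty local store triggers a restore.
fn should_restore_on_startup(policy: SnapshotRestorePolicy, local_store_is_empty: bool) -> bool {
    policy == SnapshotRestorePolicy::OnInit && local_store_is_empty
}

fn main() {
    assert!(should_restore_on_startup(SnapshotRestorePolicy::OnInit, true));
    assert!(!should_restore_on_startup(SnapshotRestorePolicy::None, true));
    assert!(!should_restore_on_startup(SnapshotRestorePolicy::OnInit, false));
}
```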
Create snapshot:
> restatectl snapshots
Partition snapshots
Usage: restatectl snapshots [OPTIONS] <COMMAND>
Commands:
create-snapshot Create [aliases: create]
help Print this message or the help of the given subcommand(s)
Options:
-v, --verbose... Increase logging verbosity
-q, --quiet... Decrease logging verbosity
--table-style <TABLE_STYLE> Which table output style to use [default: compact] [possible values: compact, borders]
--time-format <TIME_FORMAT> [default: human] [possible values: human, iso8601, rfc2822]
-y, --yes Auto answer "yes" to confirmation prompts
--connect-timeout <CONNECT_TIMEOUT> Connection timeout for network calls, in milliseconds [default: 5000]
--request-timeout <REQUEST_TIMEOUT> Overall request timeout for network calls, in milliseconds [default: 13000]
--cluster-controller <CLUSTER_CONTROLLER> Cluster Controller host:port (e.g. http://localhost:5122/) [default: http://localhost:5122/]
-h, --help Print help (see more with '--help')
> restatectl snapshots create -p 1
Snapshot created: snap_12PclG04SN8eVSKYXCFgXx7
The server writes the snapshot on demand:
2024-09-26T07:31:49.261080Z INFO restate_admin::cluster_controller::service
Create snapshot command received
partition_id: PartitionId(1)
on rs:worker-0
2024-09-26T07:31:49.261133Z INFO restate_admin::cluster_controller::service
Asking node to snapshot partition
node_id: GenerationalNodeId(PlainNodeId(0), 3)
partition_id: PartitionId(1)
on rs:worker-0
2024-09-26T07:31:49.261330Z INFO restate_worker::partition_processor_manager
Received 'CreateSnapshotRequest { partition_id: PartitionId(1) }' from N0:3
on rs:worker-9
in restate_core::network::connection_manager::network-reactor
peer_node_id: N0:3
protocol_version: 1
task_id: 32
2024-09-26T07:31:49.264763Z INFO restate_worker::partition::snapshot_producer
Partition snapshot written
lsn: 3
metadata: "/Users/pavel/restate/test/n1/db-snapshots/1/snap_12PclG04SN8eVSKYXCFgXx7/metadata.json"
on rt:pp-1
Sample metadata file: snap_12PclG04SN8eVSKYXCFgXx7/metadata.json
{
  "version": "V1",
  "cluster_name": "snap-test",
  "partition_id": 1,
  "node_name": "n1",
  "created_at": "2024-09-26T07:31:49.264522000Z",
  "snapshot_id": "snap_12PclG04SN8eVSKYXCFgXx7",
  "key_range": {
    "start": 9223372036854775808,
    "end": 18446744073709551615
  },
  "min_applied_lsn": 3,
  "db_comparator_name": "leveldb.BytewiseComparator",
  "files": [
    {
      "column_family_name": "",
      "name": "/000030.sst",
      "directory": "/Users/pavel/restate/test/n1/db-snapshots/1/snap_12PclG04SN8eVSKYXCFgXx7",
      "size": 1267,
      "level": 0,
      "start_key": "64650000000000000001010453454c46",
      "end_key": "667300000000000000010000000000000002",
      "smallest_seqno": 11,
      "largest_seqno": 12,
      "num_entries": 0,
      "num_deletions": 0
    },
    {
      "column_family_name": "",
      "name": "/000029.sst",
      "directory": "/Users/pavel/restate/test/n1/db-snapshots/1/snap_12PclG04SN8eVSKYXCFgXx7",
      "size": 1142,
      "level": 6,
      "start_key": "64650000000000000001010453454c46",
      "end_key": "667300000000000000010000000000000002",
      "smallest_seqno": 0,
      "largest_seqno": 0,
      "num_entries": 0,
      "num_deletions": 0
    }
  ]
}
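For reference, a small serde model is enough to read this file; the sketch below assumes the serde and serde_json crates, with field names mirroring the sample above and types inferred from the sample values rather than taken from the restate source:

```rust
use serde::Deserialize;

// Field names mirror the sample metadata.json above; the numeric types are
// inferred from the sample values, not copied from the restate codebase.
#[derive(Debug, Deserialize)]
struct SnapshotMetadata {
    version: String,
    cluster_name: String,
    partition_id: u64,
    node_name: String,
    created_at: String,
    snapshot_id: String,
    key_range: KeyRange,
    min_applied_lsn: u64,
    db_comparator_name: String,
    files: Vec<SstFileInfo>,
}

#[derive(Debug, Deserialize)]
struct KeyRange {
    start: u64,
    end: u64,
}

#[derive(Debug, Deserialize)]
struct SstFileInfo {
    column_family_name: String,
    name: String,
    directory: String,
    size: u64,
    level: u32,
    start_key: String,
    end_key: String,
    smallest_seqno: u64,
    largest_seqno: u64,
    num_entries: u64,
    num_deletions: u64,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Path is an example; point it at any snapshot's metadata.json.
    let raw = std::fs::read_to_string("metadata.json")?;
    let meta: SnapshotMetadata = serde_json::from_str(&raw)?;
    println!(
        "{} covers partition {} up to LSN {} ({} SST files)",
        meta.snapshot_id,
        meta.partition_id,
        meta.min_applied_lsn,
        meta.files.len()
    );
    Ok(())
}
```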
> restatectl snapshots create -p 0
Optionally, we can also trim the log to prevent replay from Bifrost.
> restatectl logs trim -l 0 -t 1000
With Restate stopped, we drop the partition store:
> rocksdb_ldb drop_column_family --db=../test/n1/db data-0
Using this config:
[worker]
snapshot-restore-policy = "on-init"
When the Restate server comes back up, we can see that it successfully restores from the latest snapshot:
2024-09-27T15:39:27.704350Z INFO restate_partition_store::partition_store_manager
Restoring partition from snapshot
partition_id: PartitionId(0)
snapshot_id: snap_16mzxFw4Ve8MPbfVRKOwBON
lsn: Lsn(9636)
on rt:pp-0
2024-09-27T15:39:27.704415Z INFO restate_partition_store::partition_store_manager
Initializing partition store from snapshot
partition_id: PartitionId(0)
min_applied_lsn: Lsn(9636)
on rt:pp-0
2024-09-27T15:39:27.717951Z INFO restate_worker::partition
PartitionProcessor starting up.
on rt:pp-0
in restate_worker::partition::run
partition_id: 0
I'm not sure how this ties into the bigger picture for partition store recovery, so maybe we should hide the configuration option until we have an end-to-end design specced out. The primary unanswered question is who makes the decision and where the knowledge about the snapshot comes from: one option is the cluster controller passing this information down through the attachment plan; another is that it's self-decided, as you are proposing here.
I can see one fallback strategy which follows your proposal: if we don't have a local partition store, and we didn't get information about a snapshot to restore, then try to fetch one. But I guess we'll need to check the trim point of the log to figure out whether the snapshot we have is good enough before we commit to being a follower or leader.
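A minimal sketch of that fallback decision, using illustrative types and an explicit trim-point check rather than the actual restate crates:

```rust
// Sketch of the fallback discussed above: only consider a snapshot when there
// is no local partition store, and only trust it if the log has not been
// trimmed past the snapshot's applied LSN. All names here are illustrative.

#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct Lsn(u64);

#[derive(Debug)]
struct Snapshot {
    min_applied_lsn: Lsn,
}

#[derive(Debug)]
enum Bootstrap {
    /// A local partition store exists: open it and resume from its applied LSN.
    OpenLocal,
    /// Import the snapshot, then replay the log from its applied LSN onwards.
    RestoreFromSnapshot(Snapshot),
    /// No local state and no snapshot: start empty and replay the full log,
    /// which only works if nothing has been trimmed away yet.
    ReplayFromLogStart,
}

fn decide_bootstrap(
    has_local_store: bool,
    latest_snapshot: Option<Snapshot>,
    log_trim_point: Lsn,
) -> Result<Bootstrap, String> {
    if has_local_store {
        return Ok(Bootstrap::OpenLocal);
    }
    match latest_snapshot {
        // Good enough only if every record after the snapshot is still in the
        // log, i.e. the trim point does not lie beyond the snapshot's LSN.
        Some(snapshot) if log_trim_point <= snapshot.min_applied_lsn => {
            Ok(Bootstrap::RestoreFromSnapshot(snapshot))
        }
        Some(_) => Err("latest snapshot is older than the log trim point".into()),
        None if log_trim_point == Lsn(0) => Ok(Bootstrap::ReplayFromLogStart),
        None => Err("no snapshot available and the log has been trimmed".into()),
    }
}

fn main() {
    // No local store, a snapshot at LSN 9636, and a trim point of 1000:
    // the snapshot is usable, so the processor can restore and then catch up.
    let decision =
        decide_bootstrap(false, Some(Snapshot { min_applied_lsn: Lsn(9636) }), Lsn(1000));
    println!("{decision:?}");
}
```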
In chatting with @tillrohrmann this morning, we figured that it's probably better to park this PR for now until we have a better idea about how the bootstrap process will fit in with the cluster control plane overall. This was useful to demo that restoring partition stores works, but it's likely not the long-term experience we want.
Closing this for now, will reopen with a clearer picture of how the CP will manage these once we get back to worker bootstrap.