opensearch-k8s-operator
[Proposal] Add snapshot capability to the OpenSearch cluster.
Background:
As part of the next roadmap phase, snapshot capability will be added to the operator so that OpenSearch cluster snapshots can be set up through the cluster's base YAML configuration.
Proposal:
snapshot:
  type: s3
  snapshot_repo: <CUSTOM_NAME>
  ##User required settings to connect to s3
  settings:
    bucket: my-bucket
    another_setting: setting-value
    region: us-east-1
    base_path: os-snapshot
Design
With the custom plugin installation capability, it is now possible to install repository-s3 (pluginsList: ["repository-s3"]). Using this plugin, the first step is to add snapshot capability to the operator so that snapshots are stored in an s3 bucket.
Once a snapshot of type s3 is added to the YAML together with the user-configured settings, the operator will invoke an API call against the cluster to register the repository.
Example
curl -XPUT "https://localhost:9200/_snapshot/my_s3_repository_1?pretty" -H 'Content-Type: application/json' -d '{ "type": "s3", "settings": { "bucket": "opensearch-s3-snapshot", "region": "us-east-1", "base_path": "os-snapshot" } }'
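A minimal sketch of how the proposed `snapshot` block could be translated into the register-repository call above. It only builds the request (method, path, JSON body) and does not send it. The function name `build_register_request` and the dict layout are illustrative assumptions, not the operator's actual code.

```python
import json
from typing import Tuple
from urllib.parse import quote


def build_register_request(snapshot_cfg: dict) -> Tuple[str, str, str]:
    """Return (method, path, body) for the _snapshot register-repository call.

    `snapshot_cfg` mirrors the proposed YAML keys: type, snapshot_repo, settings.
    (Hypothetical helper; the real operator may structure this differently.)
    """
    repo_name = snapshot_cfg["snapshot_repo"]
    body = {
        "type": snapshot_cfg["type"],
        # Settings are passed through unchanged, as the proposal describes.
        "settings": dict(snapshot_cfg.get("settings", {})),
    }
    # Percent-encode the repository name so it is safe as a URL path segment.
    path = "/_snapshot/" + quote(repo_name, safe="")
    return "PUT", path, json.dumps(body, sort_keys=True)


example_cfg = {
    "type": "s3",
    "snapshot_repo": "my_s3_repository_1",
    "settings": {
        "bucket": "opensearch-s3-snapshot",
        "region": "us-east-1",
        "base_path": "os-snapshot",
    },
}
method, path, body = build_register_request(example_cfg)
print(method, path)
print(body)
```

Keeping the request construction separate from the HTTP call would make it easy to unit-test the mapping from YAML to API payload.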
Assumptions:
- The repository-s3 plugin is pre-installed by the user.
- The s3 bucket is created and the recommended s3 permissions are handled by the user. More details: https://opensearch.org/docs/1.2/opensearch/snapshot-restore/
Sample AWS IAM policy to be added to the node role
{
  "Statement": [
    {
      "Action": [
        "s3:ListBucket",
        "s3:GetBucketLocation",
        "s3:ListBucketMultipartUploads",
        "s3:ListBucketVersions"
      ],
      "Effect": "Allow",
      "Resource": [
        "arn:aws:s3:::<BUCKET_NAME>"
      ]
    },
    {
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject",
        "s3:AbortMultipartUpload",
        "s3:ListMultipartUploadParts"
      ],
      "Effect": "Allow",
      "Resource": [
        "arn:aws:s3:::<BUCKET_NAME>/*"
      ]
    }
  ],
  "Version": "2012-10-17"
}
Snapshot Management and Shared responsibilities
- The base snapshot setup will be created by the operator.
- Users can create a custom CronJob to trigger snapshots periodically, for example:
PUT _snapshot/my_s3_repository_1/%3Csnapshot-%7Bnow%2Fd%7D%3E
apiVersion: batch/v1
kind: CronJob
metadata:
  name: opensearch-snapshot-cron
spec:
  schedule: "@daily"
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: opensearch-snapshot-cron
            image: centos:7
            command:
            - /bin/bash
            args:
            - -c
            ## The following can be better handled with jq to fetch the {"accepted":true}
            - 'curl -s -i -k -u admin:admin -XPUT "https://<SERVICE_NAME>:<PORT>/_snapshot/my_s3_repository_1/%3Csnapshot-%7Bnow%2Fd%7D%3E"'
          restartPolicy: OnFailure
- Users can use the snapshot management API to manage policies for the time and frequency of automatic snapshots
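As a rough illustration of the CronJob approach, the sketch below shows how the date-math snapshot name `<snapshot-{now/d}>` becomes the percent-encoded path used in the example, and how the `{"accepted":true}` response mentioned in the CronJob comment could be checked. The helper names `snapshot_path` and `was_accepted` are hypothetical, and no request is actually sent.

```python
import json
from urllib.parse import quote


def snapshot_path(repository: str, name_pattern: str = "<snapshot-{now/d}>") -> str:
    """Build the create-snapshot path with the date-math name percent-encoded."""
    return "/_snapshot/{}/{}".format(
        quote(repository, safe=""),
        quote(name_pattern, safe=""),  # safe="" so that "/" is encoded as %2F
    )


def was_accepted(response_body: str) -> bool:
    """Return True only if the JSON response body contains "accepted": true."""
    try:
        return json.loads(response_body).get("accepted") is True
    except (ValueError, AttributeError):
        # Not JSON, or JSON that is not an object: treat as not accepted.
        return False


print(snapshot_path("my_s3_repository_1"))
print(was_accepted('{"accepted":true}'))
```

Checking the parsed response rather than only the curl exit code would let a job fail, and be retried under `restartPolicy: OnFailure`, when the cluster rejects the snapshot request.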
Future Enhancement
- Depending on usage and user requests, shared-file-system repository capability will be added.
If I may suggest, it might be worth making the proposed snapshot config accept a list. This way it would be possible to configure more than one snapshot policy, e.g., for snapshotting to different repositories (S3 + GCS) or for snapshotting specific index patterns.
Also, ideally the 4 repository plugins listed in the official docs should be supported (which should be easy enough, given that only the repository settings would differ):
- repository-azure
- repository-gcs
- repository-hdfs
- repository-s3
My thoughts on this:
- It should be possible to configure multiple repositories
- At least s3, azure and gcs repositories should be supported so we cover the big 3 cloud providers
- Not sure if this needs to be in the first iteration but we should offer users a way to automatically get regular snapshots (basically the operator should create the cronjob for them)
- The operator should report an error to the user if the needed repository plugin is not in the plugin list
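The plugin check suggested in the last point could look roughly like the sketch below. The mapping from repository type to plugin name follows the plugins listed earlier in this thread; the function name `missing_plugin_error` and the shape of its inputs are assumptions for illustration, not the operator's API.

```python
from typing import List, Optional

# Repository type -> plugin that provides it (plugin names as listed in this thread).
REQUIRED_PLUGIN = {
    "s3": "repository-s3",
    "azure": "repository-azure",
    "gcs": "repository-gcs",
    "hdfs": "repository-hdfs",
}


def missing_plugin_error(repo_type: str, plugins_list: List[str]) -> Optional[str]:
    """Return an error message if the repository type cannot work, else None."""
    required = REQUIRED_PLUGIN.get(repo_type)
    if required is None:
        return f"unsupported snapshot repository type: {repo_type!r}"
    if required not in plugins_list:
        return f"repository type {repo_type!r} requires plugin {required!r} in pluginsList"
    return None


print(missing_plugin_error("s3", ["repository-s3"]))
print(missing_plugin_error("gcs", ["repository-s3"]))
```

Running this validation before calling the cluster API would let the operator surface a clear configuration error instead of a failed repository registration.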
Hey @max-frank, thanks. As @swoehrl-mw mentioned, we can start with the s3, azure and gcs repositories. What I propose is to use snapshot: type: <cloud_provider> followed by the user-configured settings (s3 is used as the example above).
Also @swoehrl-mw, yes: to keep the initial rollout simple, I'm planning to add only the snapshot capability. I would leave the CronJob decision to the user, since there is always an _sm policy that a user can configure from the dashboard, so the operator need not manage this overhead: https://opensearch.org/docs/latest/opensearch/snapshots/sm-api/. Rather than triggering an -XPUT API call from a CronJob, it's far better to use a snapshot-management policy via the dashboard; failing that, the user can always create a CronJob to manage the snapshots. WDYT? @segalziv @idanl21 @dbason
@prudhvigodithi: I suggest changing the config YAML structure a bit:
snapshot:
  repository:
    name: <CUSTOM_NAME>
    type: s3
    settings:
      ##User required settings to connect to s3
      bucket: my-bucket
      another_setting: setting-value
      region: us-east-1
      base_path: os-snapshot
That way, if we later decide to add options to configure snapshot schedules and the like, we can add them under a key like snapshot.schedules.
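To illustrate how this nested layout could also cover the earlier request for multiple repositories, here is a small sketch that accepts `snapshot.repository` as either a single mapping or a list. Nothing in the thread confirms that the operator accepts both shapes; `normalize_repositories` is a hypothetical helper.

```python
from typing import Any, Dict, List


def normalize_repositories(snapshot_cfg: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return snapshot.repository as a list, whether given as a mapping or a list."""
    repo = snapshot_cfg.get("repository")
    if repo is None:
        return []
    repos = repo if isinstance(repo, list) else [repo]
    # Repository names end up in the API path, so they must be unique.
    names = [r["name"] for r in repos]
    duplicates = sorted({n for n in names if names.count(n) > 1})
    if duplicates:
        raise ValueError(f"duplicate repository names: {duplicates}")
    return repos


single = {"repository": {"name": "my-repo", "type": "s3", "settings": {"bucket": "my-bucket"}}}
print([r["name"] for r in normalize_repositories(single)])
```

A sibling key such as `snapshot.schedules` would be unaffected by this normalization, which is the point of grouping repository fields under their own key.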
Something to also consider in terms of backups is the new remote backend storage feature released with 2.3.
While the feature is still experimental for now, it would probably be a good idea to consider it during the design of the configuration here, since it is also based on repositories. Just to make sure that whatever format is chosen here can also support the eventual addition of remote backend storage.
One additional thing this needs for s3-compatible blob stores is setting the endpoint, etc. This has to be done in the opensearch.yml afaict.
https://opensearch.org/docs/latest/opensearch/snapshots/snapshot-restore#register-repository
Added in the last release (v2.3.0) as a BETA feature. Closing the issue; please open a new one for the implementation work that is left for GA.