astra
Rollover current chunk during shutdown
Summary
Every time we do an indexer deploy, data goes missing for ~10 mins until the recovery task completes. Our users notice this, and we want to address it.
Our k8s timeout for a pod's graceful shutdown can be set to at most 180s internally, after which the pod is forcibly killed.
Astra now uses the S3 native library (CRT client), which uploads and downloads chunks to and from S3 at close to the theoretical maximum throughput of the underlying host.
Quick math:
We configure a chunk rollover size of 15GB. Let's assume the worst case: we are almost at the limit when the shutdown hook is called.
We run our indexers on r5d.24xlarge nodes, which have 25 Gbps of network bandwidth. Let's assume we have 25 indexer pods on this host that need to upload their chunks before shutdown, meaning each pod gets roughly 1Gbps on average. 15GB (120 gigabits) @ 1Gbps comes to 2 mins.
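In the worst case of a 15GB chunk, we may or may not make it: 2 mins of upload leaves only about a minute of the 180s budget for everything else in the shutdown. But with an average chunk size of 10GB (roughly 80s at 1Gbps) when the shutdown hook is called, we have a very good chance of succeeding.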
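To make the estimate easy to re-run with different numbers, here is a small sketch of the same arithmetic. It assumes the host bandwidth is split evenly across pods and uses decimal units (1 GB = 8 gigabits); the class and method names are made up for illustration and are not part of Astra.

```java
// Back-of-the-envelope upload budget for a chunk rollover during shutdown.
// Assumption: host bandwidth is shared evenly by all indexer pods on the node.
class RolloverBudget {

    /** Seconds needed for one pod to upload a chunk of chunkGb gigabytes. */
    static double uploadSeconds(double chunkGb, double hostGbps, int podsPerHost) {
        double perPodGbps = hostGbps / podsPerHost; // even split (assumption)
        double chunkGigabits = chunkGb * 8;         // 1 byte = 8 bits
        return chunkGigabits / perPodGbps;
    }

    public static void main(String[] args) {
        double gracePeriodSeconds = 180; // max k8s graceful shutdown timeout from the summary
        for (double chunkGb : new double[] {10, 15}) {
            double seconds = uploadSeconds(chunkGb, 25, 25);
            System.out.printf("%.0f GB chunk: %.0f s upload, %.0f s headroom%n",
                    chunkGb, seconds, gracePeriodSeconds - seconds);
        }
    }
}
```

With the numbers above, a 15GB chunk needs 120s and a 10GB chunk needs 80s. Both fit in 180s on paper, but the estimate ignores S3 request overhead and anything else the shutdown has to do.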
~Currently the effort is blocked by https://github.com/aws/aws-sdk-java-v2/issues/3963~
That bug report is resolved. So during startup we can call `CRT.acquireShutdownRef();` and then, in the shutdown hook, call `CRT.releaseShutdownRef();` after we've uploaded the data to S3.
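A rough sketch of how that ordering could look. To keep it self-contained it takes the acquire, upload, and release steps as plain `Runnable`s; in Astra the first and last would presumably be `CRT::acquireShutdownRef` and `CRT::releaseShutdownRef`, and the upload step would be whatever rolls over the current chunk. The class name, the `uploadCurrentChunk` callback, and the timeout parameter are all hypothetical.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper: hold the native shutdown ref from startup until the
// final chunk upload has finished (or timed out) in the shutdown hook.
class ShutdownRollover {
    private final Runnable uploadCurrentChunk; // hypothetical: rolls over and uploads the active chunk
    private final Runnable releaseNativeRef;   // e.g. CRT::releaseShutdownRef
    private final long timeoutSeconds;         // should stay below the k8s grace period

    ShutdownRollover(Runnable acquireNativeRef, Runnable uploadCurrentChunk,
                     Runnable releaseNativeRef, long timeoutSeconds) {
        this.uploadCurrentChunk = uploadCurrentChunk;
        this.releaseNativeRef = releaseNativeRef;
        this.timeoutSeconds = timeoutSeconds;
        acquireNativeRef.run(); // at startup, e.g. CRT::acquireShutdownRef
    }

    /** Runs the upload with a deadline; returns true only if it completed in time. */
    boolean rolloverOnShutdown() {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            Future<?> upload = executor.submit(uploadCurrentChunk);
            upload.get(timeoutSeconds, TimeUnit.SECONDS);
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } catch (ExecutionException | TimeoutException e) {
            return false; // upload failed or ran out of time; recovery task has to cover it
        } finally {
            executor.shutdownNow();
            releaseNativeRef.run(); // always release, even if the upload failed
        }
    }

    /** Registers the rollover as a JVM shutdown hook. */
    void install() {
        Runtime.getRuntime().addShutdownHook(
                new Thread(this::rolloverOnShutdown, "chunk-rollover-on-shutdown"));
    }
}
```

The release sits in a `finally` block so the ref is released even when the upload fails or times out. The deadline keeps the hook from running past the pod's grace period.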
This PR is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 3 days.
In the past this change was not possible, since the k8s timeout couldn't be longer than 30 secs after a shutdown was issued. Are longer timeouts on k8s no longer an issue?
Alternatively, would a better solution be to incrementally upload Lucene segments to S3? That would solve 2 problems for us:
- On shutdown we would only need to upload the newly created segments rather than the whole chunk. This makes shutdowns and deployments fast.
- This incremental chunk upload functionality can be used for features like making the indexer highly available in the future.