Rollover current chunk during shutdown

Open vthacker opened this issue 9 months ago • 1 comments

Summary

Every time we do an indexer deploy, data goes missing for ~10 mins until the recovery task completes. Our users notice this, and we want to address it.

Our k8s timeout for graceful pod shutdown can be set to at most 180s internally, after which the pod is forcibly killed.

Astra now uses the S3 native library (CRT client), which uploads and downloads chunks to and from S3 at close to the theoretical max bandwidth of the underlying host.

Quick math:

We configure a chunk rollover size of 15GB. Let's assume the worst case: we are almost at the limit when the shutdown hook is called.

We run our indexers on r5d.24xlarge nodes which have a 25 Gbps Network Bandwidth. Let's assume we have 25 indexer pods on this host that need to upload their chunks before shutdown.

This means each pod gets roughly 1Gbps on average. 15GB is 120Gb, so 15GB @ 1Gbps comes to 120s, i.e. 2 mins.

In the worst case of a 15GB chunk, 2 mins against the 180s limit leaves little margin, so we may or may not make it. But with an average chunk size of 10GB when the shutdown hook is called, we have a very good chance of succeeding.
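To make the estimate easy to re-run with other numbers, here is a minimal sketch of the same arithmetic in Java. The class and method names are made up for illustration, and it assumes decimal units (1GB = 8Gb) and an even split of the host bandwidth across pods, which real traffic will not match exactly.

```java
public class ShutdownUploadEstimate {

    /**
     * Seconds needed to upload one chunk, assuming the host bandwidth is
     * shared evenly by all pods on the host (an idealized assumption).
     */
    public static double uploadSeconds(double chunkGigabytes, double hostGbps, int podsPerHost) {
        double perPodGbps = hostGbps / podsPerHost;   // 25 Gbps / 25 pods = 1 Gbps
        double chunkGigabits = chunkGigabytes * 8;    // decimal GB to Gb
        return chunkGigabits / perPodGbps;
    }

    public static void main(String[] args) {
        double gracePeriodSeconds = 180;              // max graceful shutdown timeout from above
        for (double chunkGb : new double[] {15, 10}) {
            double seconds = uploadSeconds(chunkGb, 25, 25);
            System.out.printf("%.0f GB chunk: %.0f s upload, %.0f s headroom%n",
                    chunkGb, seconds, gracePeriodSeconds - seconds);
        }
    }
}
```

With the numbers from this issue, the 15GB case comes to 120s and the 10GB case to 80s, so the margin under the 180s limit is 60s and 100s respectively, before any protocol overhead or contention.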

vthacker avatar May 09 '24 23:05 vthacker

~Currently the effort is blocked by https://github.com/aws/aws-sdk-java-v2/issues/3963~

That bug report is resolved. So during startup we can call CRT.acquireShutdownRef(), and then in the shutdown hook, after we've uploaded the data to S3, call CRT.releaseShutdownRef().
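A rough sketch of that ordering, not the actual Astra code: the class and method names below are hypothetical, and the CRT calls are passed in as Runnable so the snippet compiles without the aws-crt dependency. In a real indexer the acquire and release arguments would presumably be CRT::acquireShutdownRef and CRT::releaseShutdownRef, and uploadChunk would be whatever rolls over and uploads the active chunk.

```java
public final class ShutdownRolloverSketch {

    /** The work done inside the shutdown hook: upload first, then let the CRT shut down. */
    public static Runnable shutdownSequence(Runnable uploadChunk, Runnable releaseRef) {
        return () -> {
            try {
                uploadChunk.run();
            } finally {
                // Release even if the upload fails, so the acquired reference is not leaked.
                releaseRef.run();
            }
        };
    }

    /** Called once at startup: take the reference, then register the hook. */
    public static Thread install(Runnable acquireRef, Runnable uploadChunk, Runnable releaseRef) {
        acquireRef.run();
        Thread hook = new Thread(shutdownSequence(uploadChunk, releaseRef), "chunk-rollover-hook");
        Runtime.getRuntime().addShutdownHook(hook);
        return hook;
    }

    public static void main(String[] args) {
        // Stand-ins for the real calls, for illustration only.
        install(
                () -> System.out.println("acquire CRT shutdown ref"),
                () -> System.out.println("upload current chunk to S3"),
                () -> System.out.println("release CRT shutdown ref"));
        System.out.println("indexer running, exiting now");
    }
}
```

The point of holding the reference is that the CRT's own shutdown is ref-counted, so it should stay usable until the upload in our hook has finished.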

vthacker avatar May 22 '24 02:05 vthacker

This PR is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 3 days.

github-actions[bot] avatar Jun 22 '24 01:06 github-actions[bot]

In the past this change was not possible since the k8s timeout couldn't be longer than 30 secs after a shutdown was issued. Are longer timeouts on k8s no longer an issue?

Alternatively, would a better solution be to incrementally upload Lucene segments to S3? That would solve 2 problems for us:

  • On a shutdown we would only need to upload the newly created chunks and not all the chunks. This makes shutdowns and deployments fast.
  • This incremental chunk upload functionality can be used for features like making the indexer highly available in the future.
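One way the incremental idea could look, as a sketch under assumptions rather than a proposal for the actual implementation: since Lucene writes index files once and does not modify them afterwards, the set of files still to upload is just the local file names minus the ones already in S3. The class name, method name, and example file names below are hypothetical.

```java
import java.util.Set;
import java.util.TreeSet;

public class IncrementalSegmentUploadSketch {

    /** Index files present locally that have not been uploaded yet, in sorted order. */
    public static Set<String> filesToUpload(Set<String> localIndexFiles, Set<String> alreadyUploaded) {
        Set<String> pending = new TreeSet<>(localIndexFiles);
        pending.removeAll(alreadyUploaded);
        return pending;
    }

    public static void main(String[] args) {
        // Example file names only; a real index directory will contain more files.
        Set<String> local = Set.of("_0.cfs", "_0.si", "_1.cfs", "_1.si", "segments_2");
        Set<String> uploaded = Set.of("_0.cfs", "_0.si");
        System.out.println("to upload: " + filesToUpload(local, uploaded));
    }
}
```

On shutdown only the pending set would need to go to S3, so the upload time would depend on how much was written since the last incremental upload rather than on the full chunk size.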

mansu avatar Jun 28 '24 06:06 mansu

This PR is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 3 days.

github-actions[bot] avatar Jul 29 '24 01:07 github-actions[bot]