
Race condition causes inconsistencies in chart repository's index.yaml

Open thedoctor opened this issue 7 years ago • 17 comments

helm s3 push CHART REPO has a race condition when multiple charts are pushed around the same time. It occurs in the following situation:

  1. UserA runs: helm s3 push CHART_A s3://REPO
  2. UserB runs: helm s3 push CHART_B s3://REPO
  3. UserA's process fetches s3://REPO/index.yaml
  4. UserB's process fetches s3://REPO/index.yaml
  5. UserA's process updates its fetched index.yaml with CHART_A's new version and replaces the remote s3://REPO/index.yaml with its updated copy.
  6. UserB's process updates its fetched (and now out-of-date) index.yaml with CHART_B's new version and replaces the remote s3://REPO/index.yaml with its updated copy, which does not contain CHART_A.

At the end of this process, both CHART_A and CHART_B are present in the repository, but CHART_A is missing from the index so any downstream charts that require it will fail when running helm dep update CHART_THAT_DEPENDS_ON_CHART_A.

A simple solution would be for the plugin to create a mutex, e.g. an index.yaml.lock file, before fetching index.yaml, and delete it after replacing index.yaml with the updated version. If the lockfile is already present, the plugin should wait until it has been deleted and a new one can be created before fetching index.yaml and proceeding. In the worst case this could cause cascading delays if many charts are updated frequently, but slow is better than broken.
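To illustrate, here is a rough sketch of that flow, assuming the AWS SDK for Go v1; the bucket name and lock key are placeholders, and the check-then-create below is not atomic on S3 (which is the objection raised in the next comment), so it narrows the race window rather than closing it:

package main

import (
	"bytes"
	"log"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

const (
	bucket  = "REPO"            // placeholder bucket name
	lockKey = "index.yaml.lock" // the proposed lock object
)

// acquireLock waits until no lock object is visible, then creates one.
// Two clients can pass the HeadObject check at the same moment, so this
// does not guarantee mutual exclusion.
func acquireLock(svc *s3.S3) error {
	for {
		_, err := svc.HeadObject(&s3.HeadObjectInput{
			Bucket: aws.String(bucket), Key: aws.String(lockKey),
		})
		if err != nil {
			break // treat "not found" as "lock is free"
		}
		time.Sleep(2 * time.Second)
	}
	_, err := svc.PutObject(&s3.PutObjectInput{
		Bucket: aws.String(bucket),
		Key:    aws.String(lockKey),
		Body:   bytes.NewReader([]byte("locked")),
	})
	return err
}

func releaseLock(svc *s3.S3) error {
	_, err := svc.DeleteObject(&s3.DeleteObjectInput{
		Bucket: aws.String(bucket), Key: aws.String(lockKey),
	})
	return err
}

func main() {
	svc := s3.New(session.Must(session.NewSession()))
	if err := acquireLock(svc); err != nil {
		log.Fatal(err)
	}
	defer releaseLock(svc)
	// fetch index.yaml, add the new chart version, upload index.yaml here
}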

thedoctor avatar Dec 14 '17 00:12 thedoctor

Until a full-on fix arrives, is there a command or script that can regenerate the index.yaml file by listing the files in the repo?

markhu avatar Dec 14 '17 07:12 markhu

Hi!

@thedoctor Yes, there is an obvious race condition. But how would a lock file help? With AWS S3 it is not possible - you would run into the same issue if both clients created the lock file at the same time. There is no atomic "create the file if it does not exist" operation. Here is the consistency model of AWS S3. The index "transactions" can only be implemented with third-party mechanisms, like using a database for locking. If you see a way to easily implement pluggable third-party consistency providers, please suggest it.

I think the best solution at the moment for this plugin is to implement a "reindex" action (as @markhu suggested) that regenerates the repository index on demand. This will partly solve broken indexes, but of course the race condition will still be there.

So, what do you think? I see many requests for a "reindex" operation from users of this plugin, so I will try to implement it this coming weekend.
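To make the idea concrete, here is a rough sketch of what such a reindex could do, assuming the AWS SDK for Go v1 and Helm v3's repo, loader and provenance packages; the bucket name and base URL are placeholders, and the plugin's eventual implementation may well differ:

package main

import (
	"bytes"
	"log"
	"path"
	"strings"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"

	"helm.sh/helm/v3/pkg/chart/loader"
	"helm.sh/helm/v3/pkg/provenance"
	"helm.sh/helm/v3/pkg/repo"
)

func main() {
	const bucket, baseURL = "REPO", "s3://REPO" // placeholder values

	svc := s3.New(session.Must(session.NewSession()))
	idx := repo.NewIndexFile()

	// Walk every chart archive in the bucket and add it to a fresh index.
	err := svc.ListObjectsV2Pages(&s3.ListObjectsV2Input{Bucket: aws.String(bucket)},
		func(page *s3.ListObjectsV2Output, last bool) bool {
			for _, obj := range page.Contents {
				key := aws.StringValue(obj.Key)
				if !strings.HasSuffix(key, ".tgz") {
					continue
				}
				out, err := svc.GetObject(&s3.GetObjectInput{
					Bucket: aws.String(bucket), Key: aws.String(key),
				})
				if err != nil {
					log.Fatal(err)
				}
				buf := new(bytes.Buffer)
				if _, err := buf.ReadFrom(out.Body); err != nil {
					log.Fatal(err)
				}
				out.Body.Close()

				// Read the chart metadata and digest straight from the archive.
				ch, err := loader.LoadArchive(bytes.NewReader(buf.Bytes()))
				if err != nil {
					log.Fatal(err)
				}
				digest, err := provenance.Digest(bytes.NewReader(buf.Bytes()))
				if err != nil {
					log.Fatal(err)
				}
				if err := idx.MustAdd(ch.Metadata, path.Base(key), baseURL, digest); err != nil {
					log.Fatal(err)
				}
			}
			return true // continue paging
		})
	if err != nil {
		log.Fatal(err)
	}

	idx.SortEntries()
	// Write the regenerated index locally; uploading it back to the bucket
	// replaces whatever broken index.yaml was there before.
	if err := idx.WriteFile("index.yaml", 0o644); err != nil {
		log.Fatal(err)
	}
}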

hypnoglow avatar Dec 14 '17 07:12 hypnoglow

Ref: #5

hypnoglow avatar Dec 14 '17 07:12 hypnoglow

Granted, you can't 100% eliminate collisions if, as you say, both clients create a lock file at the same time. But we can still try to narrow the window of opportunity for race conditions. One improvement would be to delay processing the index.yaml file until after the chart.tgz has been uploaded -- here's PR #19 for that.

markhu avatar Dec 14 '17 08:12 markhu

Ah, I (foolishly) didn't realize S3 bucket contents were eventually consistent. In that case, the only comprehensive solution would have to involve a service outside of S3 that implements a real mutex, which is probably beyond the scope of this plugin (although a low-overhead approach that I imagine would work for a lot of people would be a git remote).

If the goal is to reduce the rate of occurrence, I'm guessing that S3's time-to-consistency is probably good enough, often enough, that the lockfile approach would have some impact - but I agree that it's not a complete enough solution to be worth doing.

Reindexing is definitely better than nothing, and probably sufficient for almost everyone. 👍

thedoctor avatar Dec 14 '17 08:12 thedoctor

@thedoctor ,

although a low-overhead way to do it that I imagine would work for a lot of people would be with a git remote

I do not understand what you mean - can you explain it, and whether it could be used to provide a consistent lock on the index file?

As for reindexing, yes, I think it will solve almost all cases of a broken index.

hypnoglow avatar Dec 15 '17 14:12 hypnoglow

Maybe he means storing the index.yaml file in git instead of S3.

markhu avatar Dec 16 '17 00:12 markhu

So, I implemented the idea from #19 in #21

I've also created an issue for reindexing: #22

I think I will keep this issue open for a while - maybe someone will come up with another idea to work around the problem.

hypnoglow avatar Dec 19 '17 17:12 hypnoglow

Hey folks, I rolled out the reindex command in 0.5.0. Let me know whether it works well, and how it suits your workflow - e.g. would you automate it, and if so how (run it X times per day, or run it on some condition), or is an occasional manual reindex enough?

hypnoglow avatar Jan 09 '18 20:01 hypnoglow

Related: https://www.terraform.io/docs/backends/types/s3.html

josdotso avatar Jan 22 '18 21:01 josdotso

Hm, that's interesting - maybe we can implement pluggable third-party consistency providers as I suggested above, with DynamoDB being the first. I will try to find some time to investigate this.
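For reference, a speculative sketch of such a provider, in the spirit of Terraform's S3 backend: a conditional PutItem into DynamoDB is atomic, so it can provide the lock that S3 alone cannot. The table name "helm-s3-locks" and the "LockID" key are assumptions for illustration, not anything the plugin defines; this uses the AWS SDK for Go v1:

package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/awserr"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/dynamodb"
)

const table = "helm-s3-locks" // hypothetical table with a string key "LockID"

// tryLock succeeds for exactly one caller: the conditional PutItem is atomic,
// unlike a check-then-create against S3.
func tryLock(svc *dynamodb.DynamoDB, lockID string) (bool, error) {
	_, err := svc.PutItem(&dynamodb.PutItemInput{
		TableName: aws.String(table),
		Item: map[string]*dynamodb.AttributeValue{
			"LockID": {S: aws.String(lockID)},
		},
		ConditionExpression: aws.String("attribute_not_exists(LockID)"),
	})
	if err != nil {
		if aerr, ok := err.(awserr.Error); ok &&
			aerr.Code() == dynamodb.ErrCodeConditionalCheckFailedException {
			return false, nil // someone else holds the lock
		}
		return false, err
	}
	return true, nil
}

func unlock(svc *dynamodb.DynamoDB, lockID string) error {
	_, err := svc.DeleteItem(&dynamodb.DeleteItemInput{
		TableName: aws.String(table),
		Key: map[string]*dynamodb.AttributeValue{
			"LockID": {S: aws.String(lockID)},
		},
	})
	return err
}

func main() {
	svc := dynamodb.New(session.Must(session.NewSession()))
	ok, err := tryLock(svc, "s3://REPO/index.yaml")
	if err != nil || !ok {
		log.Fatalf("could not acquire lock: ok=%v err=%v", ok, err)
	}
	defer unlock(svc, "s3://REPO/index.yaml")
	// fetch, update and upload index.yaml while holding the lock
}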

hypnoglow avatar Feb 28 '18 21:02 hypnoglow

Another approach could be to make it eventually consistent. When pushing the index.yaml, the helm s3 plugin could also push an index.yaml diff file, like this:

apiVersion: v1
addedEntries:
  alpine:
    - created: 2016-10-06T16:23:20.499814565-06:00
      description: Deploy a basic Alpine Linux pod
      digest: 99c76e403d752c84ead610644d4b1c2f2b453a74b921f422b9dcb8a7c8b559cd
      home: https://k8s.io/helm
      name: alpine
      sources:
      - https://github.com/helm/helm
      urls:
      - https://technosophos.github.io/tscharts/alpine-0.2.0.tgz
      version: 0.2.0

Or, for a delete:

apiVersion: v1
removedEntries:
  alpine:
    - created: 2016-10-06T16:23:20.499814565-06:00
      description: Deploy a basic Alpine Linux pod
      digest: 99c76e403d752c84ead610644d4b1c2f2b453a74b921f422b9dcb8a7c8b559cd
      home: https://k8s.io/helm
      name: alpine
      sources:
      - https://github.com/helm/helm
      urls:
      - https://technosophos.github.io/tscharts/alpine-0.2.0.tgz
      version: 0.2.0

Whenever a helm s3 index update occurs, it should do the following steps:

  1. Upload the diff file for the current index modification
  2. Fetch the current index.yaml file
  3. Fetch all diff files from S3
  4. Apply all diff files (including the one uploaded in step 1) to index.yaml (use the chart digest for merging, and ignore already-applied changes)
  5. Upload the new index.yaml
  6. Delete old diff files (e.g. older than 5 minutes), because it's very unlikely that there will be any conflicting change from before then

Fetching the index.yaml file with the s3 plugin could repeat steps 2-4 for more consistency. Using the repo without the s3 plugin (as a static site) would still give a valid helm repo, but with weaker consistency.

There could also be a command that repeats steps 2-5, so a cron job could provide real eventual consistency.
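For illustration, a rough sketch of the merge in step 4, assuming Helm v3's repo package and a hypothetical diff document shaped like the YAML above (addedEntries / removedEntries); the file names and the indexDiff type are made up for this example and are not part of the plugin:

package main

import (
	"log"
	"os"

	"helm.sh/helm/v3/pkg/repo"
	"sigs.k8s.io/yaml"
)

// indexDiff mirrors the proposed diff file format.
type indexDiff struct {
	APIVersion     string                          `json:"apiVersion"`
	AddedEntries   map[string][]*repo.ChartVersion `json:"addedEntries"`
	RemovedEntries map[string][]*repo.ChartVersion `json:"removedEntries"`
}

// applyDiff merges a diff into the index, using the chart digest to skip
// changes that have already been applied.
func applyDiff(idx *repo.IndexFile, d *indexDiff) {
	if idx.Entries == nil {
		idx.Entries = map[string]repo.ChartVersions{}
	}
	for name, versions := range d.AddedEntries {
		for _, cv := range versions {
			if !hasDigest(idx.Entries[name], cv.Digest) {
				idx.Entries[name] = append(idx.Entries[name], cv)
			}
		}
	}
	for name, versions := range d.RemovedEntries {
		for _, cv := range versions {
			idx.Entries[name] = dropDigest(idx.Entries[name], cv.Digest)
		}
	}
	idx.SortEntries()
}

func hasDigest(versions repo.ChartVersions, digest string) bool {
	for _, cv := range versions {
		if cv.Digest == digest {
			return true
		}
	}
	return false
}

func dropDigest(versions repo.ChartVersions, digest string) repo.ChartVersions {
	out := versions[:0]
	for _, cv := range versions {
		if cv.Digest != digest {
			out = append(out, cv)
		}
	}
	return out
}

func main() {
	idx, err := repo.LoadIndexFile("index.yaml") // fetched in step 2
	if err != nil {
		log.Fatal(err)
	}
	raw, err := os.ReadFile("diff.yaml") // one of the diff files from step 3
	if err != nil {
		log.Fatal(err)
	}
	var d indexDiff
	if err := yaml.Unmarshal(raw, &d); err != nil {
		log.Fatal(err)
	}
	applyDiff(idx, &d) // step 4; the result would then be uploaded in step 5
	if err := idx.WriteFile("index.yaml", 0o644); err != nil {
		log.Fatal(err)
	}
}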

masterada avatar Dec 15 '18 10:12 masterada

Hi, S3 objects have ETags; I'm just checking if there's an option to validate it on the S3 side. If there is, the process would look like this:

  • download the index and remember its ETag
  • update the index
  • upload the new index with the expected previous ETag; if the ETag does not match, the upload fails/is rejected (meaning someone/something changed the index in the meantime) and the client retries (download, update, upload)
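A minimal sketch of that compare-and-swap loop, using the AWS SDK for Go v2; S3 had no server-side precondition on PUT when this comment was written and has only later added conditional writes, so the IfMatch field below is an assumption about a recent SDK, and the bucket/key names are placeholders:

package main

import (
	"bytes"
	"context"
	"io"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

func main() {
	ctx := context.Background()
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}
	client := s3.NewFromConfig(cfg)
	bucket, key := "REPO", "index.yaml" // placeholder names

	for {
		// 1. Download the index and remember its ETag.
		get, err := client.GetObject(ctx, &s3.GetObjectInput{
			Bucket: aws.String(bucket), Key: aws.String(key),
		})
		if err != nil {
			log.Fatal(err)
		}
		body, err := io.ReadAll(get.Body)
		get.Body.Close()
		if err != nil {
			log.Fatal(err)
		}

		updated := updateIndex(body) // 2. add/remove the chart entry locally

		// 3. Upload only if the remote ETag is still the one we downloaded.
		_, err = client.PutObject(ctx, &s3.PutObjectInput{
			Bucket:  aws.String(bucket),
			Key:     aws.String(key),
			Body:    bytes.NewReader(updated),
			IfMatch: get.ETag, // assumed conditional-write support
		})
		if err == nil {
			return // success
		}
		// Treat failure as "someone changed the index in the meantime" and
		// retry the whole download-update-upload cycle; a real implementation
		// would check specifically for a 412 PreconditionFailed here.
		log.Printf("upload rejected, retrying: %v", err)
	}
}

// updateIndex stands in for the real index manipulation.
func updateIndex(b []byte) []byte { return b }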

pete911 avatar Jul 05 '19 15:07 pete911

Just for the record, S3 now supports Object Lock.

demo-ecout avatar Nov 14 '19 01:11 demo-ecout

Any updates?

eserden avatar Mar 17 '21 10:03 eserden

Hi, @hypnoglow

Is this issue still around?

Thanks,

Balazs

balazs-fark-epam avatar Sep 05 '22 14:09 balazs-fark-epam