feat: use caching for `directories` entries
Describe the problem/challenge you have
See discussion below: I changed this request to an "implement caching" request.
Vendir is really slow if you clone multiple repos and sources.
Describe the solution you'd like
It would be nice to have a cache feature for entries in `directories` so that vendoring is faster.
In the example below, vendir would cache the source from `&ref` once, and the copying of files for all entries would then speed up tremendously.
```yaml
apiVersion: vendir.k14s.io/v1alpha1
kind: Config
directories:
- path: ./bla
  contents: &ref
  - path: .
    git:
      url: https://gitlab.com/data-custodian/custodian.git
      ref: main
      depth: 1
    newRootPath: tools
- path: /a
  contents: *ref
- path: /b
  contents: *ref
- path: /c
  contents: *ref
- path: /d
  contents: *ref
```
> [!NOTE]
> Vendir should use a two-step process: first download all sources (for which it does not have a cache entry), then distribute the files into the `directories` entries. This also makes parallel processing trivial for the second step; parallel execution of the first step might be trickier depending on how the download tools work.
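The two-step idea above can be sketched roughly as follows. This is a minimal illustration, not vendir's actual implementation: the `download` function is a hypothetical placeholder for the real fetch (git clone, image pull, etc.), and duplicate sources are detected by hashing each content config.

```python
import concurrent.futures
import hashlib
import json

# Hypothetical target list mirroring the YAML above: several paths share one source.
source = {"git": {"url": "https://gitlab.com/data-custodian/custodian.git",
                  "ref": "main", "depth": 1}}
targets = [{"path": p, "contents": source} for p in ["/a", "/b", "/c", "/d"]]

def source_key(cfg):
    # A stable hash of the content config identifies duplicate sources.
    return hashlib.sha256(json.dumps(cfg, sort_keys=True).encode()).hexdigest()

def download(cfg):
    # Placeholder for the real fetch; returns a cache location for the source.
    return f"cache/{source_key(cfg)[:12]}"

# Step 1: download each unique source once, in parallel.
unique = {source_key(t["contents"]): t["contents"] for t in targets}
with concurrent.futures.ThreadPoolExecutor() as pool:
    cache = dict(zip(unique, pool.map(download, unique.values())))

# Step 2: distribute cached files into every target directory
# (sequential here, but trivially parallelizable).
plan = [(cache[source_key(t["contents"])], t["path"]) for t in targets]
print(len(unique), len(plan))  # one download serves four targets: prints "1 4"
```

Here the four targets from the YAML anchor collapse into a single download, which is exactly where the speedup comes from.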
Vote on this request
This is an invitation to the community to vote on issues, to help us prioritize our backlog. Use the "smiley face" up to the right of this comment to vote.
👍 "I would like to see this addressed as soon as possible" 👎 "There are other more important things to focus on right now"
We are also happy to receive and review Pull Requests if you want to help work on this issue.
I am trying to think whether this might be a problem, because vendir relies on underlying tools to do the downloading. I am not sure all the commands we issue can run in parallel; for example, git executes the ssh agent to load keys. Hmm, this might be something we need to think about a bit. @Zebradil, what is your take here?
Short points:
- the problem exists and it'd be great to solve it
- the particular solution (its UX) seems to be more complicated than necessary
I think it should be sufficient to implement parallel processing and let the user decide on the degree of parallelism. I'd need to look through the code, but I think vendir already covers concurrent usage of its caches. Vendir needs to be able to detect identical sources across targets in its supplied config (a target is a directory + content) and download each matching source only once for all targets. All unique sources may be downloaded in parallel.
Below is some distantly related experience.
We (I and the contributors of myks) already dealt with a similar issue. In myks we use vendir under the hood to download sources for rendering Kubernetes applications. In many cases, vendir sources are duplicated; for example, we may need to download a particular version of some Helm chart 15 times. It wasn't possible for us to solve this via vendir alone, because we already run vendir in parallel for multiple applications.
To solve this, we do the following:
- process each target (directory+content) separately
- derive a hash for every content configuration to be able to find duplicates
- download the source for each target separately, using locking by the source hash
- all content source configs must have `lazy: true` for this to work, so that vendir doesn't re-download already present sources
- link the downloaded source to its final destination, so that it's reused by multiple applications
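The lock-by-hash step above can be sketched like this. This is a simplified illustration of the idea, not myks's actual code: `fetch_once` and `config_hash` are hypothetical names, the real download is stubbed out with a directory creation, and the file locking uses `fcntl.flock`, so it is Unix-only.

```python
import fcntl
import hashlib
import json
import os
import tempfile

def config_hash(cfg):
    # Identical content configs map to the same lock file and cache entry.
    return hashlib.sha256(json.dumps(cfg, sort_keys=True).encode()).hexdigest()[:16]

def fetch_once(cfg, cache_dir):
    """Download a source at most once across concurrent runs.

    Takes an exclusive flock on a per-hash lock file, so parallel processes
    competing for the same source serialize: whoever wins downloads, the rest
    find the cache entry already present (the effect of `lazy: true`).
    """
    os.makedirs(cache_dir, exist_ok=True)
    h = config_hash(cfg)
    dest = os.path.join(cache_dir, h)
    with open(os.path.join(cache_dir, h + ".lock"), "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)
        if not os.path.exists(dest):  # not cached yet: this process downloads
            os.makedirs(dest)         # placeholder for the real download
        fcntl.flock(lock, fcntl.LOCK_UN)
    return dest  # caller links this into the target path

cache = tempfile.mkdtemp()
cfg = {"git": {"url": "https://gitlab.com/data-custodian/custodian.git", "ref": "main"}}
a = fetch_once(cfg, cache)
b = fetch_once(cfg, cache)  # second call reuses the same cache entry
```

The second call finds the cache entry and skips the download, and both callers link the same directory into their target paths.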
I don't think this is directly applicable to vendir, but perhaps vendir can utilize its cache more efficiently (I think there were issues with git sources: even if a repo is in the cache, refs are still fetched from the remote, which is slow for big repos like ArgoCD).
We also vendor from many of the same sources, placing files at different locations from the same source.
YAML anchors make this easy, but since vendir has no support for parallelism or caching, it is still a pain to work with.
So to clarify: my parallel-execution request is basically an "implement a cache system" request, and that is the first thing that should happen. Executing in parallel only means that all downloading of new sources (with `--locked`, of course) should be done in parallel, but that can be tackled later.
@Zebradil, @joaopapereira I changed the title and description towards caching, as this is the problem at hand.
This issue is being marked as stale due to a long period of inactivity and will be closed in 5 days if there is no response.
Not stale