
feat: use caching for `directories` entries

Open gabyx opened this issue 4 months ago • 7 comments

Describe the problem/challenge you have

See the discussion below: I changed this request to an "implement caching" request.

Vendir is really slow when you vendor multiple repositories and sources.

Describe the solution you'd like

It would be nice to have a cache feature for entries in `directories` so that vendoring is faster.

So in the example below: vendir would cache the sources from `&ref` once, and the subsequent copying of files for all entries would speed up tremendously.

apiVersion: vendir.k14s.io/v1alpha1
kind: Config
directories:
  - path: ./bla
    contents: &ref
      - path: .
        git:
          url: https://gitlab.com/data-custodian/custodian.git
          ref: main
          depth: 1
        newRootPath: tools

  - path: /a
    contents: *ref

  - path: /b
    contents: *ref

  - path: /c
    contents: *ref

  - path: /d
    contents: *ref

[!NOTE] Vendir should use a two-step process: first download all sources for which it does not yet have a cache entry, then distribute the files into the `directories` entries. This also makes parallel processing trivial for the second step; parallel execution of the first step might be trickier, depending on how the download tools work.
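The two-step idea in the note can be sketched as a planning pass: hash each content config to spot duplicates, so every unique source is downloaded once and then copied to each target. This is a sketch, not vendir's actual internals; `source_key`, `plan`, and the dict-shaped config are hypothetical.

```python
import hashlib
import json

def source_key(content: dict) -> str:
    """Stable hash of a content config, used to spot duplicate sources.
    (Hypothetical helper; vendir's real cache keys may differ.)"""
    return hashlib.sha256(json.dumps(content, sort_keys=True).encode()).hexdigest()

def plan(directories: list) -> tuple:
    """Step 1: unique sources to download; step 2: where to place them."""
    downloads = {}   # key -> content config, fetched only once
    placements = []  # (target path, key): copied from the cache in step 2
    for d in directories:
        for content in d["contents"]:
            key = source_key(content)
            downloads.setdefault(key, content)
            placements.append((d["path"], key))
    return downloads, placements

# With the config above: five directories share one &ref source,
# so a single download serves five placements.
git = {"path": ".",
       "git": {"url": "https://gitlab.com/data-custodian/custodian.git",
               "ref": "main", "depth": 1}}
dirs = [{"path": p, "contents": [git]} for p in ["./bla", "/a", "/b", "/c", "/d"]]
downloads, placements = plan(dirs)
print(len(downloads), len(placements))  # → 1 5
```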


Vote on this request

This is an invitation to the community to vote on issues, to help us prioritize our backlog. Use the "smiley face" at the top right of this comment to vote.

👍 "I would like to see this addressed as soon as possible" 👎 "There are other more important things to focus on right now"

We are also happy to receive and review pull requests if you want to help work on this issue.

gabyx avatar Oct 23 '25 08:10 gabyx

I am trying to work out whether this might be a problem, because vendir relies on underlying tools to do the downloads and such. I'm not sure all the commands we issue can run in parallel; for example, git executes ssh-agent to load the keys git uses. Hmm, this might be something we need to think about a bit. @Zebradil, what is your take here?

joaopapereira avatar Oct 24 '25 20:10 joaopapereira

Short points:

  • the problem exists and it'd be great to solve it
  • the particular solution (its UX) seems more complicated than necessary

I think it should be sufficient to implement parallel processing and let the user decide the degree of parallelism. I'd need to look through the code, but I think vendir already covers concurrent usage of its caches. Vendir needs to be able to detect identical sources across targets in its supplied config (a target is a directory + content) and download each matching source only once for all targets. All unique sources may be downloaded in parallel.
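The "download unique sources in parallel" part might look like the sketch below, where `fetch` is a placeholder for whatever download tool vendir would invoke (git clone, helm pull, ...) and the worker count is the user-chosen degree of parallelism. Names and shapes here are illustrative assumptions, not vendir's API.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_unique(unique_sources: dict, fetch, max_workers: int = 4) -> dict:
    """Download each deduplicated source exactly once, in parallel.

    `unique_sources` maps a source key to its config; `fetch` stands in
    for the real download tool and returns the cached location.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(fetch, unique_sources.values())
        return dict(zip(unique_sources.keys(), results))

# Example with a fake fetcher that records how often it runs.
calls = []
def fake_fetch(cfg):
    calls.append(cfg)
    return f"cache/{cfg['ref']}"

sources = {"abc123": {"url": "https://example.org/repo.git", "ref": "main"}}
cached = fetch_unique(sources, fake_fetch, max_workers=2)
print(cached["abc123"], len(calls))  # → cache/main 1
```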

Below is some distantly related experience.

We (I and the contributors of myks) already dealt with a similar issue. In myks we use vendir under the hood to download sources for rendering Kubernetes applications. In many cases, vendir sources are duplicated; for example, we may need to download a particular version of some Helm chart 15 times. It wasn't possible for us to solve this challenge via vendir alone, because we already run vendir in parallel for multiple applications.

To solve this, we do the following:

  • process each target (directory+content) separately
  • derive hash for every content configuration to be able to find duplicates
  • download a source for each target separately, using locking by the source hash
  • all content source configs must have lazy: true for this to work, so that vendir doesn't re-download already present sources
  • link the downloaded source to its final destination, so that it's reused by multiple applications
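A minimal sketch of the hash + lock + lazy + link recipe above. In-process locks and a fake downloader stand in for myks's real cross-process locking and vendir's `lazy: true` handling; all names here are hypothetical.

```python
import hashlib
import json
import os
import tempfile
import threading
from collections import defaultdict

_locks = defaultdict(threading.Lock)  # one lock per source hash

def ensure_source(content: dict, cache_dir: str, download) -> str:
    """Download a source into the cache at most once, guarded by its hash."""
    h = hashlib.sha256(json.dumps(content, sort_keys=True).encode()).hexdigest()
    path = os.path.join(cache_dir, h)
    with _locks[h]:                   # locking by the source hash
        if not os.path.exists(path):  # lazy: skip if already present
            download(content, path)
    return path

def link_into(source_path: str, dest: str) -> None:
    """Link the cached source to its final destination so it is reused."""
    os.makedirs(os.path.dirname(dest) or ".", exist_ok=True)
    if not os.path.lexists(dest):
        os.symlink(source_path, dest)

# Two targets sharing one config trigger exactly one download.
hits = []
def fake_download(cfg, path):
    hits.append(cfg)
    os.makedirs(path)

cache = tempfile.mkdtemp()
cfg = {"url": "https://example.org/chart.tgz"}
p1 = ensure_source(cfg, cache, fake_download)
p2 = ensure_source(cfg, cache, fake_download)
print(p1 == p2, len(hits))  # → True 1
```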

I don't think this is directly applicable to vendir, but perhaps vendir can utilize its cache more efficiently (I think there were issues with git sources: even if a repo is in the cache, refs are still fetched from the remote, which is slow for big repos like ArgoCD).

Zebradil avatar Oct 24 '25 21:10 Zebradil

We also vendor from a lot of the same sources, placing files at different locations from the same source. YAML anchors make this easy, but since vendir has no support for parallelism or caching, this is still a pain to work with.

gabyx avatar Oct 25 '25 09:10 gabyx

So to clarify: my parallel-execution request is basically an "implement a cache system" request, and that is the first thing that should happen. Executing in parallel only means that all downloading of new sources (with --locked, of course) should be done in parallel, but that can be tackled later.

gabyx avatar Oct 25 '25 09:10 gabyx

@Zebradil, @joaopapereira I changed the title and description towards caching, as this is the problem at hand.

gabyx avatar Oct 25 '25 09:10 gabyx

This issue is being marked as stale due to a long period of inactivity and will be closed in 5 days if there is no response.

github-actions[bot] avatar Dec 05 '25 00:12 github-actions[bot]

Not stale

Zebradil avatar Dec 05 '25 00:12 Zebradil