git-client-plugin

[JENKINS-64383] combined refrepo became our bottleneck, support a fanout location too

Open jimklimov opened this issue 5 years ago • 13 comments

JENKINS-64383 - combined refrepo became our bottleneck

As detailed in the JIRA issue, our heavy use of a single combined reference repository made it more of a bottleneck and a cause of job timeouts than the speedup and reliability improvement it once was. This PR explores a way to keep a single point of configuration for the reference repository directory, suffixed with a "magic variable" that is replaced by the path to a subdirectory holding a smaller-scope reference repository for a particular source Git URL. On file systems with symlinks it is possible to maintain several such names that point to the same directory, for closely related repositories or different URLs of the same repository.

This PoC introduces trivial support for reference repository paths ending with /${GIT_URL}: the suffix is replaced by the URL, which yields a "funny" directory subtree in the filesystem. Its limitation at the moment is that the URL is pasted in verbatim. This works on Linux and Unix-like systems, which only forbid 0x00 and a slash as characters in a filename, and the slash suits us as a directory subtree separator. This code likely won't run on Windows as is (the colon in https: and likely other chars; Microsoft has an extensive list of invalid chars).

The next ideas, commented but not yet PoCed, are to either escape such characters (non-ASCII and offensive to at least one popular filesystem), or convert URLs into base64 strings or sha/md5/... hashes. Using submodules and finding a way to map several URLs to a certain submodule might be a good idea if they keep indexes separately. This all can be built on top of this PoCed code by introducing further suffixes and handling for them.
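As an illustration of the "escape such characters" idea above, here is a minimal sketch, not code from this PR. The class and method names are hypothetical, and the choice of safe set (ASCII letters, digits, dot, underscore, dash) is an assumption picked to be acceptable to the common filesystems. Everything else is percent-escaped, so the whole URL collapses into one directory name with no colon or slash in it.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper, not part of the plugin: turns a Git URL into a single directory name. */
class UrlDirNameEscaper {
    /** Keeps only [A-Za-z0-9._-]; every other UTF-8 byte becomes %XX (uppercase hex). */
    static String escape(String url) {
        StringBuilder sb = new StringBuilder();
        for (byte b : url.getBytes(StandardCharsets.UTF_8)) {
            int c = b & 0xFF;
            boolean safe = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
                    || (c >= '0' && c <= '9') || c == '.' || c == '_' || c == '-';
            if (safe) {
                sb.append((char) c);
            } else {
                sb.append(String.format("%%%02X", c));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Prints: https%3A%2F%2Fgithub.com%2Fzeromq%2Fczmq.git
        System.out.println(escape("https://github.com/zeromq/czmq.git"));
    }
}
```

This sketch does not deal with path length limits or with names that are reserved on some systems, which is one reason a fixed-length hash may be the more practical option.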

It was tested on a MultiBranch pipeline job, where an original definition of the reference repository was suffixed with the new magic string, yielding /home/abuild/jenkins-gitcache/${GIT_URL} (verbatim in "Advanced clone behaviours"). During the checkout into a wiped workspace, with this plugin variant installed:

Cloning the remote Git repository
Cloning repository https://github.com/zeromq/czmq.git
 > git init /dev/shm/jenkins-swarm-client/workspace/CZMQ-upstream_master # timeout=10
[WARNING] Parameterized reference path replaced with: /home/abuild/jenkins-gitcache/https://github.com/zeromq/czmq.git
Using reference repository: /home/abuild/jenkins-gitcache/https://github.com/zeromq/czmq.git
Fetching upstream changes from https://github.com/zeromq/czmq.git
 > git --version # timeout=10
 > git --version # 'git version 2.1.4'
 > git fetch --tags --progress https://github.com/zeromq/czmq.git +refs/heads/*:refs/remotes/origin/* # timeout=40

Avoid second fetch
Checking out Revision fbe313cd2010bace7833fe52d419f82282343bd9 (master)

 > git config remote.origin.url https://github.com/zeromq/czmq.git # timeout=10
 > git config --add remote.origin.fetch +refs/heads/*:refs/remotes/origin/* # timeout=10
 > git config core.sparsecheckout # timeout=10
 > git checkout -f fbe313cd2010bace7833fe52d419f82282343bd9 # timeout=10

Commit message: "Merge pull request #2139 from bluca/ci_failures"
 > git rev-list --no-walk fbe313cd2010bace7833fe52d419f82282343bd9 # timeout=10

This completed quickly, much faster than the usual checkout with huge refrepo in original /home/abuild/jenkins-gitcache/, and did automatically find the "funny" /home/abuild/jenkins-gitcache/https://github.com/zeromq/czmq.git directory prepared with the single repo's reference cache:

# ls -la /home/abuild/jenkins-gitcache/https://github.com/zeromq/czmq.git
total 38
drwxr-xr-x 7 4294967294 4294967294   12 Dec  7 19:31 .
drwxr-xr-x 3 4294967294 4294967294    3 Dec  7 19:29 ..
-rw-r--r-- 1 4294967294 4294967294 2353 Dec  7 19:31 FETCH_HEAD
-rw-r--r-- 1 4294967294 4294967294   23 Dec  7 19:30 HEAD
drwxr-xr-x 2 4294967294 4294967294    2 Dec  7 19:30 branches
-rwxr--r-- 1 4294967294 4294967294  204 Dec  7 19:30 config
-rw-r--r-- 1 4294967294 4294967294   73 Dec  7 19:30 description
drwxr-xr-x 2 4294967294 4294967294   11 Dec  7 19:30 hooks
drwxr-xr-x 2 4294967294 4294967294    3 Dec  7 19:30 info
drwxr-xr-x 4 4294967294 4294967294    4 Dec  7 19:30 objects
drwxr-xr-x 5 4294967294 4294967294    5 Dec  7 19:31 refs
lrwxrwxrwx 1 4294967294 4294967294   43 Dec  7 19:30 register-git-cache.sh -> /mnt/jenkins-gitcache/register-git-cache.sh

DOCS NOTE: With 2.36.x and newer Git versions, if your reference repository maintenance script runs as a different user account than the Jenkins server (or Jenkins agent), safety checks about safe.directory (see https://github.blog/2022-04-18-highlights-from-git-2-36/) can be disabled by configuring each such user account:

:; git config --global --add safe.directory '*'

UPDATE: My repository at https://github.com/jimklimov/git-refrepo-scripts provides the shell scripts and Jenkinsfile jobs I use to maintain the servers using this modification of the Git Client plugin. One of the jobs there can automatically discover and register Git repositories used by known builds on the server it runs on (might run daily or so), and another can run more regularly to update the known refrepos.

jimklimov avatar Dec 07 '20 23:12 jimklimov

Status update:

The current version of this plugin PR is already bringing value in the experiments with the GIT_SUBMODULES token interpretation, although parsing the actual submodule data (.gitmodules) is not yet completed. In fact, that seems like a nice speed-up to get into the right directory if the needed repository is exactly the one named in the URL configured by the submodule definition, but it is less helpful for co-hosting forks of the same repository in the same directory (the mode with still a combined repo with several remotes, but much smaller scopes to walk across much more relevant commit objects).

The "fallback" modes of just looking for subdirectories that are git repositories, and recursing into such to inspect the remotes' configurations there, are more I/O intensive but already work :)
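To make the fallback idea concrete, here is a rough sketch under stated assumptions, not the plugin's actual code. All names are hypothetical. It only looks one level deep (no recursion), and it reads "url = ..." lines from each repository's config file with a crude text match, where real code would more likely ask git itself about the remotes.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Optional;

/** Hypothetical sketch: find the subdirectory whose repository has a remote with the given URL. */
class RefRepoFallbackScan {
    /** Returns the git dir of a bare or non-bare repository at dir, or empty if dir is not one. */
    static Optional<Path> gitDirOf(Path dir) {
        for (Path candidate : List.of(dir, dir.resolve(".git"))) {
            if (Files.isRegularFile(candidate.resolve("HEAD"))
                    && Files.isDirectory(candidate.resolve("objects"))
                    && Files.isRegularFile(candidate.resolve("config"))) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty();
    }

    /** ASSUMPTION: a simple "url = <value>" line match is good enough for this illustration. */
    static boolean hasRemoteUrl(Path gitDir, String url) throws IOException {
        for (String line : Files.readAllLines(gitDir.resolve("config"))) {
            int eq = line.indexOf('=');
            if (eq > 0 && line.substring(0, eq).trim().equals("url")
                    && line.substring(eq + 1).trim().equals(url)) {
                return true;
            }
        }
        return false;
    }

    /** Scans the direct subdirectories of base; every candidate costs extra file reads. */
    static Optional<Path> findSubdirFor(Path base, String url) throws IOException {
        try (DirectoryStream<Path> subdirs = Files.newDirectoryStream(base, p -> Files.isDirectory(p))) {
            for (Path sub : subdirs) {
                Optional<Path> gitDir = gitDirOf(sub);
                if (gitDir.isPresent() && hasRemoteUrl(gitDir.get(), url)) {
                    return Optional.of(sub);
                }
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 2) {
            System.out.println(findSubdirFor(Path.of(args[0]), args[1]));
        }
    }
}
```

Even in this reduced form the cost is visible: each subdirectory needs several stat calls and a config read, which is where the extra I/O comes from.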

On the unit-testing side, I'll probably deprecate the original GIT_URL token expansion (which just expects the original URL made into a directory tree on a local filesystem): while it was useful to get the feet wet, and "just worked" on illumos and Linux systems, it does not on Windows (as anticipated, with : being a reserved character), and more useful and portable tokens were designed since that PoC step so there's not much use to complicate matters into making this mode work everywhere by mangling paths or something. The GIT_URL_SHA256 already mangles it by hashing the normalized URL string as a subdirectory name, and GIT_URL_BASENAME strips the tree part and just relies on the final path component like "git-client-plugin(.git)" to match as a subdirectory name in the refrepo.
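The two token expansions described above could be sketched roughly as follows. This is an illustration only: the class and method names are hypothetical, and the normalization rules (trim, lowercase, drop trailing slashes and a trailing ".git") are an assumption loosely modeled on how normalized URLs appear in this PR's test logs, not a statement of what the plugin does.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Locale;

/** Hypothetical sketch of GIT_URL_SHA256 and GIT_URL_BASENAME style expansions. */
class RefRepoTokenSketch {
    /** ASSUMED normalization: trim, lowercase, drop trailing slashes and a trailing ".git". */
    static String normalize(String url) {
        String s = url.trim().toLowerCase(Locale.ROOT);
        while (s.endsWith("/")) {
            s = s.substring(0, s.length() - 1);
        }
        if (s.endsWith(".git")) {
            s = s.substring(0, s.length() - 4);
        }
        return s;
    }

    /** Lowercase hex SHA-256 of the UTF-8 bytes of text. */
    static String sha256Hex(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(text.getBytes(StandardCharsets.UTF_8))) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 must be provided by every Java platform implementation
            throw new IllegalStateException(e);
        }
    }

    /** Final path component of the normalized URL, e.g. "git-client-plugin". */
    static String basename(String url) {
        String s = normalize(url);
        int cut = Math.max(s.lastIndexOf('/'), Math.max(s.lastIndexOf('\\'), s.lastIndexOf(':')));
        return s.substring(cut + 1);
    }

    /** Replaces either token in the configured reference path. */
    static String expand(String reference, String url) {
        return reference
                .replace("${GIT_URL_SHA256}", sha256Hex(normalize(url)))
                .replace("${GIT_URL_BASENAME}", basename(url));
    }

    public static void main(String[] args) {
        String url = "https://github.com/jenkinsci/git-client-plugin.git";
        System.out.println(expand("/cache/${GIT_URL_BASENAME}", url)); // /cache/git-client-plugin
        System.out.println(expand("/cache/${GIT_URL_SHA256}", url));   // /cache/<64 hex characters>
    }
}
```

The hash variant yields a fixed-length name that is valid on every filesystem but unreadable to humans; the basename variant is readable but ambiguous when unrelated repositories share a name.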

jimklimov avatar Jan 15 '21 09:01 jimklimov

This solution has been in production on one of our CI farms for the past week or two, and works well for all of:

  • legacy jobs;
  • pipeline jobs defined "directly";
  • Organization Folders (GitHub and BitBucket) which is one place where we can set a fixed string for refrepo config, and that value is inherited into generated MultiBranch Pipeline jobs for each repo, and leaf pipelines for each branch, PR or tag;
  • several pipeline jobs that have explicit "checkout" steps to combine code from various repositories, inheriting common settings (including refrepo string) from a folder, and set by a string (build arg):
    scm_src_checkout_extensions_array << [$class: 'CloneOption', timeout: SRC_CHECKOUT_TIMEOUT,
        noTags: false, shallow: false, reference: "${PROJECT_GITCACHE}"]
  • a very legacy job with a shellscript-driven git checkout of numerous sub-repos for some analysis, that previously accepted a reference repository argument, proved not difficult to extend to handle the fanned-out repositories (picking a suitable smaller-scoped directory per repo URL)

Over the past few days I have been updating https://github.com/jimklimov/git-scripts/blob/master/register-git-cache.sh to also handle this mode of shepherding the fanned-out refrepo repository more efficiently. Overall, an update of the refrepo that recently took close to a day (with submodule discovery walks) with the monolith directory with hundreds of remotes set up, went down to a reasonable 3-5 minute range with hundreds of subdirectories and just a few closely related remote URLs tracked in each.

jimklimov avatar Jan 31 '21 17:01 jimklimov

Hopefully nailed the Windows issue now... at least this output looks right (note: my local paths had to be redacted so sha256 does not match the strings you see):

...
=== Beginning to search for cloneDirName='clone.git' from C:\Users\jenkins\shared\git-client-plugin\.
Looking for 'target' in C:\Users\jenkins\shared\git-client-plugin\.
Looking for cloneDirName clone.git in C:\Users\jenkins\shared\git-client-plugin\.\target
FOUND cloneDirName clone.git in C:\Users\jenkins\shared\git-client-plugin\.\target
clone output:
======
[INFO] The git reference repository path is parameterized, it may take a few git queries logged below to resolve it into a particular directory name;
[WARNING] Parameterized reference path
  'C:\Users\jenkins\shared\git-client-plugin\target\refrepo256.git/${GIT_URL_SHA256}'
  replaced with:
  'C:\Users\jenkins\shared\git-client-plugin\target\refrepo256.git\3e56f2ff966ccc8ee3ac0df9faf4adfcd8ff38ef81cc062abd8bc9aae526dc91';

Using reference repository: C:\Users\jenkins\shared\git-client-plugin\target\refrepo256.git\3e56f2ff966ccc8ee3ac0df9faf4adfcd8ff38ef81cc062abd8bc9aae526dc91;

[junit625123998984275238] $ git remote -v
...

and in a more detailed case:

=== Beginning to search for cloneDirName='refrepo256.git/3e56f2ff966ccc8ee3ac0df9faf4adfcd8ff38ef81cc062abd8bc9aae526dc91' from C:\Users\jenkins\shared\git-client-plugin\.

Looking for 'target' in C:\Users\jenkins\shared\git-client-plugin\.

Looking for cloneDirName refrepo256.git/3e56f2ff966ccc8ee3ac0df9faf4adfcd8ff38ef81cc062abd8bc9aae526dc91
  in C:\Users\jenkins\shared\git-client-plugin\.\target

FOUND cloneDirName refrepo256.git/3e56f2ff966ccc8ee3ac0df9faf4adfcd8ff38ef81cc062abd8bc9aae526dc91
  in C:\Users\jenkins\shared\git-client-plugin\.\target

wsRefrepoBase='C:\Users\jenkins\shared\git-client-plugin\target\refrepo256.git'

wsRefrepo='C:\Users\jenkins\shared\git-client-plugin\target\refrepo256.git\3e56f2ff966ccc8ee3ac0df9faf4adfcd8ff38ef81cc062abd8bc9aae526dc91

reference='C:\Users\jenkins\shared\git-client-plugin\target\refrepo256.git/${GIT_URL_SHA256}'

url='C:\Users\jenkins\shared\git-client-plugin\target\clone.git'

urlNormalized='file://c:\users\jenkins\shared\git-client-plugin\target\clone'

Feb 22, 2021 4:24:29 PM org.jenkinsci.plugins.gitclient.LegacyCompatibleGitAPIImpl findParameterizedReferenceRepository

INFO: Trying to resolve parameterized Git reference repository
  'C:\Users\jenkins\shared\git-client-plugin\target\refrepo256.git/${GIT_URL_SHA256}'
  into a specific (sub-)directory to use for URL
  'C:\Users\jenkins\shared\git-client-plugin\target\clone.git' ...

reference after='C:\Users\jenkins\shared\git-client-plugin\target\refrepo256.git/3e56f2ff966ccc8ee3ac0df9faf4adfcd8ff38ef81cc062abd8bc9aae526dc91'


Feb 22, 2021 4:24:29 PM org.jenkinsci.plugins.gitclient.LegacyCompatibleGitAPIImpl findParameterizedReferenceRepository
INFO: After resolving the parameterized Git reference repository, decided to use
  'C:\Users\jenkins\shared\git-client-plugin\target\refrepo256.git/3e56f2ff966ccc8ee3ac0df9faf4adfcd8ff38ef81cc062abd8bc9aae526dc91'
  directory for URL 'C:\Users\jenkins\shared\git-client-plugin\target\clone.git'

So hopefully this PR will go out of "draft" mode when Jenkins confirms this too :)

jimklimov avatar Feb 22 '21 15:02 jimklimov

Disclaimer: I expect people better versed in Java would suggest other ways to solve some parts of this PR... Commits welcome, but I can't guarantee fixing code myself in language I don't understand too well still :) Also I'm getting more and more entangled in dayjob and other projects, so time is scarce and lags are long - so maybe later cleanup-PRs would work better.

jimklimov avatar Feb 22 '21 16:02 jimklimov

UPDATE: Relocated development of the relevant files from the common mess in my git-scripts repo mentioned above into https://github.com/jimklimov/git-refrepo-scripts, which is dedicated to this subject.

By now the primary reasonable use-case (balancing the re-use of index for probably-related repos vs. independent storage of probably unrelated ones) is to suffix a literal /${GIT_SUBMODULES} to the reference repository path specified in the Advanced checkout (clone) behaviors or checkout([...]) pipeline step, and use the script above to (re-)populate that location. If you already have an older combined refrepo stored there, probably you can use it to speed up such re-population, though beware - by premise of this story it may be actually a slow-down compared to a new fetch from remote.

jimklimov avatar Mar 15 '21 21:03 jimklimov

That's odd: https://ci.jenkins.io/blue/organizations/jenkins/Plugins%2Fgit-client-plugin/detail/PR-644/96/pipeline reports no step/stage errors but ended up red

jimklimov avatar Jul 29 '21 08:07 jimklimov

Gentle bump, still works for us and helps a lot, still annoying to keep rolling private HPI builds at upgrade cycles =D

jimklimov avatar Jul 29 '21 08:07 jimklimov

And another gentle bump :)

jimklimov avatar Sep 08 '21 12:09 jimklimov

Faults in tests regarding git CLI tool messages do not seem related to this PR's changes. I'll add a fix to this branch for clarity, but it deserves to be a separate PR -- already posted as #886 (UPDATE: ...and #896) :)

jimklimov avatar Aug 01 '22 10:08 jimklimov

Thanks, I was afraid it would come to tests... for years :)

Just in case: how relevant would jgit and apache-jgit tests be? Do they have a concept of using reference repositories from local(ly seen) filesystem? I understand that can at least serve for non-regression - that new methods/logic would not explode with that git implementation, but is there more to it practically?

jimklimov avatar Sep 07 '22 08:09 jimklimov

Also while here, I've had a nagging thought in the back of my mind, that while the feature development offered several strategies for layout and discovery of fanned-out reference repositories (by URL path basenames, hash of URL, by submodules, etc.), in Jenkins setups I track I personally use one case that is best automated (with those helper scripts in my other project).

I am not convinced if the other strategies have benefits or downsides, just thought they could be useful (and were easier to implement for initial steps), or if some teams would prefer them over something known to work (maybe better, and automated almost out of the box) just as a human nuance :) Still, in some deployments with many similarly named repos there is a screenful of logs while it discovers the right subdirectory, so other methods might be quieter and faster if the cache area is groomed carefully.

So questions here would be:

  • Should the helper scripts from https://github.com/jimklimov/git-refrepo-scripts get somehow hosted at Jenkins GitHub (as resources in this plugin, or as some sidekick repo) for a better out-of-the-box experience?
    • Beside the script for managing reference repo fanout, it includes sample ready to use Jenkinsfile jobs to discover SCMs used by other job runs and so to register new Git sources with the shell script, and another to regularly update the cache. For Git CLI users it also includes a git-clone-rr command that takes advantage of such GIT_REFERENCE_REPO_DIR.
  • Should I rip out the other strategies or is it OK to keep them even if my own use does not stress them so I am not likely to encounter bugs (if any) IRL?
    • pros of ripping out - less code to review, test and maintain; fewer places to break
    • cons of ripping out - this functionality may be preferred by some teams and might be better optimized (so arguably a bug or lack of feature in git-refrepo-scripts which does not maintain the fan-out location in a manner friendly and optimal for those strategies)

And one thing to note, I've probably stressed it before - git-client plugin runs on an agent ("built-in", some swarm/ssh agent on same machine as the controller, or truly remote), so each such agent should have access to refrepo location(s).

  • To benefit from the reference repositories during clone operations (hm, is fetch/pull impacted?), each such machine which directly runs the git operation should have (or see over NFS/CIFS/bind-mounts/...) the reference repository directory. This is quite doable, technical details vary with agent tech and CI farm trade-offs involved.
  • If each worker checks out code independently of others, they may have refrepos at different locations (or absent altogether) and announced by agent-wide environment variables or other in-job settings.
  • If you copy (stash/unstash) workspaces from worker to worker, you might need to decouple them from a reference repository so they are usable on another host that does not see the same instance of the refrepo at same pathname. One example can be seen here: https://github.com/networkupstools/jenkins-dynamatrix/blob/ef0111c8c672ed860c9d5293c257582358810f40/src/org/nut/dynamatrix/DynamatrixStash.groovy#L386
  • Otherwise, git clone operations done on a machine that does not see a refrepo at all are benign (absence is noted, refrepo feature not used, full-size checkout from original source should happen).
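For the stash/unstash case in the list above, one standard way to decouple a workspace from its reference repository is to copy the borrowed objects locally and then drop the alternates file, which is the same end state that git clone --dissociate produces. The sketch below is an illustration with hypothetical names, not the code from the linked Groovy file; it assumes a non-bare workspace with a .git directory and a git executable on PATH.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical sketch: make a workspace independent of its reference repository. */
class RefRepoDissociate {
    /** Where git records the object stores it borrows from (non-bare layout assumed). */
    static Path alternatesFile(Path workspace) {
        return workspace.resolve(".git").resolve("objects").resolve("info").resolve("alternates");
    }

    /** True if the workspace still depends on objects stored elsewhere. */
    static boolean borrowsObjects(Path workspace) {
        return Files.isRegularFile(alternatesFile(workspace));
    }

    /** Repack everything into the workspace (including borrowed objects), then stop borrowing. */
    static void dissociate(Path workspace) throws IOException, InterruptedException {
        if (!borrowsObjects(workspace)) {
            return; // nothing to do, the clone is already self-contained
        }
        Process repack = new ProcessBuilder("git", "repack", "-a", "-d")
                .directory(workspace.toFile())
                .inheritIO()
                .start();
        if (repack.waitFor() != 0) {
            throw new IOException("git repack failed in " + workspace);
        }
        // Only remove the link after the local copies exist, otherwise objects would go missing
        Files.delete(alternatesFile(workspace));
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 1) {
            dissociate(Path.of(args[0]));
        }
    }
}
```

The trade-off is that the stashed workspace grows to its full size, so this only makes sense right before copying it to a host that cannot see the refrepo.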

jimklimov avatar Sep 07 '22 09:09 jimklimov

Recent re-run of tests https://github.com/jenkinsci/git-client-plugin/pull/644/checks?check_run_id=8950094793 failed with a cool run-time issue (unrelated to PR contents):

org.jenkinsci.plugins.gitclient.CliGitAPIImplTest.testSubmoduleUpdateWithThreads

Cloning into bare repository 'C:\Jenkins\workspace\Plugins_git-client-plugin_PR-644\.\target\clone-744732476378843854'...
fatal: unable to access 'https://github.com/jenkinsci/git-client-plugin/': Could not resolve host: github.com
 expected:<0> but was:<128>

jimklimov avatar Oct 18 '22 10:10 jimklimov