
mx gate always clones full history of dependencies

Open fniephaus opened this issue 7 years ago • 22 comments

I'm running mx gate on my CI infrastructure, and my project depends on graal, which has grown a lot recently. Consequently, it takes quite some time to clone the entire history. It'd be great to have an option to tell mx not to clone the full history. I found this related TODO in the code. Any objections to an option like --git-depth=1 or --fast-clone?

Best, Fabio

fniephaus avatar Apr 21 '18 18:04 fniephaus

Hello, I assume you use a normal mx suite import in suite.py with a pinned revision (like https://github.com/oracle/truffleruby/blob/533f430d8a2eed49390442ab303996894e4e24a7/mx.truffleruby/suite.py#L7-L15). In that case, I think it's not easy to know which git --depth would be needed: the imported revision might be well behind master, and it wouldn't resolve if --depth is not large enough.

eregon avatar Apr 21 '18 20:04 eregon

@eregon that's right. However, it's also possible to fetch just one specific commit (see here).

fniephaus avatar Apr 21 '18 20:04 fniephaus

That doesn't seem to work for me (trying commit 56081af2d974b1650181f3d8496e1dd90a6a583b of mx):

$ mkdir mx
$ cd mx
$ git init 
Initialized empty Git repository in /home/eregon/mx/.git/
$ git remote add origin https://github.com/graalvm/mx.git
$ git fetch origin 56081af2d974b1650181f3d8496e1dd90a6a583b
error: Server does not allow request for unadvertised object 56081af2d974b1650181f3d8496e1dd90a6a583b
zsh: exit 128

And it seems very unreliable from the comments on StackOverflow as well unfortunately.

git clone --depth 1 --branch BRANCH_OR_TAG works well IIRC, but only with branches and tags, not arbitrary commit SHA-1s.

eregon avatar Apr 21 '18 21:04 eregon

@eregon oh, you're right. Well, maybe it'd be good enough to support branches and tags the way you mentioned as well as GitHub. Instead of cloning, mx could just download: https://github.com/{repo-slug}/archive/{commit}.zip

Example: https://github.com/graalvm/mx/archive/56081af2d974b1650181f3d8496e1dd90a6a583b.zip
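
A minimal sketch of how such an archive URL could be built, assuming the URL pattern above; the helper name github_archive_url is hypothetical and not part of mx. GitHub serves the same archive as .tar.gz too, and neither form contains the .git directory:

```python
def github_archive_url(repo_slug, commit, fmt="zip"):
    """Build a GitHub source-archive URL for one commit.

    Hypothetical helper for illustration only; it is not part of mx.
    The downloaded archive holds the source tree without any git history.
    """
    if fmt not in ("zip", "tar.gz"):
        raise ValueError("unsupported archive format: " + fmt)
    return "https://github.com/{}/archive/{}.{}".format(repo_slug, commit, fmt)


# Reproduces the example URL for the mx commit discussed in this thread.
url = github_archive_url("graalvm/mx", "56081af2d974b1650181f3d8496e1dd90a6a583b")
```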

fniephaus avatar Apr 21 '18 21:04 fniephaus

For branches it should be as easy as:

diff --git a/mx.py b/mx.py
index a69e55e..d876123 100755
--- a/mx.py
+++ b/mx.py
@@ -6298,7 +6298,7 @@ class GitConfig(VC):
             cmd += ['--no-checkout', '--shared', '--origin', 'cache', '-c', 'gc.auto=0', '-c', 'remote.cache.fetch=+refs/remotes/' + hashed_url + '/*:refs/remotes/cache/*', '-c', 'remote.origin.url=' + url, cache]
         else:
             if branch:
-                cmd += ['--branch', branch]
+                cmd += ['--depth', '1', '--branch', branch]
             if self.object_cache_mode:
                 cache = self._local_cache_repo()
                 log("Fetch from " + url + " into cache " + cache)

Do you think there are cases where using --depth 1 in combination with --branch <branch> could cause an issue?
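
For illustration, the same idea as a standalone sketch that only builds the command line. The function name git_clone_cmd and its shallow parameter are hypothetical; this is not how mx.py is actually structured. As in the diff, the depth limit is applied only when a branch is requested, since a pinned revision may lie outside a depth-1 history:

```python
def git_clone_cmd(url, dest, branch=None, shallow=False):
    """Return the argument list for a git clone (hypothetical helper, not mx code)."""
    cmd = ["git", "clone"]
    if branch:
        if shallow:
            # Only the tip commit of the requested branch or tag is fetched.
            cmd += ["--depth", "1"]
        cmd += ["--branch", branch]
    # Without a branch the clone stays full, so any pinned revision can be resolved.
    cmd += [url, dest]
    return cmd


shallow_cmd = git_clone_cmd(
    "https://github.com/oracle/graal.git", "graal", branch="vm-19.3.0", shallow=True
)
```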

zakkak avatar Feb 22 '19 03:02 zakkak

@fniephaus you may use MX_GIT_CACHE as well to speed up your CI pipelines.

  MX_GIT_CACHE          Use a cache for git objects during clones.
                         * Setting it to `reference` will clone repositories using the cache and let them
                           reference the cache (if the cache gets deleted these repositories will be
                           incomplete).
                         * Setting it to `dissociated` will clone using the cache but then dissociate the
                           repository from the cache.
                         * Setting it to `refcache` will synchronize with server only if a branch is
                           requested or if a specific revision is requested which does not exist in the
                           local cache. Hence, remote references will be synchronized occasionally. This
                           allows cloning without even contacting the git server.
                        The cache is located at `~/.mx/git-cache`.

zakkak avatar Feb 22 '19 05:02 zakkak

I don't understand how MX_GIT_CACHE would help speed up CI builds on public infrastructure that doesn't keep state between builds. MX_GIT_CACHE uses a cache on disk, and on a fresh system that cache always needs to be downloaded first. Cloning from scratch is probably faster than that, or am I missing something?

fniephaus avatar Feb 22 '19 14:02 fniephaus

Just in case, double check if your CI infrastructure allows for caching content, e.g. https://docs.travis-ci.com/user/caching/

boris-spas avatar Feb 22 '19 16:02 boris-spas

@fniephaus yes, if caching is not possible on your CI setup, then MX_GIT_CACHE doesn't make sense. For Maxine we are using Jenkins and docker agents. To make sure we don't fetch all the dependencies for every job, we mount the host's ~/.mx to ~/.mx in the docker containers. With MX_GIT_CACHE we are able to further decrease job times by ~2 min.

zakkak avatar Feb 22 '19 19:02 zakkak

@boris-spas I know that TravisCI supports caching...but like I said before, build performance is unlikely to improve because of it. It's only useful when caching stuff that takes some time to compute (e.g. custom built third party libraries). Not cloning the entire commit history, on the other hand, does improve the time to git clone.

fniephaus avatar Feb 24 '19 06:02 fniephaus

In the meantime, I came up with this workaround (cloning only the required commit before running mx). This seems to save approx. 3min per build job on Travis CI.

fniephaus avatar Nov 15 '19 08:11 fniephaus

> @boris-spas I know that TravisCI supports caching...but like I said before, build performance is unlikely to improve because of it. It's only useful when caching stuff that takes some time to compute (e.g. custom built third party libraries). Not cloning the entire commit history, on the other hand, does improve the time to git clone.

@fniephaus MX_GIT_CACHE is only useful to cache dependencies defined in your mx suite, so it doesn't have to fetch them all over again. If in your mx suite you define graal as a dependency and you use MX_GIT_CACHE (along with CI caching), then new builds won't need to clone graal every time. Not sure how well it works with dependencies where versions change often (and thus the cached repo needs to be updated).

> In the meantime, I came up with this workaround (cloning only the required commit before running mx). This seems to save approx. 3min per build job on Travis CI.

Thanks for sharing.

zakkak avatar Nov 15 '19 10:11 zakkak

I noticed as well that cloning graal in TravisCI takes a very long time (250s = 4min10s): https://travis-ci.org/oracle/truffleruby/builds/618237461

I think it would be useful to integrate @fniephaus's workaround in mx, what do you think @dougxc ?

eregon avatar Nov 28 '19 17:11 eregon

So this just does a shallow clone? That would cause problems for mx operations on imported suites that need a full history. I don't know off the top of my head if the Travis gate includes such operations.

dougxc avatar Nov 28 '19 17:11 dougxc

This could just be an option (e.g. --shallow-clone) in case one doesn't need the full history for cloned suites.

fniephaus avatar Nov 28 '19 17:11 fniephaus

Ok, sounds reasonable.

dougxc avatar Nov 28 '19 17:11 dougxc

Unfortunately, the git fetch origin --depth 1 "${TRUFFLE_COMMIT}" approach only works if the commit in question corresponds to a tag or a branch, at least for GitHub:

$ git init
Initialized empty Git repository in /home/eregon/tmp/empty/graal1/.git/
$ git remote add origin https://github.com/oracle/graal.git

# Current import of Truffle in TruffleRuby
$ git fetch origin --depth 1 aee967f1d90e1f032b266c62d10fc8c32a805fb8
error: Server does not allow request for unadvertised object aee967f1d90e1f032b266c62d10fc8c32a805fb8
zsh: exit 128   git fetch origin --depth 1 aee967f1d90e1f032b266c62d10fc8c32a805fb8

# Current master at the time of writing:
$ git fetch origin --depth 1 cdcc1dbd9d5370b86b9fcf66401110d1b69d783e
remote: Enumerating objects: 12972, done.
remote: Counting objects: 100% (12972/12972), done.
remote: Compressing objects: 100% (7919/7919), done.
remote: Total 12972 (delta 6320), reused 6281 (delta 3284), pack-reused 0
Receiving objects: 100% (12972/12972), 13.83 MiB | 8.58 MiB/s, done.
Resolving deltas: 100% (6320/6320), done.
From https://github.com/oracle/graal
 * branch            cdcc1dbd9d5370b86b9fcf66401110d1b69d783e -> FETCH_HEAD
$ du -hs .git
15M	.git

So that approach doesn't seem different from git clone --branch vm-19.3.0 --depth 1 https://github.com/oracle/graal.git, except that it allows passing the commit at the tip of a branch or tag instead of the branch or tag name.

The size of .git of graal after a full clone is 1.1G nowadays, which I guess is the main reason why it is slow to clone.
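
Given that the server may refuse a depth-1 fetch of an arbitrary commit, one possible way to handle it is to try the shallow fetch first and fall back to a full fetch. This is only a sketch: fetch_revision and its run parameter are hypothetical names, and it assumes the origin remote is already configured in repo_dir:

```python
import subprocess


def fetch_revision(repo_dir, revision, run=subprocess.call):
    """Fetch `revision` as cheaply as the server allows (hypothetical helper, not mx code).

    `run` takes an argument list and returns the exit code; it is injectable
    so the fallback logic can be exercised without network access.
    """
    shallow = ["git", "-C", repo_dir, "fetch", "origin", "--depth", "1", revision]
    if run(shallow) == 0:
        return "shallow"
    # The server refused the request (e.g. an unadvertised object), so fetch everything.
    full = ["git", "-C", repo_dir, "fetch", "origin"]
    if run(full) != 0:
        raise RuntimeError("git fetch failed in " + repo_dir)
    return "full"
```

After either path, the caller would still need to check out the requested revision.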

eregon avatar Nov 30 '19 10:11 eregon

Alternatively, it'd also be possible to download a zip from GitHub, for example https://github.com/oracle/graal/archive/cdcc1dbd9d5370b86b9fcf66401110d1b69d783e.zip.

fniephaus avatar Nov 30 '19 10:11 fniephaus

> Alternatively, it'd also be possible to download a zip from GitHub, for example https://github.com/oracle/graal/archive/cdcc1dbd9d5370b86b9fcf66401110d1b69d783e.zip.

Interesting, there are also .tar.gz and they're only ~12MB in size (the .zip around 25MB). But of course those don't include the .git, which a few things might rely on.

Here is a quick try to let mx build work without the git history: https://github.com/graalvm/mx/compare/master...eregon:support-no-vc

eregon avatar Nov 30 '19 10:11 eregon

Here is another idea, based on https://stackoverflow.com/a/43926596/388803 and git fetch --shallow-since=DATE. For the date, we could use the last time the graal import was modified in suite.py if we always update to the latest graal, or add a few days of margin if not.

That seems quite fast, 7s + 9s (instead of 243s = 4min) for:

git clone --depth 1 https://github.com/oracle/graal.git ../graal
git -C ../graal fetch --shallow-since=05/11/2019

https://travis-ci.org/eregon/truffleruby/builds/618889445
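
A rough sketch of how the date with a margin could be computed and turned into those two commands. The function name shallow_since_cmds and the default margin of 3 days are assumptions for illustration. The date is emitted in ISO form (YYYY-MM-DD), which avoids the day/month ambiguity of a form like 05/11/2019:

```python
from datetime import date, timedelta


def shallow_since_cmds(url, dest, import_date, margin_days=3):
    """Build the clone and deepen commands (hypothetical helper, not mx code).

    `import_date` is the day the import in suite.py was last changed;
    `margin_days` widens the window in case the pinned commit is older than that.
    """
    since = (import_date - timedelta(days=margin_days)).isoformat()
    return [
        # Start with only the tip of the default branch.
        ["git", "clone", "--depth", "1", url, dest],
        # Then deepen the history back to the chosen date.
        ["git", "-C", dest, "fetch", "--shallow-since=" + since],
    ]


cmds = shallow_since_cmds(
    "https://github.com/oracle/graal.git", "../graal", date(2019, 11, 5)
)
```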

eregon avatar Nov 30 '19 10:11 eregon

FWIW, I tried a git gc --aggressive in the graal repository:

$ du -hs .git
1.1G	.git
$ git gc --aggressive 
...
22:04.69 total
$ du -hs .git
446M	.git

So if we could convince GitHub to do something similar, it might help quite a bit too. A plain git gc, on the other hand, did not help.

eregon avatar Nov 30 '19 10:11 eregon

In GitHub Actions ubuntu-latest: git clone https://github.com/oracle/graal.git (or mx sforceimports) takes 2min 36s.

actions/checkout is a huge help there:

    - name: Clone Graal
      uses: actions/checkout@v2
      with:
        repository: oracle/graal
        path: graal
        fetch-depth: 0 # unlimited
    - run: mv graal ..

takes 17 seconds.

eregon avatar Jan 14 '20 18:01 eregon