Enable multiple git revisions (Jun 2024) : final, using commit ID in repo localPath
This PR is spawned out of #4659, of which it represents its evolution and finalisation.
Some highlights:
- handles multiple revisions by storing them in
localPathdefined via thecommitId, for more accurate handling of branch updates - uses a local bare repository under the hood to both get revision->commit mapping and to pull required revisions
- stores revision->commit mapping into a dedicated
revisionMapfile, for accurate handling of user requested revisions vs revision update in remote repo - maintains original API for
AssetManagerclass constructor; a dedicated additional method,setRevisionAndLocalPath, has to be called after instantiation to setlocalPathandrevision - docs and unit tests updated
Some implementation notes:
- when dropping a revision, all revisions pointing to the same commit are also dropped
- repo directory layout:
projectNamedetermines location ofBARE_REPO,REVISION_MAPandREVISION_SUBDIRlocalPathis underREVISION_SUBDIRand requires revision-to-commit mapping ; for unit testing, redefine thelocalPathmanually
Some caveats:
- to be improved: handling of
revisionMapfile: I/O concurrency, refactor to a dedicated class (similar toScriptFile), others? - for now, default branch appear duplicated in revisionMap and hence in the
listoutput:- files are never duplicated thanks to
commitId-basedlocalPath - however running with no revision or with default revision may point to different commit IDs, if only one is updated to latest
- files are never duplicated thanks to
- in
AssetManager, some non-constructor methods have changed API, typically to get rid of the now un-neededrevisionargument- examples include
download()andclone() - it should be easy to restore these if needed
- examples include
- (minor) tedious Warning messaging (missing manifest) when pulling tags or non-existing revisions
- this behaviour is pre-existing to this PR (only when first pull of a project is done requesting a tag)
- it is seen more often, because every pull of a new revision is a fresh pull
@pditommaso for visibility.
Deploy Preview for nextflow-docs-staging canceled.
| Name | Link |
|---|---|
| Latest commit | e8e7c5bbb499cac4dd443786aa20bf055e874bf7 |
| Latest deploy log | https://app.netlify.com/projects/nextflow-docs-staging/deploys/6929baf75a3770000897b2fe |
Note: some commits were not signed-off (sigh) - leaving as is for now
Extra note on the method AssetManager.download() - corner case.
It is worthwhile doing some tests on what happens in the following corner case:
- both a revision and branch/commit are pulled, which are pointing to the same commit,
- then the branch gets new commits in the remote,
- then the pulls are updated.
It is worth testing against distinct pull orders:
- tag/commit -> branch -> [update remote branch] -> tag/commit
- branch -> tag/commit -> [update remote branch] -> tag/commit
If this code block works as expected, it should be all fine.
I will see if I can make the test soon, and in case will update this thread, otherwise I will leave it for later.
[UPDATE] done some testing, the code block mentioned above indeed does its job, hence this corner case is OK.
PS: one learning I made here is that, when a tag is requested, .resolve + .getName in JGit return a sort of aliased commit, i.e. a commit that does not exist in the repo history, but that points to the correct real commit that corresponds to the requested tag.
[UPDATE] this issue exists already with Nextflow stable release, it was not introduced by this PR.
Issue with a test repo which has lots of tags:
On first pull all good, however see how the commit is made to refer via distance from a tag, tag4~17:
$ NXF_HOME=$(pwd)/HOME_PR ./launch.sh pull -r 451ebd9dcb56045d80963945305811aa65f413d0 marcodelapierre/hello-nf
Checking marcodelapierre/hello-nf:451ebd9dcb56045d80963945305811aa65f413d0 ...
WARN: Cannot read project manifest -- Cause: Remote resource not found: https://api.github.com/repos/marcodelapierre/hello-nf/contents/nextflow.config?ref=451ebd9dcb56045d80963945305811aa65f413d0
downloaded from https://github.com/marcodelapierre/hello-nf.git - revision: 451ebd9dcb [tag4~17]
On repeated pull:
$ NXF_HOME=$(pwd)/HOME_PR ./launch.sh pull -r 451ebd9dcb56045d80963945305811aa65f413d0 marcodelapierre/hello-nf
Checking marcodelapierre/hello-nf:451ebd9dcb56045d80963945305811aa65f413d0 ...
Remote origin did not advertise Ref for branch refs/tags/tag4~17. This Ref may not exist in the remote or may be hidden by permission settings.
In good contrast, for a repo with not as many tags, all good:
$ NXF_HOME=$(pwd)/HOME_PR ./launch.sh pull -r 3650d7b8371c255a0a89ed5f2e07e2e5c1ed29ae hello
Checking nextflow-io/hello:3650d7b8371c255a0a89ed5f2e07e2e5c1ed29ae ...
downloaded from https://github.com/nextflow-io/hello.git - revision: 3650d7b8371c255a0a89ed5f2e07e2e5c1ed29ae
$ NXF_HOME=$(pwd)/HOME_PR ./launch.sh pull -r 3650d7b8371c255a0a89ed5f2e07e2e5c1ed29ae hello
Checking nextflow-io/hello:3650d7b8371c255a0a89ed5f2e07e2e5c1ed29ae ...
Already-up-to-date - revision: 3650d7b8371c255a0a89ed5f2e07e2e5c1ed29ae
I have updated the branch to master, resolving conflicts and updating some tests. I have also reviewed the branch.
Overall, it works well and keeps compatibility with older versions because it uses <project_path>.nextflow to store the bare repo and commits. Apart from the caveats commented in the PR, I have found the following ones:
-
Unnecessary full clones: For every
commitIdfolder, it is reusing the download method that is doing a full clone. There is no need to do a full clone, it should just include the branch to remove. I think it is because the default branch was difficult to get. However, in bare repos, it is store in the HEAD file and the reference is easily obtained by getting the HEAD ref of the bare repo. -
Use commit clones to get the list of Branches and Tags. Related to the previous comment, the
AssetManager.localPathproperty is set to<project_path>.nextflow/commits/<commitId>and the git repository used to get the list of branches and tags is using this folder. This information is also in the bare repo. No reason to get from localPath. -
Drop command has an issue with commits that are not in the revisions map file. For instance, an update of a branch that has been pulled before creates a kind of orphan folder with the previous commit id. When I try to remove it using the commit Id, drop fails.
-
Drop with -
alloption is iterating the revisions map, creating assets manager for each revision and removing one by one. Finally, it removes the whole project folder, so this iteration is unnecessary, we can direcly remote the directory
TODO:
- [x] Remove full clones in commit folders
- [x] Get HEAD, branches and tags from bare repo
- [x] Fix drop issues
- [x] Remove default branch duplication
- [x] fix tag problems
- [ ] Address concurrent pulls of the same revision (file lock?)
Fixed most of TODOs:
-
The tag issue mentioned https://github.com/nextflow-io/nextflow/pull/5089#issuecomment-2191578880 is produced by
git.repository.nameRevmethod that returns this strange<tag_name>~<number>when a commit is close to a tag. -
I think the revisionMap file where we keep the downloaded revision-to-commit could be removed as this can be done locally by querying the bare repo. The only part the we could be more complicated is knowing which revisions are downloaded in the
listcommand. But this information is not provided inmaster
Hola Jorge,
let me share a remark with you on the revisionMap file.
There is a difference between using the bare repo vs a self-managed revision map file in how updated revisions can be handled, with regards to user control:
- With the bare repo, every time the user requests to update a single revision, all of the other downloaded revisions will forcibly have an updated revision -> commit mapping;
- With a revision map file in the other hand, the user retains full control of which individual revisions -> commits maps to update.
Originally I went for option 2. with this PR, to give users more control. Ultimately 1. is also an option, as long as the behaviour is documented for the users.
Thanks for moving this forward!
@marcodelapierre. Thank you for the remark.
If I am not wrong, the update is caused by checkBareRepo that is fetching all branches/tags everytime that is invoked. I think it could be modifiy to the update to the requested revision by just fetching of the requested revision. So, both implementation could be qeuivalent and we shouldn't maintain the revisions map file.
In both cases, with different users sharing the assets cache, you can not ensure that another user has updated either the repo or the map from your last execution. As commented in other commits, the use of branches can not guarantee reproducibility. happening now.
I have also created an alternative way to create the commit clones. For each commit, it creates a repo in the commit folder where objects are shared with the Bare repo using the objects/info/alternates. It is the same as using the --shared option in the git clone command. With this option, there are no replicated git objects. It just replicates the worktree files and a minimum .git folder (~80K) that stores the references to the alternates, current branch, etc.
I have implemented it in a separate method. I am also wondering if it will be better to implement the AssetsManager changes as a subclass (such as BareAssetsManager) instead of directly modifying it. I am not sure which other components are using the AssetsManager. Implementing the changes as a subclass will not affect other components. @Ben, @pditommaso do you think it is a good approach? or do you prefer to keep it as a modification of the AssetsManager class as it is now
@marcodelapierre. Thank you for the remark. If I am not wrong, the update is caused by
checkBareRepothat is fetching all branches/tags everytime that is invoked. I think it could be modifiy to the update to the requested revision by just fetching of the requested revision. So, both implementation could be qeuivalent and we shouldn't maintain the revisions map file. In both cases, with different users sharing the assets cache, you can not ensure that another user has updated either the repo or the map from your last execution. As commented in other commits, the use of branches can not guarantee reproducibility. happening now.
Thanks a lot Jorge, I wasn't aware of options to update selected revisions in the bare repo, now it makes total sense to me! :)
Closing in favor of #6620