nextflow icon indicating copy to clipboard operation
nextflow copied to clipboard

Enable multiple git revisions (Jun 2024) : final, using commit ID in repo localPath

Open marcodelapierre opened this issue 1 year ago • 4 comments

This PR is spawned out of #4659, of which it represents its evolution and finalisation.

Some highlights:

  • handles multiple revisions by storing them in localPath defined via the commitId, for more accurate handling of branch updates
  • uses a local bare repository under the hood to both get revision->commit mapping and to pull required revisions
  • stores revision->commit mapping into a dedicated revisionMap file, for accurate handling of user requested revisions vs revision update in remote repo
  • maintains original API for AssetManager class constructor; a dedicated additional method, setRevisionAndLocalPath, has to be called after instantiation to set localPath and revision
  • docs and unit tests updated

Some implementation notes:

  • when dropping a revision, all revisions pointing to the same commit are also dropped
  • repo directory layout:
    • projectName determines location of BARE_REPO, REVISION_MAP and REVISION_SUBDIR
    • localPath is under REVISION_SUBDIR and requires revision-to-commit mapping ; for unit testing, redefine the localPath manually

Some caveats:

  • to be improved: handling of revisionMap file: I/O concurrency, refactor to a dedicated class (similar to ScriptFile), others?
  • for now, default branch appear duplicated in revisionMap and hence in the list output:
    • files are never duplicated thanks to commitId-based localPath
    • however running with no revision or with default revision may point to different commit IDs, if only one is updated to latest
  • in AssetManager, some non-constructor methods have changed API, typically to get rid of the now un-needed revision argument
    • examples include download() and clone()
    • it should be easy to restore these if needed
  • (minor) tedious Warning messaging (missing manifest) when pulling tags or non-existing revisions
    • this behaviour is pre-existing to this PR (only when first pull of a project is done requesting a tag)
    • it is seen more often, because every pull of a new revision is a fresh pull

@pditommaso for visibility.

marcodelapierre avatar Jun 25 '24 15:06 marcodelapierre

Deploy Preview for nextflow-docs-staging canceled.

Name Link
Latest commit e8e7c5bbb499cac4dd443786aa20bf055e874bf7
Latest deploy log https://app.netlify.com/projects/nextflow-docs-staging/deploys/6929baf75a3770000897b2fe

netlify[bot] avatar Jun 25 '24 15:06 netlify[bot]

Note: some commits were not signed-off (sigh) - leaving as is for now

marcodelapierre avatar Jun 25 '24 15:06 marcodelapierre

Extra note on the method AssetManager.download() - corner case.

It is worthwhile doing some tests on what happens in the following corner case:

  • both a revision and branch/commit are pulled, which are pointing to the same commit,
  • then the branch gets new commits in the remote,
  • then the pulls are updated.

It is worth testing against distinct pull orders:

  • tag/commit -> branch -> [update remote branch] -> tag/commit
  • branch -> tag/commit -> [update remote branch] -> tag/commit

If this code block works as expected, it should be all fine.

I will see if I can make the test soon, and in case will update this thread, otherwise I will leave it for later.

[UPDATE] done some testing, the code block mentioned above indeed does its job, hence this corner case is OK.

PS: one learning I made here is that, when a tag is requested, .resolve + .getName in JGit return a sort of aliased commit, i.e. a commit that does not exist in the repo history, but that points to the correct real commit that corresponds to the requested tag.

marcodelapierre avatar Jun 26 '24 09:06 marcodelapierre

[UPDATE] this issue exists already with Nextflow stable release, it was not introduced by this PR.

Issue with a test repo which has lots of tags:

On first pull all good, however see how the commit is made to refer via distance from a tag, tag4~17:

$ NXF_HOME=$(pwd)/HOME_PR  ./launch.sh pull -r 451ebd9dcb56045d80963945305811aa65f413d0 marcodelapierre/hello-nf

Checking marcodelapierre/hello-nf:451ebd9dcb56045d80963945305811aa65f413d0 ...
WARN: Cannot read project manifest -- Cause: Remote resource not found: https://api.github.com/repos/marcodelapierre/hello-nf/contents/nextflow.config?ref=451ebd9dcb56045d80963945305811aa65f413d0
 downloaded from https://github.com/marcodelapierre/hello-nf.git - revision: 451ebd9dcb [tag4~17]

On repeated pull:

$ NXF_HOME=$(pwd)/HOME_PR  ./launch.sh pull -r 451ebd9dcb56045d80963945305811aa65f413d0 marcodelapierre/hello-nf

Checking marcodelapierre/hello-nf:451ebd9dcb56045d80963945305811aa65f413d0 ...
Remote origin did not advertise Ref for branch refs/tags/tag4~17. This Ref may not exist in the remote or may be hidden by permission settings.

In good contrast, for a repo with not as many tags, all good:

$ NXF_HOME=$(pwd)/HOME_PR ./launch.sh pull -r 3650d7b8371c255a0a89ed5f2e07e2e5c1ed29ae hello
Checking nextflow-io/hello:3650d7b8371c255a0a89ed5f2e07e2e5c1ed29ae ...
 downloaded from https://github.com/nextflow-io/hello.git - revision: 3650d7b8371c255a0a89ed5f2e07e2e5c1ed29ae

$ NXF_HOME=$(pwd)/HOME_PR ./launch.sh pull -r 3650d7b8371c255a0a89ed5f2e07e2e5c1ed29ae hello
Checking nextflow-io/hello:3650d7b8371c255a0a89ed5f2e07e2e5c1ed29ae ...
 Already-up-to-date - revision: 3650d7b8371c255a0a89ed5f2e07e2e5c1ed29ae

marcodelapierre avatar Jun 26 '24 12:06 marcodelapierre

I have updated the branch to master, resolving conflicts and updating some tests. I have also reviewed the branch.
Overall, it works well and keeps compatibility with older versions because it uses <project_path>.nextflow to store the bare repo and commits. Apart from the caveats commented in the PR, I have found the following ones:

  • Unnecessary full clones: For every commitId folder, it is reusing the download method that is doing a full clone. There is no need to do a full clone, it should just include the branch to remove. I think it is because the default branch was difficult to get. However, in bare repos, it is store in the HEAD file and the reference is easily obtained by getting the HEAD ref of the bare repo.

  • Use commit clones to get the list of Branches and Tags. Related to the previous comment, the AssetManager.localPath property is set to <project_path>.nextflow/commits/<commitId> and the git repository used to get the list of branches and tags is using this folder. This information is also in the bare repo. No reason to get from localPath.

  • Drop command has an issue with commits that are not in the revisions map file. For instance, an update of a branch that has been pulled before creates a kind of orphan folder with the previous commit id. When I try to remove it using the commit Id, drop fails.

  • Drop with -all option is iterating the revisions map, creating assets manager for each revision and removing one by one. Finally, it removes the whole project folder, so this iteration is unnecessary, we can direcly remote the directory

TODO:

  • [x] Remove full clones in commit folders
  • [x] Get HEAD, branches and tags from bare repo
  • [x] Fix drop issues
  • [x] Remove default branch duplication
  • [x] fix tag problems
  • [ ] Address concurrent pulls of the same revision (file lock?)

jorgee avatar Nov 26 '25 11:11 jorgee

Fixed most of TODOs:

  • The tag issue mentioned https://github.com/nextflow-io/nextflow/pull/5089#issuecomment-2191578880 is produced by git.repository.nameRev method that returns this strange <tag_name>~<number> when a commit is close to a tag.

  • I think the revisionMap file where we keep the downloaded revision-to-commit could be removed as this can be done locally by querying the bare repo. The only part the we could be more complicated is knowing which revisions are downloaded in the list command. But this information is not provided in master

jorgee avatar Nov 27 '25 15:11 jorgee

Hola Jorge, let me share a remark with you on the revisionMap file.

There is a difference between using the bare repo vs a self-managed revision map file in how updated revisions can be handled, with regards to user control:

  1. With the bare repo, every time the user requests to update a single revision, all of the other downloaded revisions will forcibly have an updated revision -> commit mapping;
  2. With a revision map file in the other hand, the user retains full control of which individual revisions -> commits maps to update.

Originally I went for option 2. with this PR, to give users more control. Ultimately 1. is also an option, as long as the behaviour is documented for the users.

Thanks for moving this forward!

marcodelapierre avatar Nov 28 '25 02:11 marcodelapierre

@marcodelapierre. Thank you for the remark. If I am not wrong, the update is caused by checkBareRepo that is fetching all branches/tags everytime that is invoked. I think it could be modifiy to the update to the requested revision by just fetching of the requested revision. So, both implementation could be qeuivalent and we shouldn't maintain the revisions map file. In both cases, with different users sharing the assets cache, you can not ensure that another user has updated either the repo or the map from your last execution. As commented in other commits, the use of branches can not guarantee reproducibility. happening now.

jorgee avatar Nov 28 '25 15:11 jorgee

I have also created an alternative way to create the commit clones. For each commit, it creates a repo in the commit folder where objects are shared with the Bare repo using the objects/info/alternates. It is the same as using the --shared option in the git clone command. With this option, there are no replicated git objects. It just replicates the worktree files and a minimum .git folder (~80K) that stores the references to the alternates, current branch, etc.

I have implemented it in a separate method. I am also wondering if it will be better to implement the AssetsManager changes as a subclass (such as BareAssetsManager) instead of directly modifying it. I am not sure which other components are using the AssetsManager. Implementing the changes as a subclass will not affect other components. @Ben, @pditommaso do you think it is a good approach? or do you prefer to keep it as a modification of the AssetsManager class as it is now

jorgee avatar Nov 28 '25 15:11 jorgee

@marcodelapierre. Thank you for the remark. If I am not wrong, the update is caused by checkBareRepo that is fetching all branches/tags everytime that is invoked. I think it could be modifiy to the update to the requested revision by just fetching of the requested revision. So, both implementation could be qeuivalent and we shouldn't maintain the revisions map file. In both cases, with different users sharing the assets cache, you can not ensure that another user has updated either the repo or the map from your last execution. As commented in other commits, the use of branches can not guarantee reproducibility. happening now.

Thanks a lot Jorge, I wasn't aware of options to update selected revisions in the bare repo, now it makes total sense to me! :)

marcodelapierre avatar Dec 01 '25 01:12 marcodelapierre

Closing in favor of #6620

bentsherman avatar Dec 03 '25 15:12 bentsherman