copier
copier copied to clipboard
perf(updating): avoid creating subproject copy
I've performed an internal change to the update algorithm to avoid creating a subproject copy, which turns out to be a performance regression introduced by #1587.
As of #1587, the update algorithm computes a diff between the dirty HEAD of the subproject's Git repository – dirty because of potentially staged files during a pre-migration task – and a fresh copy based on the old version of the template. Before this PR, a full copy of the subproject directory (excluding the .git/ directory) was created and the copy was initialized as a new Git repository with one commit comprising all files. For large subprojects, creating a subproject copy incurs non-negligible overhead as reported in #1709.
This PR takes a different approach that does not involve copying the subproject; in fact, it does not even involve fetching shallow copies of local Git repositories. First, a reference of the dirty HEAD of the subproject (after the pre-migration task has run) is obtained by calling git write-tree, which creates a tree object without a commit and returns a reference. Also, the HEAD SHA of the temporary Git repository containing a fresh copy from the new template is determined. Then, the diffs involving the dirty HEAD of the subproject or the fresh copy of the new template are computed by referring to these object references. A trick for comparing local Git repositories makes computing the diffs efficient by avoiding fetching shallow copies from local repositories.
Honestly, I'm not super happy with this solution because the update algorithm is getting more and more complex and it feels like we're increasingly stretching the limits of Git. But it keeps the test suite passing while avoiding the subproject copy. :shrug:
WDYT?
Fixes #1709.
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 97.75%. Comparing base (
c685192) to head (19d2a42). Report is 83 commits behind head on master.
Additional details and impacted files
@@ Coverage Diff @@
## master #1711 +/- ##
==========================================
+ Coverage 97.61% 97.75% +0.13%
==========================================
Files 49 49
Lines 5042 5068 +26
==========================================
+ Hits 4922 4954 +32
+ Misses 120 114 -6
| Flag | Coverage Δ | |
|---|---|---|
| unittests | 97.75% <100.00%> (+0.13%) |
:arrow_up: |
Flags with carried forward coverage won't be shown. Click here to find out more.
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
Well, it's working except on Windows. I'll investigate tomorrow.
Now, Windows is happy, too. I missed the Windows-specific alternate object directory separator in the Git documentation (; for Windows, : for Unix): https://git-scm.com/docs/git#Documentation/git.txt-codeGITALTERNATEOBJECTDIRECTORIEScode
This is ready for a critical review now. :balloon:
Honestly, I'm not super happy with this solution because the update algorithm is getting more and more complex and it feels like we're increasingly stretching the limits of Git. But it keeps the test suite passing while avoiding the subproject copy. 🤷
Yep, not great for future maintainers, but what other option do we have :confused:? Git is pretty well documented, and you commented the code well too, so I guess it's OK.
I've rebased this PR onto the tip of master and factored out enhancing the git command with additional object directories into a separate function, as we need it more than once now.
Hi,
There is also a regression in 9.3.0, 9.3.1, copier update fails if there is a directory in the destination with no read/write access for the current user (on linux). (because of copytree)
This PR is solves it.
I've added a test that involves updating a project with a separate Git directory. To make this test pass, I've replaced the manual construction of the Git objects directory path by a subprocess call of git that returns the correct path also when a separate Git directory is used.
While working on this fix, I gained a better understanding of Git object sharing, mainly from these resources:
- https://git-scm.com/docs/gitrepository-layout#Documentation/gitrepository-layout.txt-objectsinfoalternates
- https://docs.gitlab.com/ee/development/git_object_deduplication.html
I had been a little unhappy about using the environment variable GIT_ALTERNATE_OBJECT_DIRECTORIES and introducing a helper function that constructs this environment variable and assigns it to a Plumbum git command. Then, I realized that Git object sharing can be configured also by adding paths to other repositories' objects directories to $GIT_DIR/objects/info/alternates. The result is the same, but the semantics of configuring a Git repository with alternate object stores feels conceptually more sound to me than calling git with a particular environment variable.
So, the new implementation involves configuring alternate object directories in the temporary repository created by the update algorithm. Then, a regular git diff-tree command works with object references borrowed from linked repositories.
Oh, Windows isn't working – again. I'll check what to do about it.
It's working on Windows now, too. Git is sensitive to newline characters in the .git/objects/info/alternates file; it must be the \n character on both Linux/macOS and Windows, not \r\n on Windows. I hadn't realized that Path(...).write_text("\n") writes \r\n on Windows, that's what was keeping the Windows CI jobs from passing.
I think this PR is ready for another review.
Thanks again for bringing up the --separate-git-dir case and for confirming it's working now, @hparfr. :pray:
In essence, instead of fetching copies of other Git repositories, their object stores are linked (via entries in the .git/objects/info/alternates file) which avoids duplicating objects. Then, a foreign object can be referenced by its object ID. This only works for local repositories, but Copier's update algorithm operates on local repositories only (after cloning the old and new template versions), so all good.
Thanks for the positive feedback. :pray: I'll press merge now.