sparse index support
Quick recap: What is a sparse index?
Instead of containing one entry for every file in the worktree ("regular" index structure), a sparse index only contains a subset of these. Additionally, it contains entries for directories that are marked with the SKIP_WORKTREE flag. All files within these directories can be skipped by functions that read / update the index, which improves performance.
If the index file contains the "Sparse Directory Entries" extension, marked by the signature sdir, it is classified as a sparse index.
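As a rough illustration of that classification, here is a minimal sketch that walks the extension area of an index file and looks for the `sdir` signature. It assumes the caller has already sliced out the bytes between the last entry and the trailing checksum, and that each extension is laid out as a 4-byte signature, a 4-byte big-endian size and the payload. The function name is made up for this example and is not part of gitoxide.

```rust
/// Hypothetical helper: report whether the extension area contains an `sdir` extension.
/// `extensions` is assumed to hold only the bytes after the last index entry and
/// before the trailing checksum.
fn has_sparse_directory_extension(extensions: &[u8]) -> bool {
    let mut rest = extensions;
    // Each extension: 4-byte signature, 4-byte big-endian payload size, payload.
    while rest.len() >= 8 {
        let signature = &rest[..4];
        let size = u32::from_be_bytes([rest[4], rest[5], rest[6], rest[7]]) as usize;
        if signature == b"sdir" {
            return true;
        }
        // Skip header and payload; give up on truncated data instead of panicking.
        match rest.get(8 + size..) {
            Some(next) => rest = next,
            None => return false,
        }
    }
    false
}
```

A real reader would also have to parse the header and all entries first to find where the extension area starts; that part is left out here.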
Motivation
The goal of this issue is to keep track of the requirements necessary to eventually fully integrate sparse index support for gitoxide.
This issue does not yet contain all the tasks and considerations by any means, but the goal is to add new knowledge and keep everything up to date as I go along and things become more clear.
Tasks
- [x] reading
- [x] "regular" index with files containing the SKIP_WORKTREE flag
- [x] sparse index with directories containing the SKIP_WORKTREE flag, in cone mode
- [x] write specific tests to verify those behaviours
- [x] #563
- [ ] Tree extension order in gitoxide differs from git's, which prevents raw byte comparisons
- [ ] configure index version via `write::Options`
- [x] find out what options in git-config influence / configure sparse index related tasks to better understand what capabilities are needed
- [x] update `gix progress` with those findings
- [x] #635
- [ ] implement functionality similar to `ensure_full_index()`
  - scan index for sparse directory entries (trees) and expand them into a full list of filepaths (regular index structure), mutating the current index `State`, for use in subsequent functions that don't support working with sparse indexes yet
  - find out where and how it makes sense to use that function
- [ ] matching logic of `.git/info/sparse-checkout`
  - [ ] cone mode
  - [ ] no-cone mode (inverted .gitignore); this functionality is deprecated in git
- [ ] restore DIR information during writing or as a separate step as indicated here
- [ ] command similar to `git sparse-checkout set / add`
  - [ ] support `--cone` and `--no-cone` flags
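To make the `ensure_full_index()`-like task above more concrete, here is a minimal sketch of the expansion step. Everything in it is hypothetical: `Entry` is a heavily simplified stand-in for an index entry, and `Trees` replaces the object database lookup with a map from a directory path (with trailing slash) to the file paths below it. It is not gitoxide's API, only the shape of the operation.

```rust
use std::collections::BTreeMap;

/// Hypothetical, simplified index entry; not gitoxide's actual type.
#[derive(Debug, Clone, PartialEq)]
struct Entry {
    path: String,
    /// True for a sparse directory entry (a tree), false for a regular file entry.
    is_sparse_dir: bool,
    skip_worktree: bool,
}

/// Hypothetical stand-in for reading trees from the object database:
/// directory path (with trailing '/') -> relative paths of all files below it.
type Trees = BTreeMap<String, Vec<String>>;

/// Replace every sparse directory entry with one file entry per contained path.
/// On error the entries are left untouched.
fn ensure_full_index(entries: &mut Vec<Entry>, trees: &Trees) -> Result<(), String> {
    let mut expanded = Vec::with_capacity(entries.len());
    for entry in entries.iter() {
        if !entry.is_sparse_dir {
            expanded.push(entry.clone());
            continue;
        }
        let files = trees
            .get(&entry.path)
            .ok_or_else(|| format!("tree for '{}' not found", entry.path))?;
        for file in files {
            expanded.push(Entry {
                path: format!("{}{}", entry.path, file),
                is_sparse_dir: false,
                // Files expanded from a skipped directory inherit its flag.
                skip_worktree: entry.skip_worktree,
            });
        }
    }
    // Keep entries ordered by path (byte-wise comparison).
    expanded.sort_by(|a, b| a.path.cmp(&b.path));
    *entries = expanded;
    Ok(())
}
```

Building the expanded list on the side and swapping it in at the end keeps the index unchanged if a tree cannot be found, which seems like the safer default for a function that mutates shared state.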
Notes
- the `git sparse-checkout set / add` commands modify the list of patterns contained in `.git/info/sparse-checkout`, which uses the same syntax as a `.gitignore` file. Cone mode and non-cone mode decide how this file gets interpreted. Cone mode will match only directories while non-cone mode will use the same matching logic used for `.gitignore` files.
- non-cone mode and sparse index are incompatible with each other
  - that makes sense because sparse indexes mark entire directories as `SKIP_WORKTREE`, which is what cone mode matches on, while non-cone mode can also match on single files, which does not reduce the number of entries in the index
- non-cone mode is now deprecated
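The directory-only matching of cone mode can be sketched as follows. This is a simplified model that works on a plain list of cone directories rather than on the patterns actually written to `.git/info/sparse-checkout`, and it assumes git's usual cone semantics: files at the top level are always included, files anywhere below a listed directory are included, and files directly inside an ancestor of a listed directory are included. The function name is made up for illustration.

```rust
/// Hypothetical helper: decide whether `path` (relative to the repository root,
/// '/'-separated) is part of the checkout described by `cone_dirs`.
fn is_in_cone(path: &str, cone_dirs: &[&str]) -> bool {
    // Files at the top level have no parent directory and are always included.
    let Some((parent, _file)) = path.rsplit_once('/') else {
        return true;
    };
    cone_dirs.iter().any(|dir| {
        let dir = dir.trim_end_matches('/');
        // Recursive match: the file lives in `dir` or anywhere below it.
        let inside = parent == dir || parent.starts_with(&format!("{dir}/"));
        // Parent match: the file sits directly in an ancestor directory of `dir`.
        let in_ancestor = dir.starts_with(&format!("{parent}/"));
        inside || in_ancestor
    })
}
```

Because only whole directories are compared, a path that falls outside every cone can be represented by a single skipped directory entry, which is what makes cone mode a natural fit for the sparse index.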
References
Thanks so much for setting up this tracking issue and taking the lead on this! I can't wait to see more and more of these boxes ticked.
> inverted .gitignore matching logic: does this need to be supported with non-cone mode being deprecated?
I think it's OK to focus on cone mode but fail gracefully in non-cone mode from day one. From there we can decide if it's worth maintaining non-cone mode as well, probably based on people actually requesting it to be supported.
For posterity, since I keep finding myself puzzled about what states sparse indices exist in, here is an analysis in code that sums it up.
Interesting bits of this recently added technical document
- rename 'sparse directory' to 'skipped directory' - I think we should do that too.
- reading about partial clones and automatic on-demand downloads of packs makes me afraid of all the added complexity that will be needed to handle all of that.
- this portion makes me think that within `gitoxide`, probably the `Repository` instance, there should be settings for how sparsity should affect operations to be adjustable on a case-by-case basis.
- `partial clone` is mentioned multiple times, and I think there is a lot that I don't know about that.
- I absolutely think that turning off dynamic downloads/partial clones on demand is going to be part of a first implementation to support fully offline use of git (everything else seems like a 'perversion'), which is a bit different from what `git` plans to do
- whether or not commands see the sparse index as sparse should ultimately be configurable to ideally handle this VFC (behaviour C) usecase as well.
- loving this oversimplified listing of behaviours and what they mean, along with the `A*` part of not auto-downloading objects.
- really helpful to have a list of commands that need to be sparsity-aware.
- a nice summary of command-behaviours based on an analysis of all git commands
- it's interesting to learn that merges operate on all files and thus might conflict and temporarily 'vivify' these conflicts into the worktree despite otherwise being skipped.
- there is a nice list with suggestions on how to name these 'sparsity' related flags on commands
- The Known Bugs section is probably good to learn what to avoid early on, or the traps that implementing sparsity correctly might contain. It's also good to know for the time when we try to learn from `git` code, and wonder why it doesn't handle things we think it should handle - some like `read-tree` don't do it correctly when sparsity is involved. We should do better from day one.
- On the mailing list there is a nice sample repo from which to build a test-case that has terrible performance characteristics with some git operations. Can we one day run this and see what happens?