sparse index support

Open SidneyDouw opened this issue 3 years ago • 3 comments

Quick recap: What is a sparse index?

Instead of containing one entry for every file in the worktree (the "regular" index structure), a sparse index contains only a subset of these. Additionally, it contains entries for directories that are marked with the SKIP_WORKTREE flag. Functions that read or update the index can skip all files within these directories, which improves performance.

If the index file contains the "Sparse Directory Entries" extension, marked by the signature sdir, it is classified as a sparse index.
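
To make the recap concrete, here is a minimal sketch of that classification. The `Entry` and `Index` types are simplified, hypothetical stand-ins and not gitoxide's actual data structures; only the sdir signature and the SKIP_WORKTREE flag come from the description above.

```rust
/// Signature of the "Sparse Directory Entries" extension.
const SPARSE_DIRECTORY_SIGNATURE: [u8; 4] = *b"sdir";

/// Hypothetical, simplified index entry (not gitoxide's real type).
struct Entry {
    path: String,
    /// True if the entry refers to a directory (tree) instead of a file.
    is_dir: bool,
    /// True if the SKIP_WORKTREE flag is set on this entry.
    skip_worktree: bool,
}

/// Hypothetical, simplified view of a decoded index file.
struct Index {
    entries: Vec<Entry>,
    /// Signatures of all extensions found in the index file.
    extensions: Vec<[u8; 4]>,
}

/// An index counts as sparse if it carries the sdir extension.
fn is_sparse(index: &Index) -> bool {
    index.extensions.contains(&SPARSE_DIRECTORY_SIGNATURE)
}

/// Paths of the directory entries whose contents can be skipped.
fn sparse_directory_paths(index: &Index) -> Vec<&str> {
    index
        .entries
        .iter()
        .filter(|e| e.is_dir && e.skip_worktree)
        .map(|e| e.path.as_str())
        .collect()
}
```

With an index holding a file `README.md` and a skipped directory `docs/`, `sparse_directory_paths` would return only `docs/`.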

Motivation

The goal of this issue is to keep track of the requirements necessary to eventually fully integrate sparse index support for gitoxide.

This issue does not yet contain all the tasks and considerations by any means, but the goal is to add new knowledge and keep everything up to date as I go along and things become more clear.

Tasks

  • [x] reading
    • [x] "regular" index with files containing the SKIP_WORKTREE flag
    • [x] sparse index with directories containing the SKIP_WORKTREE flag, in cone mode
    • [x] write specific tests to verify those behaviours
  • [x] #563
    • [ ] Tree extension order in gitoxide differs from the one in git, which prevents raw byte comparisons
    • [ ] configure index version via write::Options
  • [x] find out what options in git-config influence / configure sparse index related tasks to better understand what capabilities are needed
    • [x] update gix progress with those findings
  • [x] #635
  • [ ] implement functionality similar to ensure_full_index()
    • scan index for sparse directory entries (trees) and expand them into a full list of filepaths (regular index structure), mutating the current index State, for use in subsequent functions that don't support working with sparse indexes yet
    • find out where and how it makes sense to use that function
  • [ ] matching logic of .git/info/sparse-checkout for cone mode
    • [ ] cone mode
    • [ ] non-cone mode (inverted .gitignore); this functionality is deprecated in git
  • [ ] restore DIR information during writing or as separate step as indicated here
  • [ ] command similar to git sparse-checkout set / add
    • [ ] support --cone and --no-cone flags
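
The ensure_full_index()-like task above could look roughly like the following sketch. It is not an actual gitoxide implementation: `Entry` is a hypothetical simplified type, and `files_in_tree` is an assumed callback standing in for a recursive tree traversal in the object database. Keeping the SKIP_WORKTREE flag on the expanded file entries is also an assumption of this sketch.

```rust
/// Hypothetical, simplified index entry (not gitoxide's real type).
#[derive(Debug, Clone, PartialEq)]
struct Entry {
    path: String,
    is_dir: bool,
    skip_worktree: bool,
}

/// Replace every sparse directory entry with one file entry per file below it,
/// mutating `entries` into a "regular" index structure.
///
/// `files_in_tree` is a hypothetical callback that returns the paths of all files
/// below the given directory, relative to that directory.
fn expand_to_full_index(
    entries: &mut Vec<Entry>,
    files_in_tree: impl Fn(&str) -> Vec<String>,
) {
    let mut full = Vec::with_capacity(entries.len());
    for entry in entries.drain(..) {
        if entry.is_dir && entry.skip_worktree {
            // Sparse directory paths end with '/', so plain concatenation works here.
            for file in files_in_tree(&entry.path) {
                full.push(Entry {
                    path: format!("{}{}", entry.path, file),
                    is_dir: false,
                    skip_worktree: true,
                });
            }
        } else {
            full.push(entry);
        }
    }
    // Simplification: restore ordering by comparing whole paths byte-wise.
    full.sort_by(|a, b| a.path.cmp(&b.path));
    *entries = full;
}
```

Functions that cannot deal with sparse directory entries yet could call such a helper first and then operate on the expanded list.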

Notes

  • the git sparse-checkout set / add commands modify the list of files contained in .git/info/sparse-checkout, which uses the same syntax as a .gitignore file. Cone mode and non-cone mode decide how this file gets interpreted. Cone mode will match only directories while non-cone mode will use the same matching logic used for .gitignore files. read more
  • non-cone mode and sparse index are incompatible with each other. That makes sense because sparse indexes mark entire directories as SKIP_WORKTREE, which is what cone mode matches on, while non-cone mode can also match on single files, which does not reduce the number of entries in the index
  • non-cone mode is now deprecated
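
As a rough illustration of directory-based matching in cone mode, the sketch below decides whether a file belongs to the cone. It assumes the patterns in .git/info/sparse-checkout have already been parsed into a list of recursively included directories (the parsing itself is not shown), and the function name is hypothetical. Files in the repository root and files directly inside an ancestor of an included directory are treated as part of the cone.

```rust
/// Decide whether `file_path` is part of the cone.
///
/// `recursive_dirs` holds directories without leading or trailing slash,
/// for example "src/lib". This list is assumed to be the result of parsing
/// the sparse-checkout file, which is not part of this sketch.
fn is_included_in_cone(file_path: &str, recursive_dirs: &[&str]) -> bool {
    let parent = match file_path.rfind('/') {
        Some(pos) => &file_path[..pos],
        // Files directly in the repository root are always included.
        None => return true,
    };
    recursive_dirs.iter().any(|dir| {
        // The file lives somewhere below a recursively included directory...
        parent == *dir
            || parent.starts_with(&format!("{dir}/"))
            // ...or directly inside one of that directory's ancestors.
            || dir.starts_with(&format!("{parent}/"))
    })
}
```

Because only directory prefixes are compared, whole directories outside the cone can be represented by a single skipped entry, which is what makes cone mode a fit for the sparse index.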

References

SidneyDouw avatar Oct 20 '22 07:10 SidneyDouw

Thanks so much for setting up this tracking issue and taking the lead on this! I can't wait to see more and more of these boxes ticked.

inverted .gitignore matching logic: does this need to be supported, with non-cone mode being deprecated?

I think it's OK to focus on cone mode but fail gracefully in non-cone mode from day one. From there we can decide if it's worth maintaining non-cone mode as well, probably based on people actually requesting it to be supported.
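
A minimal sketch of what failing gracefully could mean in code, assuming the two booleans have already been resolved from the core.sparseCheckout and core.sparseCheckoutCone configuration values. All type and function names here are hypothetical and not part of gitoxide.

```rust
/// Hypothetical description of how sparse checkout is configured.
#[derive(Debug, PartialEq)]
enum SparseCheckoutMode {
    Disabled,
    Cone,
}

/// Hypothetical error for configurations that are not supported (yet).
#[derive(Debug, PartialEq)]
enum UnsupportedMode {
    NonCone,
}

/// `enabled` and `cone` are assumed to be the already resolved values of
/// core.sparseCheckout and core.sparseCheckoutCone.
fn sparse_checkout_mode(
    enabled: bool,
    cone: bool,
) -> Result<SparseCheckoutMode, UnsupportedMode> {
    match (enabled, cone) {
        (false, _) => Ok(SparseCheckoutMode::Disabled),
        (true, true) => Ok(SparseCheckoutMode::Cone),
        // Report non-cone mode as an error instead of silently misbehaving.
        (true, false) => Err(UnsupportedMode::NonCone),
    }
}
```

Callers would get an explicit error for non-cone repositories, which leaves room to add support later without changing the signature.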

Byron avatar Oct 21 '22 02:10 Byron

For posterity, since I keep finding myself puzzled about what states sparse indices exist in, here is an analysis in code that sums it up.

Byron avatar Nov 22 '22 07:11 Byron

Interesting bits of this recently added technical document

  • rename 'sparse directory' to 'skipped directory' - I think we should do that too.
  • reading about partial clones and automatic on-demand downloads of packs makes me afraid of all the added complexity that will be needed to handle all of that.
  • this portion makes me think that within gitoxide, probably the Repository instance, there should be settings for how sparsity should affect operations to be adjustable on a case-by-case basis.
  • partial clone is mentioned multiple times, and I think there is a lot that I don't know about that.
  • I absolutely think that turning off dynamic downloads/partial clones on demand is going to be part of a first implementation to support fully offline use of git (everything else seems like a 'perversion'), which is a bit different from what git plans to do
  • whether or not commands see the sparse index as sparse should ultimately be configurable to ideally handle this VFC (behaviour C) usecase as well.
  • loving this oversimplified listing of behaviours and what they mean, along with the A* part of not auto-downloading objects.
  • really helpful to have a list of commands that need to be sparsity-aware.
  • a nice summary of command-behaviours based on an analysis of all git commands
  • it's interesting to learn that merges operate on all files and thus might conflict and temporarily 'vivify' these conflicts into the worktree despite otherwise being skipped.
  • there is a nice list with suggestions on how to name these 'sparsity' related flags on commands
  • The Known Bugs section is probably good to learn what to avoid early on, or the traps that implementing sparsity correctly might contain. It's also good to know for the time when we try to learn from git code, and wonder why it doesn't handle things we think it should handle - some like read-tree don't do it correctly when sparsity is involved. We should do better from day one.
  • On the mailing list there is a nice sample repo from which to build a test-case that has terrible performance characteristics with some git operations. Can we one day run this and see what happens?

Byron avatar Nov 25 '22 10:11 Byron