
Updating workdir index is slow for large repos, has no progress output

Open olsen232 opened this issue 1 year ago • 1 comments

We use git to update an index for the "workdir" - the filesystem working copy. Git is good at detecting changes to files efficiently, which is what we need for kart diff / kart status. (Git can quickly decide that a file is unchanged by comparing its mtime and other stats against the values stored in the index. Only if those have changed does it hash the file and compare the result to the hash in the index.)
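
To illustrate the fast path described above, here is a minimal sketch in plain Python. It is not Git's or Kart's actual code: `CachedEntry`, `record` and `is_modified` are hypothetical names, and real Git compares more stat fields than mtime and size and also handles same-timestamp races. The blob ID calculation (SHA-1 over a `blob <size>\0` header plus the content) is the standard Git one.

```python
import hashlib
import os
from dataclasses import dataclass


@dataclass
class CachedEntry:
    """Hypothetical stand-in for what an index entry caches about a file."""
    mtime_ns: int
    size: int
    blob_id: str


def git_blob_id(data: bytes) -> str:
    # Git's blob ID: SHA-1 over a "blob <size>\0" header followed by the content.
    return hashlib.sha1(b"blob %d\x00" % len(data) + data).hexdigest()


def record(path: str) -> CachedEntry:
    # Done once, when the index is written: stat the file and hash it.
    st = os.stat(path)
    with open(path, "rb") as f:
        return CachedEntry(st.st_mtime_ns, st.st_size, git_blob_id(f.read()))


def is_modified(path: str, cached: CachedEntry) -> bool:
    st = os.stat(path)
    if st.st_mtime_ns == cached.mtime_ns and st.st_size == cached.size:
        return False  # fast path: stats match, the file is never read
    # Slow path: stats differ, so hash the content and compare.
    with open(path, "rb") as f:
        return git_blob_id(f.read()) != cached.blob_id
```

With the stats cached, an unchanged file costs one `os.stat` call. Without them, every file takes the slow path on every status or diff.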

On the initial checkout, we tell git to make an index from all the files in the workdir. It correctly applies the LFS filter (for those files that match the LFS pattern), then puts the hashes of the LFS pointer files into the index (and the pointer files themselves into the ODB). So, things work pretty much as they do in any other Git LFS repo.
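
As a rough sketch of what the filter step produces for one file: the content is hashed with sha256, a small pointer file is built from that hash and the size, and the Git blob ID of the pointer is what ends up in the index. The pointer layout below follows the Git LFS v1 pointer format. The function names are made up for illustration, and this is not how Git or git-lfs implement the filter internally.

```python
import hashlib


def lfs_pointer(content: bytes) -> bytes:
    # Git LFS v1 pointer file: version line, sha256 oid, size in bytes.
    oid = hashlib.sha256(content).hexdigest()
    return (
        "version https://git-lfs.github.com/spec/v1\n"
        f"oid sha256:{oid}\n"
        f"size {len(content)}\n"
    ).encode("utf-8")


def git_blob_id(data: bytes) -> str:
    # Git's blob ID: SHA-1 over a "blob <size>\0" header followed by the content.
    return hashlib.sha1(b"blob %d\x00" % len(data) + data).hexdigest()


def index_hash_for_lfs_file(content: bytes) -> str:
    # The index stores the hash of the pointer file, not of the large file itself.
    return git_blob_id(lfs_pointer(content))
```

Note that the large file is read and hashed in full just to produce the pointer, which is where the time goes for big tiles.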

We could use pygit2 to write the index, do it more manually, and get progress reporting, but pygit2 doesn't understand filters (such as the LFS filter). That part is workable: if we were writing code in Kart to manually build up a pygit2 index, we could add code there to apply the filter ourselves. Harder to work around is that the pygit2 Index API will only let you write (path, filemode, hash) tuples - it doesn't expose the extra stats such as mtime. Without those stats git loses its fast path, so kart status and kart diff will take ages, since every file in the workdir has to be hashed to see if anything has changed. This might need changes in both pygit2 and libgit2 to fix.

If we do stop using git to write the index - and use kart + pygit2 code instead - we could also make the initial checkout more efficient. We already know the sha256 hashes for the LFS tiles, so we shouldn't need to hash them again to get the pointer files. Re-hashing them is presumably what Git is spending a lot of time doing when we tell it "make an index from these files".
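
A sketch of that idea, assuming we already have each tile's path, sha256 and size to hand (the `tiles` input, `build_entries` and the `on_progress` callback are all hypothetical, not existing Kart or pygit2 APIs). The pointer file is built straight from the known hash, so only the tiny pointer gets hashed and the tile itself is never read. The loop also gives a natural place to report progress. It yields the same (path, filemode, hash) tuples mentioned above, so it does not solve the missing-mtime problem.

```python
import hashlib
from typing import Callable, Iterable, Iterator, Tuple

FILEMODE_BLOB = 0o100644  # regular, non-executable file


def pointer_from_known_hash(sha256_hex: str, size: int) -> bytes:
    # Git LFS v1 pointer built from a hash we already have; no tile I/O.
    return (
        "version https://git-lfs.github.com/spec/v1\n"
        f"oid sha256:{sha256_hex}\n"
        f"size {size}\n"
    ).encode("utf-8")


def git_blob_id(data: bytes) -> str:
    # Git's blob ID: SHA-1 over a "blob <size>\0" header followed by the content.
    return hashlib.sha1(b"blob %d\x00" % len(data) + data).hexdigest()


def build_entries(
    tiles: Iterable[Tuple[str, str, int]],  # (path, sha256 hex, size in bytes)
    total: int,
    on_progress: Callable[[int, int], None],
) -> Iterator[Tuple[str, int, str]]:
    for done, (path, sha256_hex, size) in enumerate(tiles, start=1):
        pointer = pointer_from_known_hash(sha256_hex, size)
        # Only the small pointer is hashed here, never the tile content.
        yield path, FILEMODE_BLOB, git_blob_id(pointer)
        on_progress(done, total)
```

For example, `list(build_entries(tiles, len(tiles), lambda done, total: print(f"{done}/{total}")))` would print one progress line per tile while collecting the entries.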

olsen232 • Jun 26 '23 09:06