zoekt wip: Create delta shard using diff between on disk index and upstream

Progress on https://github.com/sourcegraph/sourcegraph/issues/37063 DISCLAIMER: In its current state this PR is a POC. It is not feature complete. It lacks appropriate testing. Its goal is to elicit feedback and guide future development efforts!

All feedback is welcome. Particularly feedback which discusses the approach which I have outlined in detail below!

Background context

The foundational work to support incremental indexing resulted in massively decreased indexing times for larger repositories as only the changed files need to be reindexed. However it still requires shallow cloning all files in the repository, including those which have not changed. For large repositories this means substantially more time is spent cloning than performing the index. As such, reducing the time spent cloning by only downloading the changed files would cut the time down massively

Approach

The approach taken in this POC is to use git diffs where the flag -U is used to include the entire file in the diff context (as well as some other options, outlined in cmd/zoekt-index-diff/main.go to make things easier).

For each file:

Parsing the diff allows us to calculate both the original and new file contents as well as the old and new git blob hashes.
The original and new file path allow us to determine whether the file has been added/removed/modified
- Added if original filename is /dev/null
- Removed if new filename is /dev/null
- Modified in all other cases

For added files:

Check for any documents with the same filename for this repo in a shard
- Of those check if any document has the same content as the new content from diff
  - If so add current branch to document and insert document into new shard
  - Otherwise this must be a unique version of the file so create a new document for it

For removed files:

Check for any documents with the same filename for this repo in a shard
- Of those check if any document has the same content as the original content from diff (new content is blank - it's been removed)
  - If so remove current branch from document and insert document into new shard
  - Otherwise the file was unique so no document needs to be updated/added to new shard

For modified files:

Parse the old & new git blob hashes for the file
Check for any documents with the same filename for this repo in a shard
- Compute the git blob hashes for each document
- Check if any document hash matches the old version of the file
  - If so remove current branch from document and insert document into new shard
  - Otherwise the file was unique so no document needs to be updated/added to new shard
- Check if any document hash matches the new version of the file
  - If so add current branch to document and insert document into new shard
  - Otherwise this must be a unique version of the file so create a new document for it

That is the general overview of the approach. In all cases any other documents which matched the Check for any documents with the same filename for this repo in a shard but which weren't the one we were looking for are re-added to the new shard for file tombstone consistency.

The current implementation lacks some important checks

Verify old commit we are diffing against is actually what is on disk
- Otherwise there will be a mismatch during comparisons and the shard could end up in a bad state
Handle skipped/ignored files

DocReader

As reading existing documents is required by this POC I created DocReader as a very simple way of extracting documents directly out of shards without having to go through the Searcher interface. DocReader has enough functionality to get the job done for this POC and not much more. If a similar indexing method as outlined in this POC is further developed; DocReader like functionality should probably be developed first in a separate PR.

Drawbacks

Aside from the missing functionality mentioned above this general approach has a major drawback. It does not support multi-branch indexing. If there are changes on multiple branches they will need to be handled sequentially. It should be possible to adapt this approach by a not significant amount to support multi-branch indexing but that was out of scope for this POC

Test Plan

I added some tests to verify some functionality around the diffs
Lacking proper testing around the core flow added/removed/modified docs
- In its current form probably needs a refactor to be a bit more testable
Manually verified that the shards appeared correct after creating a delta shard

If you would like to play around with it I would recommend the following:

Add a couple branches to git config which sourcegraphFake will read #393 for indexing. It's easier for testing if one of these is branch you have checked out.

git config add zoekt.branch <branch>
git config add zoekt.branch <branch2>

Index the repo with zoekt-sourcegraph-indexserver

Make a change to a file and commit it.

Run the following modifying REPO_ID and REPO_NAME, and anything. These can be found in plaintext at the end of an existing shard in the metadata section.

git diff HEAD~..HEAD --full-index -U9999999 --no-prefix --no-renames | go run ./cmd/zoekt-index-diff -id "<REPO_ID>" -name "<REPO_NAME>" -branch "$(git rev-parse --abbrev-ref HEAD)" -commit "$(git rev-parse HEAD)"

Jul 15 '22 22:07 jac

new content is blank - it's been removed

it is valid to have empty files indexed. Would this break anything?

For modified files

Could you simplify this by just pretending a modified document has been both removed and added?

DocReader

Brilliant. I think this is a good candidate to be pulled out into another PR to minimize the size of this PR / reduce burden if this PR has multiple review rounds. But let me first review this PR, maybe we won't have a bunch of back and forth :)

Jul 18 '22 12:07 keegancsmith

new content is blank - it's been removed

it is valid to have empty files indexed. Would this break anything?

That's fine. We don't determine if it has been deleted based on if it is empty. I was just highlighting that we compare the original content against the on disk documents

For modified files

Could you simplify this by just pretending a modified document has been both removed and added?

It would require the logic to be tweaked a little. Currently, in the case of added/removed docs we iterate over all the documents with a matching name looking for a match. For all the non-matches they get added back to the builder as we iterate. The matching document (always in case of deletion, not always for addition as it could be a unique version) has its branches updated accordingly.

For a modified file for the deletion step we wouldn't be able to add the non-matching documents back to the builder straight away as the new content could match one of those documents. The old & new version would have to be compared at every step of the iteration in order to maintain correct document branches.

For addition & deletion currently comparison is done using bytes.Equal whereas modification uses git blob hash lookups in a map to avoid iterating over every document twice. This is something that would be worth benchmarking to see if it would be faster to use git blob hashes instead of bytes.Equal for addition & deletion. If so this would then lead nicely into being able to pretend a modified document has been both removed and added like you suggest

DocReader

Brilliant. I think this is a good candidate to be pulled out into another PR to minimize the size of this PR / reduce burden if this PR has multiple review rounds.

That's the plan. I needed the functionality to make this POC work but it should definitely be done in a separate PR if we are to go ahead with this

Jul 19 '22 14:07 jac

@jac I am guessing there is no active development on this right now. Shall we close this PR for now?

Feb 14 '23 12:02 stefanhengl

zoekt zoekt copied to clipboard

wip: Create delta shard using diff between on disk index and upstream

Background context

Approach

DocReader

Drawbacks

Test Plan

zoekt
zoekt copied to clipboard