hound

`archive` pseudo-vcs driver: indexing code in archives (e.g. zip, tar) without extracting files

Open muravjov opened this issue 1 year ago • 1 comments

What kind of change does this PR introduce? (check at least one)

  • [ ] Bugfix
  • [x] Feature
  • [ ] Code style update
  • [ ] Refactor
  • [ ] Build-related changes
  • [ ] Other, please describe:

The PR fulfills these requirements:

  • [X] All tests are passing?
  • [X] New/updated tests are included?
  • [ ] If any static assets have been updated, has ui/bindata.go been regenerated?
  • [ ] Are there doc blocks for functions that I updated/created?

If adding a new feature, the PR's description includes:

  • [X] A convincing reason for adding this feature (to avoid wasting your time, it's best to open a suggestion issue first and wait for approval before working on it)

Description:

This PR adds a new driver, `archive`, which indexes source code inside archives (e.g. zip, tar; any format supported by https://github.com/mholt/archiver) without extracting files. While indexing, files are walked through the archive API; while searching, matches are verified and snippets are generated from files extracted on the fly.

A config example:

{
  "dbpath" : "db",
  "vcs-config" : {
    "git": {
      "ref" : "main"
    }
  },
  "repos" : {
    "video" : {
      "url" : "/Volumes/1tb-ext4/twitch/video.zip",
      "vcs" : "archive",
      "vcs-config" : {
        "ignored-files" : [".git"]
      },
      "url-pattern" : {
        "base-url" : "file:///Volumes/1tb-ext4/src/twitch/{path}"
      }
    }
  }
}
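
As a rough illustration of how such a config could be read, the sketch below decodes the relevant keys with `encoding/json` and selects the repos that use the `archive` driver. The struct and function names (`config`, `repoConfig`, `archiveRepos`) are hypothetical and are not Hound's actual types; only the JSON keys come from the example above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// repoConfig mirrors the per-repo keys used in the example config.
// Hypothetical type, for illustration only.
type repoConfig struct {
	URL       string `json:"url"`
	VCS       string `json:"vcs"`
	VCSConfig struct {
		IgnoredFiles []string `json:"ignored-files"`
	} `json:"vcs-config"`
}

// config holds only the top-level keys this sketch needs;
// unknown keys are ignored by json.Unmarshal.
type config struct {
	DBPath string                `json:"dbpath"`
	Repos  map[string]repoConfig `json:"repos"`
}

// archiveRepos returns the repos whose "vcs" is "archive".
func archiveRepos(data []byte) (map[string]repoConfig, error) {
	var c config
	if err := json.Unmarshal(data, &c); err != nil {
		return nil, err
	}
	out := make(map[string]repoConfig)
	for name, r := range c.Repos {
		if r.VCS == "archive" {
			out[name] = r
		}
	}
	return out, nil
}

func main() {
	const example = `{
	  "dbpath": "db",
	  "repos": {
	    "video": {
	      "url": "/Volumes/1tb-ext4/twitch/video.zip",
	      "vcs": "archive",
	      "vcs-config": {"ignored-files": [".git"]}
	    }
	  }
	}`
	repos, err := archiveRepos([]byte(example))
	if err != nil {
		panic(err)
	}
	for name, r := range repos {
		fmt.Printf("%s: %s (ignoring %v)\n", name, r.URL, r.VCSConfig.IgnoredFiles)
	}
}
```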

Some metrics:

  • 160 zip files totalling 126GB produced 3GB of indexes (roughly 2.4% of the archive size)
  • a search request takes about 13 seconds to execute

muravjov, May 21 '24 22:05

@salemhilal would you mind reviewing the PR?

muravjov, Jun 01 '24 21:06