`archive` pseudo-vcs driver: indexing code in archives (e.g. zip, tar) without extracting files
What kind of change does this PR introduce? (check at least one)
- [ ] Bugfix
- [x] Feature
- [ ] Code style update
- [ ] Refactor
- [ ] Build-related changes
- [ ] Other, please describe:
The PR fulfills these requirements:
- [X] All tests are passing?
- [X] New/updated tests are included?
- [ ] If any static assets have been updated, has ui/bindata.go been regenerated?
- [ ] Are there doc blocks for functions that I updated/created?
If adding a new feature, the PR's description includes:
- [X] A convincing reason for adding this feature (to avoid wasting your time, it's best to open a suggestion issue first and wait for approval before working on it)
Description:
This PR adds a new driver, `archive`, which makes it possible to index source code inside archives (e.g. zip, tar; any format supported by https://github.com/mholt/archiver) without extracting the files. While indexing, files are walked using the archive API. While searching, results are checked and snippets are generated from files extracted on the fly.
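To illustrate the idea of walking an archive without extracting it, here is a minimal sketch. It is not the code from this PR: it uses the standard library's `archive/zip` instead of mholt/archiver, and the helper names (`buildSampleZip`, `walkZip`, `isIgnored`) are hypothetical. It also assumes that an ignored entry matches any path component, which may differ from how the driver treats `ignored-files`.

```go
package main

import (
	"archive/zip"
	"bytes"
	"fmt"
	"io"
	"strings"
)

// buildSampleZip creates a small in-memory zip so the sketch runs
// without touching any files on disk.
func buildSampleZip(files map[string]string) ([]byte, error) {
	var buf bytes.Buffer
	zw := zip.NewWriter(&buf)
	for name, body := range files {
		w, err := zw.Create(name)
		if err != nil {
			return nil, err
		}
		if _, err := io.WriteString(w, body); err != nil {
			return nil, err
		}
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// isIgnored reports whether any path component equals an ignored name.
// This matching rule is an assumption made for the sketch.
func isIgnored(name string, ignored []string) bool {
	for _, part := range strings.Split(name, "/") {
		for _, ig := range ignored {
			if part == ig {
				return true
			}
		}
	}
	return false
}

// walkZip calls visit for every regular file in the archive. Content is
// read straight from the archive reader; nothing is written to disk.
func walkZip(data []byte, ignored []string, visit func(name string, content []byte) error) error {
	zr, err := zip.NewReader(bytes.NewReader(data), int64(len(data)))
	if err != nil {
		return err
	}
	for _, f := range zr.File {
		if f.FileInfo().IsDir() || isIgnored(f.Name, ignored) {
			continue
		}
		rc, err := f.Open()
		if err != nil {
			return err
		}
		content, err := io.ReadAll(rc)
		rc.Close()
		if err != nil {
			return err
		}
		if err := visit(f.Name, content); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	data, err := buildSampleZip(map[string]string{
		"src/main.go": "package main\n",
		".git/HEAD":   "ref: refs/heads/main\n",
	})
	if err != nil {
		panic(err)
	}
	// An indexer would feed content to the index here instead of printing.
	err = walkZip(data, []string{".git"}, func(name string, content []byte) error {
		fmt.Printf("%s (%d bytes)\n", name, len(content))
		return nil
	})
	if err != nil {
		panic(err)
	}
}
```

The same callback shape could serve both phases: indexing visits every file once, while search would open only the files that matched.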
A config example:
{
  "dbpath" : "db",
  "vcs-config" : {
    "git": {
      "ref" : "main"
    }
  },
  "repos" : {
    "video" : {
      "url" : "/Volumes/1tb-ext4/twitch/video.zip",
      "vcs" : "archive",
      "vcs-config" : {
        "ignored-files" : [".git"]
      },
      "url-pattern" : {
        "base-url" : "file:///Volumes/1tb-ext4/src/twitch/{path}"
      }
    }
  }
}
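As a rough illustration of how the repository entry above could be read, the sketch below decodes one entry under `repos` and expands the `{path}` placeholder in `base-url`. The `repoConfig` type and the `parseRepo` and `linkFor` helpers are hypothetical and are not Hound's actual types. The plain string substitution is an assumption about how the placeholder is meant to be filled.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// repoConfig is a hypothetical struct mirroring one entry under "repos".
// It is not Hound's real configuration type.
type repoConfig struct {
	URL       string `json:"url"`
	VCS       string `json:"vcs"`
	VCSConfig struct {
		IgnoredFiles []string `json:"ignored-files"`
	} `json:"vcs-config"`
	URLPattern struct {
		BaseURL string `json:"base-url"`
	} `json:"url-pattern"`
}

// parseRepo decodes a single repository entry from JSON.
func parseRepo(raw string) (repoConfig, error) {
	var rc repoConfig
	err := json.Unmarshal([]byte(raw), &rc)
	return rc, err
}

// linkFor substitutes the in-archive file path for the {path} placeholder
// (assumed behaviour, shown here as a plain string replacement).
func linkFor(rc repoConfig, path string) string {
	return strings.ReplaceAll(rc.URLPattern.BaseURL, "{path}", path)
}

func main() {
	raw := `{
	  "url": "/Volumes/1tb-ext4/twitch/video.zip",
	  "vcs": "archive",
	  "vcs-config": {"ignored-files": [".git"]},
	  "url-pattern": {"base-url": "file:///Volumes/1tb-ext4/src/twitch/{path}"}
	}`
	rc, err := parseRepo(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(rc.VCS, rc.VCSConfig.IgnoredFiles)
	fmt.Println(linkFor(rc, "video/player.go"))
}
```

Here `video/player.go` is a made-up file path used only to show the substitution.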
Some metrics:
- for 160 zip files (126GB in total), I got 3GB of indexes
- a search request takes about 13 seconds to execute
@salemhilal would you mind reviewing this PR?