zoekt
zoekt copied to clipboard
Refactor zoekt-webserver flags parsing
This PR is a small flag parsing refactoring to start extracting functions from cmd/zoekt-webserver/main.go to a newwebserver package.
The underlying idea here is to make the creation of webserver-like binaries easier. This is something we're considering doing at GitLab for the Zoekt integration. That way we can reduce the amount of duplicate code between the binaries.
Please let us know what do you think. Thank you!
Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Dmitry Gruzd. This is most likely caused by a git client misconfiguration; please make sure to:
- check if your git client is configured with an email to sign commits
git config --list | grep email - If not, set it up using
git config --global user.email [email protected] - Make sure that the git commit email is configured in your GitHub account settings, see https://github.com/settings/emails
We require contributors to sign our Contributor License Agreement (CLA), and we don't have yours on file. In order for us to review and merge your code, please sign CLA to get yourself added.
Sourcegraph teammates should refer to Accepting contributions for guidance.
We require contributors to sign our Contributor License Agreement (CLA), and we don't have yours on file. In order for us to review and merge your code, please sign CLA to get yourself added.
Sourcegraph teammates should refer to Accepting contributions for guidance.
@DylanGriffith Could I ask you to review this PR and assign it to a maintainer if you're happy? Thanks!
@keegancsmith Thank you for taking a look!
There is already the
webpackage which exists for this purpose, I suspect a bit too much has grown into the zoekt-webservercommand?
That's my impression as well. I'd like to make the code responsible for the command itself much more compact and reusable.
What key things do you need at GitLab that we should maybe move to the web package instead?
I'd say it's almost all the code from cmd/zoekt-webserver/main.go, but a lot of it is quite command-specific, so maybe we should introduce another package. It could be webserver (like in this PR) or web_cmd.
Overall, these are our current draft plans:
- Add webserver mode to the
gitlab-zoekt-indexerbinary. That way we can ship one binary to self-managed customers, but still keep reads and writes as separate processes. For example,gitlab-zoekt -mode indexerandgitlab-zoekt -mode webserver. With the current webserver code, we'd need to copy a lot ofcmd/zoekt-webserver/main.go, which is not ideal from the maintainability perspective. - Introduce features that could potentially be added to the upstream (if they are useful for the wider community):
- JWT auth
- Add federated search to the webserver. Right now for global searches we spawn threads on GitLab's side to send requests to multiple Zoekt pods, but we would like to offload that to the webserver instead.
- Support subdirectories. We'd like to have one directory per
repoIDto allow efficient deletions. Probably withoutfsnotify, but with some kind of signaling from the indexer process to avoid exhaustinginotifylimits.
Please let me know what you think 🤝
There is already the
webpackage which exists for this purpose, I suspect a bit too much has grown into the zoekt-webservercommand?That's my impression as well. I'd like to make the code responsible for the command itself much more compact and reusable.
Ok great. Let me tackle this command a bit, it's been on the back of my mind for a while to clean this up since it has grown so much. We may end up in a place where you don't need to introduce your new package.
Is your main goal to have all the same flags? Do you actually expect at gitlab for people to change most of those values? If it is like the sourcegraph use case, the flags we specify are pretty much static.
- Add webserver mode to the
gitlab-zoekt-indexerbinary. That way we can ship one binary to self-managed customers, but still keep reads and writes as separate processes. For example,gitlab-zoekt -mode indexerandgitlab-zoekt -mode webserver. With the current webserver code, we'd need to copy a lot ofcmd/zoekt-webserver/main.go, which is not ideal from the maintainability perspective.
Cool, we have thought about this in the past as well to simplify the release pipeline.
- JWT auth
Not exactly sure if that is something we should have in upstream. I'd recommend if you need that you use the web.Server handler and wrap it to add in the authorization checks in your own binary.
- Add federated search to the webserver. Right now for global searches we spawn threads on GitLab's side to send requests to multiple Zoekt pods, but we would like to offload that to the webserver instead.
FYI we have code to do this in Sourcegraph but its quite tied to how we do service discovery/telemetry. Additionally the license for it right now is not permissively licensed, but going back in history to a time when it was may be a viable way to copy-paste :) At the end of the day the code is quite straightforward to do TBH, you can probably copy-paste the code used to aggregate across shards inside of the shards pkg.
- Support subdirectories. We'd like to have one directory per
repoIDto allow efficient deletions. Probably withoutfsnotify, but with some kind of signaling from the indexer process to avoid exhaustinginotifylimits.
I'm interested to learn more about what problems you are observing here. Do you have examples of repositories with large number of shards and have found an issue around deletion? At the end of the day though we have thought a bit about introducing more synchronization between indexer and webserver, but have always just decided to keep it simple.
@keegancsmith
Ok great. Let me tackle this command a bit, it's been on the back of my mind for a while to clean this up since it has grown so much. We may end up in a place where you don't need to introduce your new package.
Sounds great, thank you!
Is your main goal to have all the same flags? Do you actually expect at gitlab for people to change most of those values? If it is like the sourcegraph use case, the flags we specify are pretty much static.
We'll probably have our own flags. I've opened this PR to add some value by the refactoring (options struct) as well as start this discussion.
FYI we have code to do this in Sourcegraph but its quite tied to how we do service discovery/telemetry. Additionally the license for it right now is not permissively licensed, but going back in history to a time when it was may be a viable way to copy-paste :) At the end of the day the code is quite straightforward to do TBH, you can probably copy-paste the code used to aggregate across shards inside of the shards pkg.
I'll take a look 🤝
I'm interested to learn more about what problems you are observing here. Do you have examples of repositories with large number of shards and have found an issue around deletion? At the end of the day though we have thought a bit about introducing more synchronization between indexer and webserver, but have always just decided to keep it simple.
It started with this incident https://gitlab.com/gitlab-com/gl-infra/production/-/issues/17770. We had to reallocate a lot of repositories from one Zoekt node to another. That generated a ton of deletions. Each deletion needed to scan through more than 500K index files (filepath.Glob), which was quite slow and resource intensive.
We have some additional details like cloud profiling graphs from the time of the incident in this MR's description https://gitlab.com/gitlab-org/gitlab-zoekt-indexer/-/merge_requests/173. This MR should alleviate the pain, so in general I'd say that this problem is going to have lower priority after this fix is merged.
I'm pretty sure this problem only affects gitlab-zoekt-indexer since zoekt-sourcegraph-indexserver performs deletions/cleanup differently. We might also pursue other optimization avenues instead of adding subdirectories support.
@dgruzd @keegancsmith what's the latest on this PR? I'm doing some "spring cleaning" and going through all open PRs :)
@jtibshirani I’m inclined to close the PR, but let’s wait for @keegancsmith’s response 🤝.
@dgruzd I think we're good to close this out! Feel free to reopen or file a new issue if something changes.