ensure SLO for server availability

Open akosyakov opened this issue 3 years ago • 3 comments

Please consider implementing request duration and failure rate metrics for the OpenVSX server to ensure availability. In our experience, RED (rate, errors, duration) metrics are a good fit for this. @amvanbaren suggested using spring-metrics to collect data for Prometheus.
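To make the request concrete, here is a minimal sketch of what RED metrics capture for a single endpoint. It is plain Python, not the OpenVSX server code and not spring-metrics; the `RedMetrics` class, its method names, and the choice to count only 5xx responses as failures are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass, field


@dataclass
class RedMetrics:
    """Hypothetical in-memory RED (rate, errors, duration) recorder."""

    durations_s: list = field(default_factory=list)
    failures: int = 0

    def record(self, duration_s: float, status: int) -> None:
        # Assumption: only 5xx responses count as server-side failures.
        self.durations_s.append(duration_s)
        if status >= 500:
            self.failures += 1

    def rate(self, window_s: float) -> float:
        # Rate: requests per second over the observation window.
        return len(self.durations_s) / window_s

    def error_ratio(self) -> float:
        # Errors: fraction of recorded requests that failed.
        total = len(self.durations_s)
        return self.failures / total if total else 0.0

    def duration_quantile(self, q: float) -> float:
        # Duration: nearest-rank quantile of the recorded latencies.
        if not self.durations_s:
            return 0.0
        ordered = sorted(self.durations_s)
        rank = max(1, math.ceil(q * len(ordered)))
        return ordered[rank - 1]


metrics = RedMetrics()
for duration_s, status in [(0.1, 200), (0.2, 200), (0.3, 500), (0.4, 200)]:
    metrics.record(duration_s, status)
```

In a real deployment these values would be exported as counters and histograms and scraped by Prometheus rather than kept in a list, so that alerts can fire on the error ratio or on a latency quantile.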

At Gitpod we rely on OpenVSX server responsiveness while users are starting workspaces. If a request to OpenVSX fails, the workspace is mostly unusable, since the VS Code frontend times out after 1 minute. We have been working toward an SLO of 99% extension availability and built a caching proxy that allows us to serve 70%-90% of requests for 3 days while OpenVSX is down.
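As a rough illustration of how the numbers above interact, the sketch below estimates how long an upstream outage can last before a 99% SLO is exhausted when the proxy still serves part of the traffic. The 30-day window, the uniform traffic, and the function name `tolerable_outage_hours` are assumptions for this example, not figures from Gitpod.

```python
def tolerable_outage_hours(slo: float, cache_hit_ratio: float,
                           window_days: int = 30) -> float:
    """Hours of upstream downtime that fit in the error budget.

    Assumes traffic is uniform over the window and that every request
    the cache cannot serve during the outage counts as a failure.
    """
    # Error budget expressed as hours of total unavailability.
    budget_hours = (1.0 - slo) * window_days * 24
    # Only cache misses fail while the upstream is down.
    miss_ratio = 1.0 - cache_hit_ratio
    return budget_hours / miss_ratio


low = tolerable_outage_hours(slo=0.99, cache_hit_ratio=0.70)
high = tolerable_outage_hours(slo=0.99, cache_hit_ratio=0.90)
print(f"70% cache hits: about {low:.0f} h, 90% cache hits: about {high:.0f} h")
```

Under these assumptions the budget covers roughly one day of upstream downtime at a 70% hit ratio and roughly three days at 90%, so the time it takes for an outage to be noticed consumes a large share of the budget.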

That is not enough to achieve the goal, though. We need to ensure that an issue gets recognised and addressed in OpenVSX itself before users notice it. In the past this was not the case: https://www.eclipsestatus.io/ usually did not get updated until some Gitpod user pinged us and we reached out to @eclipsewebmaster. By that point we usually already have a full-blown incident. Unfortunately, it is tricky for us to figure out from the proxy whether there is a real issue upstream, since we are not the only client and a request failure can be caused by the proxy itself. The OpenVSX server looks like the proper place to address the issue.

akosyakov avatar Oct 26 '21 07:10 akosyakov

I believe you filed this before many server updates were performed. Since then, service availability and responsiveness have greatly improved, but there is still a lot of work to be done, as open-vsx consumes an inordinate amount of bandwidth. I will file an issue for that.

eclipsewebmaster avatar Nov 29 '21 16:11 eclipsewebmaster

> I believe you filed this before many server updates were performed. Since then, service availability and responsiveness have greatly improved, but there is still a lot of work to be done, as open-vsx consumes an inordinate amount of bandwidth. I will file an issue for that.

The performance improvements are great. But this issue is about the Eclipse team being able to recognise that an incident is happening. In the past that was never the case. It is even alright for us if it takes a day or two to resolve an incident, but it should be noticed before users notice it.

akosyakov avatar Nov 30 '21 12:11 akosyakov

Related PR: https://github.com/eclipse/openvsx/pull/667

amvanbaren avatar Jan 31 '23 10:01 amvanbaren