Observability and alerting infrastructure
To-do's:
- [ ] Identify what services we need to set up for this (& who will fund them)
- [ ] Implement
Any alerting system should pipe to some shared community resource, e.g. perhaps the Steering Committee Discord.
We'll rely on volunteer maintainers to review alerts as available. Because guidance to consumers of the API is to do a daily ETL, we don't need a particularly rigorous SLA or on-call rotation.
From a services POV, here are the service components that we need:
- An HTTP web server that hosts the Go server code
- A MongoDB database (or some other equivalent DB/storage host)
- An internal cache (either Memcache or Redis or equivalent?)
- ?
@sridharavinash for this ticket, I was more so referring to things like:
- [ ] What service(s) will we use to capture and expose our application's logs?
- [ ] What service(s) will we use to surface anomalies, and where will such anomalies get raised?
What you are laying out is probably better fit for this ticket: https://github.com/modelcontextprotocol/registry/issues/91
ah yeah you're right, I got my issues cross-wired :D!
What service(s) will we use to capture and expose our application's logs?
I wonder if whatever hosting option we end up choosing will dictate some of these?
@tadasant @sridharavinash This issue belongs to a domain I work with closely. I have a few questions to understand it better.
- Application logs are mentioned as one of the observability signals: are we looking to capture mostly server (registry process) issues? Logs work great for such cases, but not for all use cases; log volume blows up quickly with throughput, and logs are one of the most expensive telemetry signals after traces.
  - Logs will primarily help to find the root cause of any issue in the server.
- Are there any plans to expose metrics? That would give overall server-level telemetry, which might be very useful for clients to monitor, to trigger/schedule periodic runs based on metric values, and to make product decisions accordingly. Metrics are the cheapest signal and might add value for the use cases the registry is trying to solve.
  - From the point of view of spotting anomalies, whether as a consumer or a maintainer, metrics can be used to isolate the impact of any bigger issue.
Just trying to understand and learn; if there is any gap in my thought process, I am more than happy to understand this better.
We haven't made plans on the scope of what we plan to capture and store here. My basic view is that we want some logging system and some ability to surface errors and custom alerts, ultimately piped into the MCP Contributor Discord, just so we know that things aren't breaking in production.
The idea of other metrics (e.g. to measure feature usage) would be nice but not a p0.
@tadasant makes sense. I would be more than happy to take care of implementation part of it, if it is okay.
Would probably be a fan of the stack:
- OpenTelemetry metrics and logs collection
- Ship these to Prometheus + Grafana + Loki
Although really I'm wondering how crucial this is. A scrappy version we could get away with might be:
- Stdout logs, which will be automatically picked up by K8s.
- Maintainers use `kubectl logs -f <pod>` as a break glass if we need to look at the logs. But this should be very rare.
- If/when we need more monitoring then we expand this - feels like something that's fairly straightforward to add later on.
I think a scrappy version of logs like you are suggesting is fine for launch, agree.
I do think we should have some sort of alerting system in place; just don't want to be in a situation where we think we're live and working but reality is, everyone's hitting errors.
> I do think we should have some sort of alerting system in place; just don't want to be in a situation where we think we're live and working but reality is, everyone's hitting errors.
The really jank version of this is angry people on Discord/GitHub doing this for us 😄
But yeah I think some basic API success rate monitoring is probably good. Maybe @pree-dew can pick up doing this with OpenTelemetry as it sounded like they were interested in helping here and have OT experience? If they have capacity to do this in say the next ~week.
@domdomegg Happy to take this up, I am a fan of OpenTelemetry :) Just reiterating my understanding here:
- We want to start with basic logs first
- Have alerting focused on some high level metrics like API success rate?
@pree-dew yep I think that's correct. I think you'd be primarily focusing on item 2 there, of exporting out some of those high level metrics (feel free to also think through what metrics we need, happy to discuss here :)).
@domdomegg I went through the codebase; I think to start with we can have standard RED metrics.
- For the publish, list, and getAll endpoints, we can start with throughput, error rate, and duration.
- We can have a standard `up`-like metric, highlighting the status of the service.
If we are okay with the above metrics, I will add the details of the corresponding attributes/labels and the metric type (gauge/counter/histogram) for each before implementation, for review (rough sketch below).
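For discussion, a rough sketch of how these instruments might be declared with the opentelemetry-go metric API (a recent SDK version is assumed; the metric names, buckets, and the `Metrics` struct are placeholders, not the final implementation):

```go
package telemetry

import (
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/metric"
)

// Metrics bundles the RED instruments plus a service-up gauge.
type Metrics struct {
	Requests metric.Int64Counter     // throughput per endpoint
	Errors   metric.Int64Counter     // 4xx/5xx responses per endpoint
	Duration metric.Float64Histogram // request latency in seconds
	Up       metric.Int64Gauge       // 1 = healthy, 0 = down
}

// NewMetrics creates the instruments on the globally configured meter provider.
func NewMetrics() (*Metrics, error) {
	meter := otel.Meter("mcp-registry")

	requests, err := meter.Int64Counter("http_requests_total",
		metric.WithDescription("Total HTTP requests"))
	if err != nil {
		return nil, err
	}
	errCount, err := meter.Int64Counter("http_errors_total",
		metric.WithDescription("HTTP responses with status >= 400"))
	if err != nil {
		return nil, err
	}
	duration, err := meter.Float64Histogram("http_request_duration_seconds",
		metric.WithDescription("HTTP request latency"),
		metric.WithExplicitBucketBoundaries(0.005, 0.05, 0.5, 2, 10, 50)) // placeholder buckets
	if err != nil {
		return nil, err
	}
	up, err := meter.Int64Gauge("service_up",
		metric.WithDescription("1 if the service is up"))
	if err != nil {
		return nil, err
	}
	return &Metrics{Requests: requests, Errors: errCount, Duration: duration, Up: up}, nil
}
```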
A few questions:
- If we need duration, what kind of SLA are we looking at? That will help me decide the right duration buckets (I want to start with fewer buckets).
- Related to design:
  - Approach 1: Follow the approach used for auth, registry, and config: add another component called `telemetry`, pass it to `RegisterV0Routes`, and pass telemetry to the corresponding handler to record the metrics.
  - Approach 2: Have a separate struct that holds common utils like telemetry (metrics and logs) etc.: define handlers as methods on the struct, so each handler has access to all common utils.
Which approach would you prefer? Open to any other approach as well.
Looking forward to your feedback.
@pree-dew yep I'm happy with those metrics, they sound good. To your questions:
- I don't think we have a strict SLA here, and I think we can be fairly lax. My guess is that things are going wrong if: read latency is >2s or publish latency >10s. But also happy for you to choose whatever numbers you think are sensible here - I presume it's pretty easy to change them up later?
- I think approach 1 is probably good. Ideally we could actually use some kind of middleware to collect the RED metrics, so we don't have boilerplate in every API function. And just to confirm, ultimately these would get exposed as prometheus or OpenTelemetry compatible metrics?
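To illustrate the middleware idea, roughly something like this (a sketch only; it assumes instruments along the lines of the `Metrics` struct sketched above and plain net/http handlers, not a prescription for the actual implementation):

```go
package telemetry

import (
	"net/http"
	"time"

	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

// REDMiddleware wraps an http.Handler and records request count, error
// count, and duration for every call, so individual API handlers need
// no metrics boilerplate.
func REDMiddleware(m *Metrics, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
		next.ServeHTTP(rec, r)

		// Note: a real implementation should label by route template rather
		// than raw path to keep label cardinality bounded.
		attrs := metric.WithAttributes(
			attribute.String("method", r.Method),
			attribute.String("path", r.URL.Path),
			attribute.Int("status", rec.status),
		)
		m.Requests.Add(r.Context(), 1, attrs)
		m.Duration.Record(r.Context(), time.Since(start).Seconds(), attrs)
		if rec.status >= 400 {
			m.Errors.Add(r.Context(), 1, attrs)
		}
	})
}

// statusRecorder captures the status code written by the wrapped handler.
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (s *statusRecorder) WriteHeader(code int) {
	s.status = code
	s.ResponseWriter.WriteHeader(code)
}
```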
@domdomegg Makes sense.
- Yes, we can change it later. Since it is used for internal alerting, we can tweak the alerts to match the buckets if we change them in future. We do have to make sure the latency (duration) metric's buckets cover both endpoints' requirements, since buckets are configurable at the metric level.
- Yes, I will instrument it using the opentelemetry-go SDK with the Prometheus exporter, exposed for pull-based scraping on `/metrics` (rough sketch of the wiring below). If the labels required for each API are the same, then we can definitely try the middleware approach (I will evaluate it more). I will start with that approach and try to keep it simple. :)
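A minimal sketch of that wiring, assuming the otel Prometheus exporter registers with the default Prometheus registerer (exact options and the port are illustrative and to be confirmed during implementation):

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/prometheus"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
)

// setupTelemetry wires the OTel meter provider to a Prometheus exporter
// and exposes the metrics for pull-based scraping.
func setupTelemetry() error {
	// The exporter registers an OTel-backed collector with the default
	// Prometheus registerer, so promhttp.Handler() can serve it.
	exporter, err := prometheus.New()
	if err != nil {
		return err
	}
	provider := sdkmetric.NewMeterProvider(sdkmetric.WithReader(exporter))
	otel.SetMeterProvider(provider)

	http.Handle("/metrics", promhttp.Handler())
	return nil
}

func main() {
	if err := setupTelemetry(); err != nil {
		log.Fatal(err)
	}
	log.Fatal(http.ListenAndServe(":8080", nil)) // port is illustrative
}
```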
Happy to accommodate any feedback.
@pree-dew that sounds excellent, keen for you to raise a PR and tag me for a review :)
@domdomegg Pushed changes here https://github.com/modelcontextprotocol/registry/pull/290, can you please review? Happy to improve it.
Awesome, have merged that PR. I think the next step is asking 'now we're collecting metrics, how do we turn that into actionable alerts'.
Possibly we want to set up a Prometheus/Grafana/Alertmanager/Loki stack? (Should be fairly easy with our deploy infra; I've done something almost identical elsewhere, in roughly the same way we could here.) Or maybe there's something more lightweight/preferable?
@pree-dew do you have thoughts on what might be the best solution for us here? I think you have a lot of valuable experience thinking about observability and alerting :)
Also maybe @claude, what would your suggestions be for how we should set up alerts? Read the docs and related issues to get context on the project to tailor your response, and highlight the important trade-offs/decision criteria you think we should consider.
Claude finished @domdomegg's task
I'll analyze the observability setup and provide alerting recommendations for the MCP Registry project.
Todo List
- [x] Review current observability implementation from PR #290
- [x] Analyze project constraints and requirements
- [x] Research alerting stack options
- [x] Provide recommendations with trade-offs
Current Observability State
Looking at the merged PR #290, you already have solid OpenTelemetry metrics instrumentation in place:
✅ What's Already Implemented:
- OpenTelemetry metrics with Prometheus exporter exposed at the `/metrics` endpoint
- RED metrics (Rate, Error, Duration) for HTTP endpoints:
  - `mcp_registry_http_requests_total` - request counter with method/path/status attributes
  - `mcp_registry_http_request_duration` - duration histogram with sensible buckets (5ms to 50s)
  - `mcp_registry_http_errors_total` - error counter (4xx/5xx responses)
  - `mcp_registry_service_up` - service health gauge
- Middleware automatically instruments all API endpoints except health/metrics/ping/docs
- Clean separation of telemetry concerns in `internal/telemetry/`
Alerting Solution Recommendations
Given your project constraints (volunteer maintainers, no strict SLA, community-driven), here are the options ranked by complexity vs value:
Option 1: Lightweight Discord Webhook Alerts (Recommended)
What it is: Simple monitoring service that scrapes /metrics and sends Discord webhooks for critical issues.
Pros:
- Minimal infrastructure overhead
- Direct integration with MCP Contributor Discord
- Can be deployed as a sidecar container in your existing K8s setup
- Very low cost/maintenance
- Easy to iterate on alert rules
Cons:
- Limited alerting sophistication
- No fancy dashboards or historical analysis
- Requires custom monitoring service
Implementation approach:
- Small Go service that periodically scrapes the `/metrics` endpoint (rough sketch below)
- Simple threshold-based alerting (error rate > X%, latency > Y seconds, service down)
- Discord webhook integration for notifications
- Deploy as sidecar in your existing Pulumi K8s setup
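A minimal sketch of such a poller (the metrics URL, webhook URL, and polling interval are placeholders; real code would want retries, more alert rules, and de-duplication of repeated notifications):

```go
package main

import (
	"bytes"
	"encoding/json"
	"log"
	"net/http"
	"time"

	"github.com/prometheus/common/expfmt"
)

const (
	metricsURL = "http://registry:8080/metrics"         // placeholder
	webhookURL = "https://discord.com/api/webhooks/..." // placeholder
)

func main() {
	for range time.Tick(1 * time.Minute) {
		if up, err := scrapeServiceUp(); err != nil || up == 0 {
			notify("mcp_registry_service_up is 0 (or /metrics is unreachable)")
		}
	}
}

// scrapeServiceUp fetches /metrics and returns the service-up gauge value.
func scrapeServiceUp() (float64, error) {
	resp, err := http.Get(metricsURL)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()

	families, err := (&expfmt.TextParser{}).TextToMetricFamilies(resp.Body)
	if err != nil {
		return 0, err
	}
	fam, ok := families["mcp_registry_service_up"]
	if !ok || len(fam.GetMetric()) == 0 {
		return 0, nil
	}
	return fam.GetMetric()[0].GetGauge().GetValue(), nil
}

// notify posts a simple message to a Discord webhook.
func notify(msg string) {
	body, _ := json.Marshal(map[string]string{"content": msg})
	if _, err := http.Post(webhookURL, "application/json", bytes.NewReader(body)); err != nil {
		log.Printf("failed to send Discord alert: %v", err)
	}
}
```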
Option 2: Prometheus + Grafana + AlertManager Stack
What it is: Full observability stack as @domdomegg suggested, similar to the bluedot setup.
Pros:
- Industry standard, battle-tested
- Rich alerting capabilities and dashboard visualization
- Good for long-term growth if you want detailed analytics
- AlertManager can route to Discord via webhooks
Cons:
- More infrastructure to maintain (3+ additional services)
- Higher resource usage and complexity
- Overkill for current needs
- Requires Prometheus expertise for rule configuration
Implementation approach:
- Add to your existing Pulumi deployment
- Prometheus scrapes the `/metrics` endpoint
- Grafana for dashboards (optional initially)
- AlertManager with Discord webhook receiver
Option 3: Managed Solutions (Pragmatic Middle Ground)
What it is: Use managed monitoring services with free tiers.
Examples:
- Grafana Cloud (free tier: 10k metrics, 50GB logs)
- Prometheus monitoring services (various providers)
- UptimeRobot for basic uptime monitoring
Pros:
- No infrastructure maintenance
- Professional alerting capabilities
- Often have generous free tiers
- Can still route to Discord
Cons:
- External dependency
- May hit limits as you scale
- Less customization
My Recommendation: Start with Option 1
For the MCP Registry's current stage, I'd recommend Option 1 (lightweight Discord webhook alerts) because:
- Matches your constraints: Simple, low-maintenance, volunteer-friendly
- Leverages existing work: Your metrics are already perfect for this approach
- Quick to implement: Could be done in a day or two
- Easy to evolve: Can migrate to Option 2 later if needed
Sample Alert Rules to Start With
Based on your current metrics, here are suggested initial alerts:
```
# Critical alerts (immediate Discord notification)
- Service Down: mcp_registry_service_up == 0
- High Error Rate: (rate(mcp_registry_http_errors_total[5m]) / rate(mcp_registry_http_requests_total[5m])) > 0.1
- High Latency: histogram_quantile(0.95, rate(mcp_registry_http_request_duration_bucket[5m])) > 10

# Warning alerts (batched notifications)
- Moderate Error Rate: error rate > 0.05
- Elevated Latency: p95 latency > 2s
- Low Request Volume: might indicate issues
```
Would you like me to elaborate on any of these approaches or provide more specific implementation details for the recommended lightweight option?
@domdomegg I am more than happy to take care of the remaining part of this issue. I would recommend these options:
- VictoriaMetrics + vmalert (if a dashboard is not required)
- VictoriaMetrics + Grafana (Grafana alerts + dashboard)
- Last9 https://last9.io/
The above suggestions cover only metrics; if we also want logs, I can come up with a combined logs-and-metrics solution. Last9, though, supports metrics, logs, and traces.
All of the above are cost-effective options. Last9 is a managed service rather than self-hosted, but it is bare-minimum in terms of cost; the others can also run self-hosted using the open-source version of each.
@domdomegg let me know your thoughts on this; I can calibrate further based on your suggestions/feedback.
@domdomegg feel free to tag me or assign any issue wherever you need help, I will be happy to contribute :)
tldr: If you're up for it, I think I'd be happy for you to go ahead with option 2!
Awesome, thanks @pree-dew.
From my understanding it sounds like Victoriametrics is basically a more efficient prometheus, is that how you see it? I think this is a good idea.
Regarding vmalert vs grafana, I think I'd maybe prefer grafana. I think if we are collecting the metrics, we will want a way to visualise them reasonably easily. And then if we have grafana visualising them, using grafana alerting integrates a lot better with e.g. alerts showing on graphs more obviously.
On last9, I think this looks interesting! Especially with ability to also handle logs and traces. However in the interests of enabling fully open-source maintenance and promoting the use of vendor-neutral open-source tools etc. I think I'd lean towards one of the open-source options.
On logs in general, I think we get some of this for free via GitHub actions when it deploys services (where most failures will happen I think), because Pulumi shows the k8s logs for failures. So I think this is less urgent. But perhaps we could explore that as well at some point in the future - I've used Loki before and found it worked quite well and integrated with Grafana easily.
Also as a pointer for where to start, I believe you should be able to do this in the deploy folder. The README there also has instructions for spinning up an entire cluster locally, so you can reproduce everything. My suggestion would be to look at how other services like cert-manager or ingress-nginx are set up, then do something similar - i.e. find a helm chart, install it with Pulumi. If you need a database etc. for grafana you can create a postgres database in postgres.go based on this branch, and you can see how registry.go uses this there too.
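To make the 'find a helm chart, install it with Pulumi' part concrete, here's a rough sketch using the pulumi-kubernetes helm Release resource (the chart name, repo URL, namespace, and values are illustrative and would need checking against the actual chart and our deploy setup):

```go
package main

import (
	helmv3 "github.com/pulumi/pulumi-kubernetes/sdk/v4/go/kubernetes/helm/v3"
	"github.com/pulumi/pulumi/sdk/v3/go/pulumi"
)

func main() {
	pulumi.Run(func(ctx *pulumi.Context) error {
		// Install single-node VictoriaMetrics via its Helm chart, the same
		// pattern used for cert-manager / ingress-nginx.
		_, err := helmv3.NewRelease(ctx, "victoria-metrics", &helmv3.ReleaseArgs{
			Chart:           pulumi.String("victoria-metrics-single"),
			Namespace:       pulumi.String("monitoring"),
			CreateNamespace: pulumi.Bool(true),
			RepositoryOpts: &helmv3.RepositoryOptsArgs{
				Repo: pulumi.String("https://victoriametrics.github.io/helm-charts"),
			},
			Values: pulumi.Map{
				"server": pulumi.Map{
					"retentionPeriod": pulumi.String("1"), // retention; value is illustrative
				},
			},
		})
		return err
	})
}
```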
Happy to pair on this or hop on a quick call if you get stuck. You can also ping me in the registry Discord channel. (Discord joining details) :)
@domdomegg sounds good. Yes, you are right: VictoriaMetrics is a more efficient Prometheus in terms of resources and performance. It is fully compatible with Prometheus and has a very rich set of PromQL functions/features.
I will go through the deploy folder (setup), try to understand how the new services can be integrated, and come back with questions.
A few things I think we should also consider in this task (I will take care of these):
- Resource requirements based on the number of time series generated, both for the main registry server and for the metric storage, estimated at peak cardinality. Since we don't have any unbounded labels in the time series, we should be able to handle the cardinality without issue. This matters because OpenTelemetry keeps all data in memory even when a time series is old.
- I will do an end-to-end run on my local machine, setting up a dummy alert to simulate a failure scenario before we define the real alerts; this will show whether anything is missing in the instrumentation or in the end-to-end alert workflow.
- Lastly, how the alerts and notification channels should be configured: via IaC or the UI.
Looking forward to working with you @domdomegg. Thank you for the opportunity :)
@domdomegg while working on the VictoriaMetrics setup I realised we need to handle target discovery. For vmagent we have to include target discovery pointing only at the registry containers; we might need a custom resource to specify the selector for targets. Any feedback on this?
@pree-dew Happy for you to propose what you think is best here! My guess is that it would be fine either to:
- Hardcode the address for registry in vmagent, given that's the key (only?) thing we need to measure
- Use some custom k8s resource to specify the selector
Maybe a slight preference to the former to keep things simple, but can see the extensibility benefits of the latter so happy with either!
@domdomegg a few updates: I went through different setups for VictoriaMetrics + vmagent. The following are the setups that I tried:
Setups:
1. Single-node VictoriaMetrics (built-in vmagent)
2. Single-node VictoriaMetrics + separate vmagent (deployment) + CRD for target discovery
3. Single-node VictoriaMetrics + separate vmagent (deployment) + scrape config set via helm chart value overrides, using a pod label as the source to identify targets
4. Single-node VictoriaMetrics + separate vmagent (daemonset)
5. Single-node VictoriaMetrics + vmoperator
Note: I have not selected the cluster version of VictoriaMetrics for now, as I don't see the need for that whole setup at this stage.
I analysed the above options against these points:
- Ease of deployment
- Ease of maintenance
- Modularity (no component should impact another while deploying)
- Resource requirements
- Most importantly, what is required at this stage of the project
I think setup 3 will be the best for us at this stage. Let me know if this looks okay to you. Meanwhile I will work on the resource requirements for the VictoriaMetrics storage part, and on the buffer, scrape config, etc. for vmagent.
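To sketch what I mean by setup 3, the vmagent chart values could look something like this, expressed as Pulumi values (the `config.scrape_configs` key and the registry pod label are assumptions to verify against the chart's values.yaml and the registry deployment):

```go
package deploy

import "github.com/pulumi/pulumi/sdk/v3/go/pulumi"

// vmagentScrapeValues returns illustrative Helm values for a vmagent
// release: one scrape job using Kubernetes pod discovery, keeping only
// pods that carry the registry's app label.
func vmagentScrapeValues() pulumi.Map {
	return pulumi.Map{
		"config": pulumi.Map{
			"scrape_configs": pulumi.Array{
				pulumi.Map{
					"job_name": pulumi.String("mcp-registry"),
					"kubernetes_sd_configs": pulumi.Array{
						pulumi.Map{"role": pulumi.String("pod")},
					},
					"relabel_configs": pulumi.Array{
						pulumi.Map{
							"action":        pulumi.String("keep"),
							"source_labels": pulumi.Array{pulumi.String("__meta_kubernetes_pod_label_app")},
							"regex":         pulumi.String("mcp-registry"), // assumed pod label value
						},
					},
				},
			},
		},
	}
}
```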
@pree-dew awesome to hear!
My (not super well informed) opinion is that setup 1 + CRDs sounds intuitively about right to me, but I also haven't played around with this closely. If you've analysed this and come to the conclusion that setup 3 is better happy for you to go with that :)
Definitely agree that single node victoriametrics will be sufficient, and we don't need a cluster deployment or significant scaling needs right now.
@domdomegg Next set of changes are ready, raised a PR https://github.com/modelcontextprotocol/registry/pull/328/files
A few points related to the PR:
- The admin password is kept raw, as it will be accessed internally.
- Alerts are not configured yet; if the setup looks okay to you, then adding alerts and a notification channel will be the next target.
- The instance label is part of the metric labels as of now; I'm thinking of dropping it at the agent level before remote-writing to VictoriaMetrics.
Happy to accommodate your feedback; I have tested the setup, but will do a bit more testing to calibrate resource requirements.
@domdomegg can we move towards the next part of setting up alerts?