feat: Provide support for batch scraping
Hey Tom, hope all is well!
I've attached this diagram to illustrate my proposed change.
How to Batch?
Scrape jobs are currently defined per <resource, metric>. I'm taking the simplest approach: batch together resources that are looking for the same metric.
The maximum batch size will be a configurable parameter under runtime configuration.
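To make the grouping concrete, here's a minimal sketch of the idea (the type and member names are made up for illustration and don't match Promitor's actual scrape-definition classes):

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical, simplified scrape definition; Promitor's real classes carry more
// dimensions (subscription, aggregation, etc.) that batching would also key on.
public record ScrapeDefinition(string ResourceId, string MetricName);

public static class ScrapeBatcher
{
    // Group definitions that ask for the same metric, then split each group into
    // batches no larger than the configured maximum batch size.
    public static IEnumerable<List<ScrapeDefinition>> Batch(
        IEnumerable<ScrapeDefinition> definitions, int maxBatchSize)
    {
        foreach (var group in definitions.GroupBy(d => d.MetricName))
        {
            foreach (var chunk in group.Chunk(maxBatchSize))
            {
                yield return chunk.ToList();
            }
        }
    }
}
```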
Batch Job Execution Flow
Changes to Internal Metrics
Internal Promitor metrics will need to be modified under batch scraping mode.
- The `promitor_scrape_success` and `promitor_scrape_error` metrics should get an additional `batched` tag to indicate the mode of operation (0/1).
- A histogram metric `promitor_batch_size` should be added to track actual batch sizes.
- The `promitor_ratelimit_arm_throttled`/`promitor_ratelimit_arm` metrics will no longer be published, since batch scraping does not go through ARM.
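For illustration, here's roughly what those changes could look like, sketched with prometheus-net-style calls (Promitor publishes its runtime metrics through its own sink abstraction, so this is an assumption about shape rather than actual implementation code):

```csharp
using Prometheus;

// Sketch only: label names other than "batched" are illustrative.
public static class BatchScrapeMetrics
{
    // Existing counters gain a "batched" label (0/1) to indicate the mode of operation.
    private static readonly Counter ScrapeSuccess = Metrics.CreateCounter(
        "promitor_scrape_success", "Number of successful scrapes", "metric_name", "batched");

    private static readonly Counter ScrapeError = Metrics.CreateCounter(
        "promitor_scrape_error", "Number of failed scrapes", "metric_name", "batched");

    // New histogram tracking how many resources actually ended up in each batch.
    private static readonly Histogram BatchSize = Metrics.CreateHistogram(
        "promitor_batch_size", "Number of resources per batch scrape");

    public static void RecordBatch(string metricName, int resourceCount, bool succeeded)
    {
        BatchSize.Observe(resourceCount);
        var counter = succeeded ? ScrapeSuccess : ScrapeError;
        counter.WithLabels(metricName, "1").Inc();
    }
}
```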
Other considerations
Batch scraping will not be free anymore, but it comes with other benefits (like no longer being subject to ARM throttling). This should be documented for all users.
Implements #2284
Thank you for your contribution! 🙏 We will review it as soon as possible.
Thanks! Is it safe to say that the top of the diagram is current and bottom shows these changes?
Hey, I'm revisiting this PR now since the SDK migration is almost across the finish line. Do you know if the batch scraping API is available in sovereign clouds and what URLs we should use? We are particularly interested in the US Gov cloud. I can't find any information online (Azure doesn't really mention the existence of sovereign clouds to begin with).
I do not know, sorry! What is the URL for public cloud?
- US Government cloud: `.metrics.monitor.azure.us` - Token audience: https://metrics.monitor.azure.us
- China cloud: `.metrics.monitor.azure.cn` - Token audience: https://metrics.monitor.azure.cn
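For what it's worth, pointing the SDK's batch client at one of these endpoints would look roughly like this (a sketch assuming `Azure.Monitor.Query`'s `MetricsClient`; the region prefix is a made-up example, and depending on the SDK version the token audience may also need to be set explicitly via client options):

```csharp
using System;
using Azure.Identity;
using Azure.Monitor.Query;

// Sketch only: target the US Government cloud batch endpoint.
// "usgovvirginia" is an illustrative region prefix - use the region your resources are in.
// Token audience for this cloud: https://metrics.monitor.azure.us
var credential = new DefaultAzureCredential();
var batchClient = new MetricsClient(
    new Uri("https://usgovvirginia.metrics.monitor.azure.us"),
    credential);
```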
Awesome! How did you find this out?
I reached out to the team :D
@hkfgo Do you know if you'll have time to finalize this PR next week? I'd like to cut a release next week.
We should plan for batch scraping to be in the release after this coming one. There is still a decent amount of work left, and I'm not able to work on this full time.
I do have time to get the SDK migration done, though. In that case we should have something major-ish to release :)
That's OK - Thank you!
@hkfgo Any thoughts on this PR? Are you able to continue working on it?
Hey I got caught up in some other work and took a few weeks PTO. I'm back working on it now :)
Any thoughts on this btw: https://github.com/tomkerkhove/promitor/pull/2459#discussion_r1619981136?
Hey Tom, since you have a direct line to MSFT people, I have an issue w.r.t. response data modeling in `Azure.Monitor.Query` that you could potentially help with :)
As you may know, the response for a batch query is, at the highest level, separated into a list where each item contains the metric results associated with a unique resource. For each item, `resourceid` is included as a JSON attribute when the request is made through plain HTTP/REST. E.g. `"resourceid": "/subscriptions/be4fa4c1-8513-4755-95f9-a8627814ca36/resourceGroups/us2-main-ca-rto-messaging/providers/Microsoft.Cache/Redis/axon-messaging-us2"`
The issue I have now is that this attribute is missing in the most recent `Azure.Monitor.Query` release. See their data model class.
This is not a blocker, I believe, since the resource ID can still be parsed out of the `id` attribute of the `MetricResult`, i.e. the resource ID is a substring of the result ID: `"id": "/subscriptions/be4fa4c1-8513-4755-95f9-a8627814ca36/resourceGroups/us2-main-ca-rto-messaging/providers/Microsoft.Cache/Redis/axon-messaging-us2/providers/Microsoft.Insights/metrics/errors"`
But such string processing is not ideal, of course.
I'll open a GitHub issue too once I get around to it
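A rough sketch of that substring workaround (the class and method names are made up for illustration):

```csharp
using System;

// Sketch only: recover the resource ID by cutting the result id at the
// "/providers/Microsoft.Insights/metrics" suffix.
public static class MetricResultIdParser
{
    private const string MetricsSuffix = "/providers/Microsoft.Insights/metrics";

    public static string? ExtractResourceId(string metricResultId)
    {
        var suffixIndex = metricResultId.IndexOf(MetricsSuffix, StringComparison.OrdinalIgnoreCase);

        // Fall back to null when the id does not have the expected shape.
        return suffixIndex > 0 ? metricResultId[..suffixIndex] : null;
    }
}
```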
I got some good news to share :) I observed, in the CI pipeline, the batch scrape jobs working end-to-end. That included grouping into batches, job dispatch, and results processing. While there are still a few things to work out, I think what I have now is ready for another look! Would you mind doing another round of review?
Since this is almost an XL-sized PR, I've attached another diagram to help you (and myself) understand the execution flow of these batch jobs.
Also, here are some work items that I've excluded from this PR in order to keep its size manageable. I believe it makes more sense to do these as follow-ups. Let me know if you agree:
- [x] Deal with potential JSON data volume issues with multi-threaded result processing. Per your comment here: https://github.com/tomkerkhove/promitor/pull/2459#discussion_r1551478172
- [x] Support for multi-region batch scraping. Since the batch endpoint is now regional (as opposed to global to the cloud), we need to do additional work on Promitor's end to support use cases where resources are spread across regions.
- [x] Support batching for Log Analytics - batch scraping also exists for Log Analytics, but I want to keep this PR to metrics only.
- [x] Modify CI to test both single-resource and batch scraping - I think it makes sense to add another runtime config for batch scraping. This way we can test both single-resource and batch scraping in the CI pipeline.
Sorry for the delay, I have been out sick & traveling. Will get back to this asap
Agreed!
Is the goal to have it merged or is it just WIP? I'm asking because you mention some things are done in a separate PR, so I want to set the expectations right.
If you want to merge it, please publish the PR
Right, to be clear, the goal is to merge this PR and then do all the things I mentioned as follow-ups. I'll create separate issues to track them. I've taken the "draft" label off.
Thanks, I'll check tomorrow!
Added a few comments but generally looks good and trust you have tested this properly!
Would you mind opening a PR for our docs on https://github.com/promitor/docs as well please?
I have indeed done a fair amount of testing. In CI, I configured the scraper to run in batch scraping mode and it appeared to scrape metrics identical to single-resource scraping through ARM. I've also tested this in our staging environment, with Azure Redis and Azure SQL deployments. The transition from single-resource scraping to batched scraping was fairly seamless, and the new histogram metric to track batch size appears to work as well.
I could use some help getting the CI pipeline unblocked, btw. As you may have noticed, it's broken for both the scraper and the resource discovery agent. The resource discovery tests appear to be discovering more resources than expected:
[xUnit.net 00:00:06.39] Promitor.Tests.Integration.Services.ResourceDiscovery.ResourceDiscoveryTests.ResourceDiscoveryV1_GetWithFilterOnTwoRegions_ReturnsExpectedAmount [FAIL]
Failed Promitor.Tests.Integration.Services.ResourceDiscovery.ResourceDiscoveryTests.ResourceDiscoveryV2_GetWithFilterOnMultipleValueInstanceTag_ReturnsExpectedAmount [13 ms]
Error Message:
Assert.Equal() Failure: Values differ
Expected: 405
Actual: 409
Can you quickly check if there's a real change in number of logic app instances? If so we may need to update those tests
The Scraper agent is failing too, unfortunately :/ The timing (plus the fact that only the OpenTelemetry Collector tests are failing) coincided with a [new release](https://github.com/open-telemetry/opentelemetry-collector-releases/releases/tag/v0.104.0). Those tests worked when I reverted to an older version (0.103.0). I'm playing around with different configs to make the tests work with the latest version.
One more thing - if we merge this PR, I'd have to run the CI in single-resource scraping mode. This is because some resources are in different Azure regions and batch scraping only works within a single region right now. Multi-region support is a follow-up I'll take on later (I'll create an issue for it).
If that sounds good to you, let me clean up the code and address your comments. Looking forward to getting this merged soon :)
Sorry for the delay, checking now!
Yeah, valid point - I have opened https://github.com/tomkerkhove/promitor/pull/2552
Alright, I've cleaned up the code and addressed all comments! This PR should be in a mergeable state now, apart from the Scraper CI still failing. I created an issue to track that. I think the tests are just flaky, since you can see them passing on recent runs: https://dev.azure.com/tomkerkhove/Promitor/_build?definitionId=50&_a=summary
Oh, and I'm working on the public documentation at the moment. The PR for that will be coming soon.
Great! We're ready to go, let me know when the CI is passing 👌
I think this is okay to merge as is, since we are pretty sure now that the reason for the failure is test flakiness. The only failing part of CI is the Scraper OTEL Collector tests on Linux (the Windows image built just fine). I don't mind retrying every now and then, but it might be better to fix the flakiness through a separate PR. Wdyt?
Oh hey, the CI passed after a few retries.