
feat: Provide support for batch scraping

Open hkfgo opened this issue 11 months ago • 16 comments

Hey Tom, hope all is well!

I've attached this diagram to illustrate my proposed change. (Attachment: Screenshot 2024-03-22 at 10 23 21 PM)

How to Batch?

Scrape jobs are currently defined per <resource, metric> pair. I'm taking the simplest approach: batch together resources that request the same metric.

Max batch size will be a configurable parameter, under runtime configuration.
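A rough sketch of this grouping (illustrative Python, not actual Promitor code; the names are hypothetical):

```python
def batch_scrape_definitions(definitions, max_batch_size):
    """Group (resource, metric) scrape definitions into batches.

    Resources that request the same metric are grouped together, then
    each group is split into chunks of at most max_batch_size (the
    configurable runtime parameter mentioned above).
    """
    groups = {}
    for resource, metric in definitions:
        groups.setdefault(metric, []).append(resource)

    batches = []
    for metric, resources in groups.items():
        for i in range(0, len(resources), max_batch_size):
            batches.append((metric, resources[i:i + max_batch_size]))
    return batches
```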

Batch Job Execution Flow

(Attachment: Screenshot 2024-08-21 at 5 37 25 PM)

Changes to Internal Metrics

Internal Promitor metrics will need to be modified under batch scraping mode:

  • The promitor_scrape_success and promitor_scrape_error metrics should be appended with an additional tag, batched (0/1), to indicate the mode of operation
  • A histogram metric, promitor_batch_size, should be added to track actual batch sizes
  • The promitor_ratelimit_arm_throttled and promitor_ratelimit_arm metrics will no longer be published, since batch scraping does not go through ARM

Other considerations

Batch scraping will no longer be free, but it comes with other benefits (like no longer being subject to ARM throttling). This should be documented for all users.

Implements #2284

hkfgo avatar Mar 23 '24 05:03 hkfgo

Thank you for your contribution! 🙏 We will review it as soon as possible.

github-actions[bot] avatar Mar 23 '24 05:03 github-actions[bot]

Thanks! Is it safe to say that the top of the diagram is current and bottom shows these changes?

tomkerkhove avatar Apr 04 '24 10:04 tomkerkhove

Hey, I'm revisiting this PR now since the SDK migration is almost across the finish line. Do you know if the batch scraping API is available in sovereign clouds, and which URLs we should use? We are particularly interested in the US Gov cloud. I can't find any information online (Azure doesn't really mention the existence of sovereign clouds to begin with).

hkfgo avatar May 09 '24 22:05 hkfgo

Do you know if the batch scraping API is available in sovereign clouds and what URLs we should use?

I do not know, sorry! What is the URL for public cloud?

tomkerkhove avatar May 16 '24 07:05 tomkerkhove

  • US Government cloud: .metrics.monitor.azure.us
    • Token audience: https://metrics.monitor.azure.us
  • China cloud: .metrics.monitor.azure.cn
    • Token audience: https://metrics.monitor.azure.cn
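For illustration, these could be captured in a small lookup keyed by cloud (a sketch; the cloud names and the public-cloud suffix metrics.monitor.azure.com are assumptions, not taken from this thread):

```python
# Hypothetical lookup of Azure Monitor metrics-batch endpoints per cloud.
# Sovereign-cloud entries come from the list above; the public-cloud
# suffix is an assumption.
ENDPOINT_SUFFIXES = {
    "AzureCloud": ("metrics.monitor.azure.com", "https://metrics.monitor.azure.com"),
    "AzureUSGovernment": ("metrics.monitor.azure.us", "https://metrics.monitor.azure.us"),
    "AzureChinaCloud": ("metrics.monitor.azure.cn", "https://metrics.monitor.azure.cn"),
}

def batch_endpoint(cloud, region):
    """Return (regional endpoint URL, token audience) for a cloud.

    The batch API endpoint is regional, hence the region prefix in
    front of the suffix.
    """
    suffix, audience = ENDPOINT_SUFFIXES[cloud]
    return f"https://{region}.{suffix}", audience
```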

tomkerkhove avatar May 21 '24 13:05 tomkerkhove

  • https://metrics.monitor.azure.cn

Awesome! How did you find this out?

hkfgo avatar May 30 '24 05:05 hkfgo

I reached out to the team :D

tomkerkhove avatar May 30 '24 07:05 tomkerkhove

@hkfgo Do you know if you'll have time to finalize this PR next week? I'd like to cut a release next week.

tomkerkhove avatar Jun 01 '24 06:06 tomkerkhove

@hkfgo Do you know if you'll have time to finalize this PR next week? I'd like to cut a release next week.

We should plan for batch scraping to be in the release after this coming one. There is still a decent amount of work left, and I'm not able to work on this full time.

I do have time to get the SDK migration done, though. In that case we should have something major-ish to release :)

hkfgo avatar Jun 01 '24 21:06 hkfgo

That's OK - Thank you!

tomkerkhove avatar Jun 03 '24 18:06 tomkerkhove

@hkfgo Any thoughts on this PR? Are you able to continue working on it?

tomkerkhove avatar Jul 15 '24 12:07 tomkerkhove

@hkfgo Any thoughts on this PR? Are you able to continue working on it?

Hey I got caught up in some other work and took a few weeks PTO. I'm back working on it now :)

hkfgo avatar Jul 15 '24 20:07 hkfgo

Any thoughts on this btw: https://github.com/tomkerkhove/promitor/pull/2459#discussion_r1619981136?

hkfgo avatar Jul 15 '24 20:07 hkfgo

Hey Tom, since you have a direct line to MSFT people, I have an issue w.r.t. response data modeling in Azure.Monitor.Query that you could potentially help with :)

As you may know, the response for a batch query is, at the highest level, separated into a list where each item contains the metric results associated with a unique resource. For each item, resourceid is included as a JSON attribute when the request is made through plain HTTP/REST, e.g. "resourceid": "/subscriptions/be4fa4c1-8513-4755-95f9-a8627814ca36/resourceGroups/us2-main-ca-rto-messaging/providers/Microsoft.Cache/Redis/axon-messaging-us2"

The issue I have now is that this attribute is missing from the most recent Azure.Monitor.Query release. See their data model class.

This is not a blocker, I believe, since the resource ID can still be parsed out of the id attribute of the MetricResult; i.e., the resource ID is a substring of the result ID: "id": "/subscriptions/be4fa4c1-8513-4755-95f9-a8627814ca36/resourceGroups/us2-main-ca-rto-messaging/providers/Microsoft.Cache/Redis/axon-messaging-us2/providers/Microsoft.Insights/metrics/errors". But such string processing is not ideal, of course.
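For what it's worth, the workaround is simple enough to sketch (a hypothetical helper in Python, not the Azure SDK API):

```python
def extract_resource_id(metric_result_id):
    """Parse the ARM resource ID out of a metric result id.

    The result id has the form
    <resource-id>/providers/Microsoft.Insights/metrics/<metric-name>,
    so everything before the trailing Microsoft.Insights segment is
    the resource ID.
    """
    marker = "/providers/Microsoft.Insights/metrics/"
    idx = metric_result_id.rfind(marker)
    if idx == -1:
        raise ValueError(f"unexpected metric result id: {metric_result_id}")
    return metric_result_id[:idx]
```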

I'll open a GitHub issue too once I get around to it

hkfgo avatar Aug 01 '24 23:08 hkfgo

I've got some good news to share :) I observed, in the CI pipeline, the batch scrape jobs working end-to-end. That included grouping into batches, job dispatch, and results processing. While there are still a few things to work out, I think what I have now is ready for another look! Would you mind doing another round of review?

Since this is almost an XL-sized PR, I've attached another diagram to help you (and myself) understand the execution flow of these batch jobs.

hkfgo avatar Aug 22 '24 00:08 hkfgo

Also, these are some work items that I've excluded from this PR, in order to keep its size manageable. I believe it makes more sense to do these as follow-ups. Let me know if you agree?

  • [x] Deal with potential JSON data volume issues with multi-threaded result processing. Per your comment here: https://github.com/tomkerkhove/promitor/pull/2459#discussion_r1551478172
  • [x] Support for multi-region batch scraping. Since the batch endpoint is now regional (as opposed to global per cloud), we need to do additional work on Promitor's end to support use cases where resources are spread across regions.
  • [x] Support batching for Log Analytics - batch scraping also exists for Log Analytics, but I want to keep this PR to metrics only.
  • [x] Modify CI to test both single-resource and batch scraping - I think it makes sense to add another runtime config for batch scraping. This way we can test both single-resource and batch scraping in the CI pipeline.

hkfgo avatar Aug 22 '24 00:08 hkfgo

Sorry for the delay, I have been out sick & traveling. Will get back to this asap

tomkerkhove avatar Sep 02 '24 09:09 tomkerkhove

Also, these are some work items that I've excluded from this PR, in order to keep its size manageable. I believe it makes more sense to do these as follow-ups. Let me know if you agree?

  • [ ] Deal with potential JSON data volume issues with multi-threaded result processing. Per your comment here: feat: Provide support for batch scraping #2459 (comment)
  • [ ] Support for multi-region batch scraping. Since the batch endpoint is now regional (as opposed to global per cloud), we need to do additional work on Promitor's end to support use cases where resources are spread across regions.
  • [ ] Support batching for Log Analytics - batch scraping also exists for Log Analytics, but I want to keep this PR to metrics only.
  • [ ] Modify CI to test both single-resource and batch scraping - I think it makes sense to add another runtime config for batch scraping. This way we can test both single-resource and batch scraping in the CI pipeline.

Agreed!

tomkerkhove avatar Sep 02 '24 09:09 tomkerkhove

I've got some good news to share :) I observed, in the CI pipeline, the batch scrape jobs working end-to-end. That included grouping into batches, job dispatch, and results processing. While there are still a few things to work out, I think what I have now is ready for another look! Would you mind doing another round of review?

Since this is almost an XL-sized PR, I've attached another diagram to help you (and myself) understand the execution flow of these batch jobs.

Is the goal to have it merged, or is it still WIP? I'm asking because you mention some things are done in a separate PR, so I want to set the expectations right.

If you want to merge it, please publish the PR

tomkerkhove avatar Sep 02 '24 09:09 tomkerkhove

I've got some good news to share :) I observed, in the CI pipeline, the batch scrape jobs working end-to-end. That included grouping into batches, job dispatch, and results processing. While there are still a few things to work out, I think what I have now is ready for another look! Would you mind doing another round of review? Since this is almost an XL-sized PR, I've attached another diagram to help you (and myself) understand the execution flow of these batch jobs.

Is the goal to have it merged, or is it still WIP? I'm asking because you mention some things are done in a separate PR, so I want to set the expectations right.

If you want to merge it, please publish the PR

Right, to be clear: the goal is to merge this PR and then do all the things I mentioned as follow-ups. I'll create separate issues to track them. I've taken the "draft" label off.

hkfgo avatar Sep 03 '24 05:09 hkfgo

Thanks, I'll check tomorrow!

tomkerkhove avatar Sep 09 '24 06:09 tomkerkhove

Added a few comments, but it generally looks good and I trust you have tested this properly!

Would you mind opening a PR for our docs on https://github.com/promitor/docs as well please?

I have indeed done a fair amount of testing. In CI, I configured the scraper to run under batch scraping mode, and it scraped the same metrics as single-resource scraping through ARM. I've also tested this in our staging environment, with Azure Redis and AzureSQL deployments. The transition from single-resource scraping to batched scraping was fairly seamless, and the new histogram metric that tracks batch size appears to work as well.

I could use some help getting the CI pipeline unblocked, btw. As you may have noticed, it's broken for both the scraper and the resource discovery agent. The resource discovery tests appear to be discovering more resources than expected:

[xUnit.net 00:00:06.39]     Promitor.Tests.Integration.Services.ResourceDiscovery.ResourceDiscoveryTests.ResourceDiscoveryV1_GetWithFilterOnTwoRegions_ReturnsExpectedAmount [FAIL]
  Failed Promitor.Tests.Integration.Services.ResourceDiscovery.ResourceDiscoveryTests.ResourceDiscoveryV2_GetWithFilterOnMultipleValueInstanceTag_ReturnsExpectedAmount [13 ms]
  Error Message:
   Assert.Equal() Failure: Values differ
Expected: 405
Actual:   409

Can you quickly check if there's a real change in the number of Logic App instances? If so, we may need to update those tests.

The scraper agent is failing too, unfortunately :/ The timing (plus the fact that only the OpenTelemetry Collector tests are failing) coincided with a [new release](https://github.com/open-telemetry/opentelemetry-collector-releases/releases/tag/v0.104.0). Those tests worked when I reverted to an older version (0.103.0). I'm playing around with different configs to make the tests work with the latest version.

hkfgo avatar Sep 19 '24 07:09 hkfgo

One more thing - if we merge this PR, I'd have to run the CI in single-scraping mode. This is because some resources are in different Azure regions and batch scraping only works single-region right now. Multi-region support is a follow-up I'll take on later(will create an issue for it).

If that sounds good to you, let me clean up the code and address your comments. Looking forward to getting this merged soon :)

hkfgo avatar Sep 19 '24 07:09 hkfgo

Sorry for the delay, checking now!

tomkerkhove avatar Sep 23 '24 17:09 tomkerkhove

Added a few comments, but it generally looks good and I trust you have tested this properly! Would you mind opening a PR for our docs on promitor/docs as well please?

I have indeed done a fair amount of testing. In CI, I configured the scraper to run under batch scraping mode, and it scraped the same metrics as single-resource scraping through ARM. I've also tested this in our staging environment, with Azure Redis and AzureSQL deployments. The transition from single-resource scraping to batched scraping was fairly seamless, and the new histogram metric that tracks batch size appears to work as well.

I could use some help getting the CI pipeline unblocked, btw. As you may have noticed, it's broken for both the scraper and the resource discovery agent. The resource discovery tests appear to be discovering more resources than expected:

[xUnit.net 00:00:06.39]     Promitor.Tests.Integration.Services.ResourceDiscovery.ResourceDiscoveryTests.ResourceDiscoveryV1_GetWithFilterOnTwoRegions_ReturnsExpectedAmount [FAIL]
  Failed Promitor.Tests.Integration.Services.ResourceDiscovery.ResourceDiscoveryTests.ResourceDiscoveryV2_GetWithFilterOnMultipleValueInstanceTag_ReturnsExpectedAmount [13 ms]
  Error Message:
   Assert.Equal() Failure: Values differ
Expected: 405
Actual:   409

Can you quickly check if there's a real change in the number of Logic App instances? If so, we may need to update those tests.

The scraper agent is failing too, unfortunately :/ The timing (plus the fact that only the OpenTelemetry Collector tests are failing) coincided with a [new release](https://github.com/open-telemetry/opentelemetry-collector-releases/releases/tag/v0.104.0). Those tests worked when I reverted to an older version (0.103.0). I'm playing around with different configs to make the tests work with the latest version.

Yeah, valid point - I have opened https://github.com/tomkerkhove/promitor/pull/2552

tomkerkhove avatar Sep 23 '24 18:09 tomkerkhove

Alright, I've cleaned up the code and addressed all comments! This PR should be in a mergeable state now, except that the Scraper CI is still failing. I created an issue to track that. I think the tests are just flaky, since you can see them passing on recent runs: https://dev.azure.com/tomkerkhove/Promitor/_build?definitionId=50&_a=summary

hkfgo avatar Sep 26 '24 08:09 hkfgo

Oh, and I'm working on the public documentation at the moment. The PR for that will be coming soon.

hkfgo avatar Sep 26 '24 08:09 hkfgo

Great! We're ready to go, let me know when the CI is passing 👌

tomkerkhove avatar Oct 01 '24 10:10 tomkerkhove

Great! We're ready to go, let me know when the CI is passing 👌

I think this is okay to merge as is, since we're now fairly sure the failures come down to test flakiness. The only failing part of the CI is the Scraper OTEL Collector tests on Linux (the Windows image built just fine). I don't mind retrying every now and then, but it might be better to fix the flakiness through a separate PR. Wdyt?

hkfgo avatar Oct 02 '24 00:10 hkfgo

Oh hey, the CI passed after a few retries.

hkfgo avatar Oct 02 '24 02:10 hkfgo