opentelemetry-collector-contrib
[azuremonitorreceiver] Add the ability to pull Resource Metrics using AZ Query Batch API
Add the ability to pull Resource Metrics using the AZ Query Batch API. Currently a Proof-of-Concept that does not allow Resources to exist in different regions and does not filter on dimensions the way the previous (per-Resource) implementation does.
Description: This is an enhancement that tries to avoid limitations of querying resources directly by using the AZ Query Batch API that was recently released for the Go SDK.
Link to tracking Issue: <Issue number if applicable>
Testing: Tested locally against a Resource Group in a single region. Some issues exist that still need to be investigated.
Documentation: This will be updated once the Draft is discussed.
This PR was marked stale due to lack of activity. It will be closed in 14 days.
I have pulled the code and compiled a version, tested it on one subscription and it's a really good start!
As in the first comment, we are seeing a few errors, but I am more concerned about this one:
"error": "POST https://northeurope.metrics.monitor.azure.com/subscriptions/56ffb7d8-dd02-4217-9f63-11093b6176ba/metrics:getBatch\n--------------------------------------------------------------------------------\nRESPONSE 400: 400 Bad Request\nERROR CODE: BadRequest\n--------------------------------------------------------------------------------\n{\n \"error\": {\n \"additionalInfo\": [\n {\n \"type\": \"string\",\n \"info\": \"TraceId=ca886d1c7e36172863695ed22c30acf2\"\n },\n {\n \"type\": \"string\",\n \"info\": \"ExceptionType=Microsoft.Online.Metrics.MetricsMP.Utilities.RPRequestFormatException\"\n }\n ],\n \"code\": \"BadRequest\",\n \"message\": \"Number of requests in the batch 301 exceeded max allowed: 50\"\n }\n}
I think this corresponds to Microsoft's documentation: "Up to 50 unique resource IDs may be specified in each call. Each resource must belong to the same subscription, region, and be of the same resource type." (source)
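If that limit is the cause, one straightforward option is to split the resource IDs into chunks of at most 50 before issuing each getBatch call. A minimal sketch, with the function name and shape purely illustrative (not code from this PR):

```go
// batchResourceIDs splits resource IDs into batches of at most batchSize
// (50 for metrics:getBatch); the caller issues one API call per batch.
func batchResourceIDs(resourceIDs []string, batchSize int) [][]string {
	var batches [][]string
	for start := 0; start < len(resourceIDs); start += batchSize {
		end := start + batchSize
		if end > len(resourceIDs) {
			end = len(resourceIDs)
		}
		batches = append(batches, resourceIDs[start:end])
	}
	return batches
}
```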
Also, I would like to start a discussion (maybe for a second PR / feature request) (ping @altuner @codeboten): at Amadeus we have a huge number of subscriptions (thousands). Would it be possible to have auto-discovery of them as well?
- There is a GET subscriptions endpoint available at https://management.azure.com/subscriptions?api-version=2020-01-01
- Listing the resource groups also gives the location (needed for the new getBatch API endpoint), e.g. https://management.azure.com/subscriptions/14b86a40-8d8f-4e69-abaf-42cbb0b8a331/resourcegroups?api-version=2019-10-01
WDYT? (Please tell me if you want me to raise a new issue for this feature request; this getBatch work will be a prerequisite anyway.) Thank you, Alban
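For reference, subscription and resource-group discovery is already exposed by the Azure SDK for Go resource-manager packages. A hedged sketch of what the auto-discovery flow could look like; the package choices, nil options, and printed fields are assumptions for illustration, not part of this PR:

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/Azure/azure-sdk-for-go/sdk/azidentity"
	"github.com/Azure/azure-sdk-for-go/sdk/resourcemanager/resources/armresources"
	"github.com/Azure/azure-sdk-for-go/sdk/resourcemanager/resources/armsubscriptions"
)

func main() {
	ctx := context.Background()
	cred, err := azidentity.NewDefaultAzureCredential(nil)
	if err != nil {
		log.Fatal(err)
	}

	// Enumerate all subscriptions visible to the credential.
	subsClient, err := armsubscriptions.NewClient(cred, nil)
	if err != nil {
		log.Fatal(err)
	}
	subsPager := subsClient.NewListPager(nil)
	for subsPager.More() {
		page, err := subsPager.NextPage(ctx)
		if err != nil {
			log.Fatal(err)
		}
		for _, sub := range page.Value {
			fmt.Println("subscription:", *sub.SubscriptionID)

			// For each subscription, list resource groups to learn their locations,
			// which the regional metrics:getBatch endpoint needs.
			rgClient, err := armresources.NewResourceGroupsClient(*sub.SubscriptionID, cred, nil)
			if err != nil {
				log.Fatal(err)
			}
			rgPager := rgClient.NewListPager(nil)
			for rgPager.More() {
				rgPage, err := rgPager.NextPage(ctx)
				if err != nil {
					log.Fatal(err)
				}
				for _, rg := range rgPage.Value {
					fmt.Printf("  resource group %s in %s\n", *rg.Name, *rg.Location)
				}
			}
		}
	}
}
```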
@ahurtaud - I ran into a few errors. But not that one specifically. I didn't realize that there was a limit to be honest. The documentation was kind of hard to understand. I had some concerns about the number of Dimensions that I was pulling so I did some testing there and it would be really nice to be able to determine the number of dimensions that a Metric can have instead of having to hard code the values. I think it would be pretty easy to break the requests up into batches of 50, I guess I didn't read all of the documentation :-D
Want to ping those who worked on this previously, as I also think the component would benefit from switching to the azquery interfaces in the Azure SDK. I think that provides a lot of leverage for the purposes of this code. However, when experimenting with the changes I found it was difficult to test all paths, and similarly difficult to make the result of changes clear to reviewers, so I thought we could benefit from all of the following.
- Separate the fetching responsibility, so the mechanism for each type of fetch (i.e. resources, metric definitions, metric values) can have its fetch specified in a simplified strategy.
- Separate the caching/providing responsibility, so it's simple for reviewers to verify when state (i.e. azureScraper.resources) is updated. (A small sketch of this separation follows the list.)
- Potentially change the update trigger for each piece of state, so a pull-through flow model wouldn't be our only option. I think this would provide some optionality for controlling consistency of reads if that becomes a concern later.
- With more paths testable and verifiable in a straightforward way, start introducing changes to use higher-level APIs like azquery.
- It looks like this would simplify scraping logs with azquery as well, though I haven't investigated it extensively.
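As a sketch only, the fetch/cache separation could look something like the fragment below; the type and method names (resourceFetcher, cachedResources, etc.) are assumptions for illustration, not the receiver's actual API:

```go
import (
	"context"
	"sync"
	"time"
)

// resourceFetcher isolates the mechanics of one kind of fetch (resources,
// metric definitions, metric values) behind a simple strategy interface.
type resourceFetcher interface {
	Fetch(ctx context.Context) ([]string, error) // e.g. resource IDs
}

// cachedResources owns the cached state and decides when to refresh it, so
// the update trigger lives in a single, easy-to-verify place.
type cachedResources struct {
	mu        sync.Mutex
	fetcher   resourceFetcher
	resources []string
	fetchedAt time.Time
	ttl       time.Duration
}

func (c *cachedResources) Get(ctx context.Context) ([]string, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.resources != nil && time.Since(c.fetchedAt) < c.ttl {
		return c.resources, nil // serve from cache while fresh
	}
	res, err := c.fetcher.Fetch(ctx)
	if err != nil {
		return nil, err
	}
	c.resources, c.fetchedAt = res, time.Now()
	return c.resources, nil
}
```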
In the long run these changes could help with scalability, though it's unknown when or if that would matter; my purpose in proposing them now is to simplify testing on more paths.
Before opening an issue for changes around this, I wanted to get input from those who have looked at this receiver and might have informed opinions to share.
Please share any thoughts you have on this: @cparkins @ahurtaud
I didn't realize that there was a limit to be honest. The documentation was kind of hard to understand.
I agree. Maybe it's just easy to confuse me when there are multiple numbers involved. Most things in the SDK are straightforward as you get familiar with their patterns and the Azure object model, but I believe some calls are limited to 20, I recall reading that one was limited to 10, and @ahurtaud has errors that say the limit is 50 on something else. Currently the default config in factory.go says the maximum number of metrics in a call is 20, though it's configurable. The scraper uses that field in this ad hoc batching mechanism; that's what the start and end variables are for, but I find it difficult to understand, much less verify.
I had some concerns about the number of Dimensions that I was pulling so I did some testing there and it would be really nice to be able to determine the number of dimensions that a Metric can have instead of having to hard code the values.
The current mechanism that deals with dimensions is the metricsCompositeKey, where the dimensions field is a string built in this stanza from the metric definitions we get from Azure. Dimensions are sorted before being joined into a comma-delimited string, so that metric-value requests sharing the same set of dimensions can be batched under one key. (Tell me if the rationale doesn't make sense.)
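For illustration only, the mechanism described above is along these lines; the names and fields here are assumptions rather than a verbatim copy of the receiver's code:

```go
import (
	"sort"
	"strings"
)

// compositeKey mimics a metricsCompositeKey-style key: dimensions are sorted
// and comma-joined so the same set of dimensions always maps to the same batch.
type compositeKey struct {
	timeGrain  string
	dimensions string
}

func newCompositeKey(timeGrain string, dims []string) compositeKey {
	sorted := append([]string(nil), dims...) // copy before sorting to avoid mutating the caller's slice
	sort.Strings(sorted)
	return compositeKey{
		timeGrain:  timeGrain,
		dimensions: strings.Join(sorted, ","),
	}
}
```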
I think it would be pretty easy to break the requests up into batches of 50.
Configurable batch sizes would definitely help. We could depend on a batcher package like this for example, make/copy our own like it, or use methods that work almost as well if controlled in the context. The nice thing about the fancy one is that we don't have to manage keying on the metrics composite key: you specify how to aggregate batches via options, and you can fire calls as soon as you have a full batch or a configured batch timeframe expires.
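To make the "full batch or timeframe expires" behaviour concrete, here is a rough, generic sketch; it is not the linked package, and all names are made up for illustration:

```go
import "time"

// runBatcher collects IDs from in and calls flush either when a batch is full
// or when flushAfter elapses with a partial batch pending.
func runBatcher(in <-chan string, maxSize int, flushAfter time.Duration, flush func([]string)) {
	batch := make([]string, 0, maxSize)
	timer := time.NewTimer(flushAfter)
	defer timer.Stop()
	for {
		select {
		case id, ok := <-in:
			if !ok { // input closed: flush what's left and stop
				if len(batch) > 0 {
					flush(batch)
				}
				return
			}
			batch = append(batch, id)
			if len(batch) == maxSize {
				flush(batch)
				batch = make([]string, 0, maxSize)
				timer.Reset(flushAfter)
			}
		case <-timer.C:
			if len(batch) > 0 {
				flush(batch)
				batch = make([]string, 0, maxSize)
			}
			timer.Reset(flushAfter)
		}
	}
}
```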
Maybe this provides some useful information on the code's current state. Hopefully it isn't a complete waste of time once the PoC version is considered. I plan to have a closer look at that code in a couple of hours.
I like where the PR is headed. I'm planning to open an issue as something like "migrate to azquery package in SDK", and then we can play with using the QueryResource path or the Batch path. I plan to cite this PR as some of the PoC work in the issue. Happy to see the transition to azquery completed in this PR or to open something different. I think both of these paths could ease the burden as far as the layers of calls and caching go.
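For context, the per-resource path in azquery looks roughly like the sketch below; this is hedged, and the placeholder resource ID, nil options, and printed field are assumptions rather than code from this PR:

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/Azure/azure-sdk-for-go/sdk/azidentity"
	"github.com/Azure/azure-sdk-for-go/sdk/monitor/azquery"
)

func main() {
	cred, err := azidentity.NewDefaultAzureCredential(nil)
	if err != nil {
		log.Fatal(err)
	}
	client, err := azquery.NewMetricsClient(cred, nil)
	if err != nil {
		log.Fatal(err)
	}
	// resourceURI is a full ARM resource ID, e.g.
	// /subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.Storage/storageAccounts/<name>
	resourceURI := "<resource ID>"
	resp, err := client.QueryResource(context.Background(), resourceURI, nil)
	if err != nil {
		log.Fatal(err)
	}
	for _, metric := range resp.Value {
		fmt.Println(*metric.Name.Value)
	}
}
```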
a batcher package like this for example
@nslaughter I like the thoughts you have on how to improve the receiver in general. While writing this code I thought some time should be taken to consider different ways to pull the various bits of data that are being used when scraping. Unfortunately I didn't take much time to look at other ways that we could pull the data that might be more efficient. I really like the idea of using the batcher though. And I think that eventually this code will run into issues with getting consistent metrics because of the way it's written (multiple timers triggering at multiple intervals). I'm open to ideas based on this Draft PR and look forward to seeing what you think once you get more in depth.
I had some concerns about the number of Dimensions that I was pulling so I did some testing there and it would be really nice to be able to determine the number of dimensions that a Metric can have instead of having to hard code the values.
The docs say that custom metrics can have up to 10 dimensions, and the metric definitions list the preset dimensions for predefined ones. Does this help, or are you getting at something else?
That does help actually. I was trying to determine the number of standard dimensions for a Storage Account and got to 30.... I probably need to look at the REST responses a little more closely to get this one solved better.
This PR was marked stale due to lack of activity. It will be closed in 14 days.
Closed as inactive. Feel free to reopen if this PR is still being worked on.