retina icon indicating copy to clipboard operation
retina copied to clipboard

feat(advanced-metrics): Add Support for Topology (Availability) Zone Labels

Open rickardsjp opened this issue 6 months ago • 10 comments

Description

This PR introduces availability zone labels (topology.kubernetes.io/zone) for remote context advanced metrics.

With this PR, it is possible to add the zone to metrics as sourceLabel or destinationLabel via the MetricsConfiguration CRD.

The data is added via the enricher component to the RetinaMetadata in the Flow struct's Extensions attribute, because the node controller already has access to the labels of the Nodes.

Related Issue

https://github.com/microsoft/retina/issues/1654

Checklist

  • [x] I have read the contributing documentation.
  • [x] I signed and signed-off the commits (git commit -S -s ...). See this documentation on signing commits.
  • [x] I have correctly attributed the author(s) of the code.
  • [x] I have tested the changes locally.
  • [x] I have followed the project's style guidelines.
  • [x] I have updated the documentation, if necessary.
  • [x] I have added tests, if applicable.

Screenshots (if applicable) or Testing Completed

img

Additional Notes

We have not added it to the local context because we don't see the value in adding just one zone to a pod. We have not added it to the hubble scenario because hubble doesn't support availability zones for the destination yet. Tagging the co-authors: @ScheererJ @rrhubenov


Please refer to the CONTRIBUTING.md file for more information on how to contribute to this project.

rickardsjp avatar Jun 03 '25 16:06 rickardsjp

@microsoft-github-policy-service agree company="SAP SE"

rickardsjp avatar Jun 04 '25 08:06 rickardsjp

Done! Thanks for re-running the CI for all the force pushes, we had issues running some of the makefile targets locally. I also can't figure out why the Windows image builds fail - any pointers? I can't really make sense of the 503 and EOFs.

rickardsjp avatar Jun 04 '25 16:06 rickardsjp

@rickardsjp MCR is acting up for windows 2019 images, but it should be transient

nddq avatar Jun 05 '25 16:06 nddq

This PR will be closed in 7 days due to inactivity.

github-actions[bot] avatar Jul 06 '25 00:07 github-actions[bot]

Pull request closed due to inactivity.

github-actions[bot] avatar Jul 13 '25 00:07 github-actions[bot]

Any chance to recover this PR?

angelbarrera92 avatar Sep 10 '25 07:09 angelbarrera92

Hi @rickardsjp, would you like to continue working on this PR?

SRodi avatar Sep 26 '25 15:09 SRodi

Hi @SRodi, yep! Didn't mean to abandon it, just sort of ran out of capacity. Will try to pick it up again in the coming weeks.

rickardsjp avatar Sep 26 '25 16:09 rickardsjp

There were some changes to the protobuf definitions, so I rebased the PR for starters. I'll try and take another swing at the PR feedback in the coming weeks.

rickardsjp avatar Nov 25 '25 13:11 rickardsjp

Hi @SRodi,

can we have test coverage for negative/edge cases, like missing or malformed zones?

Malformed zones are sort of unlikely, the zone label on k8s nodes is a string and at most would restrict the length. Could you specify where you would like to see additional tests? I.e. in the controller test, the metrics test, or the types tests?

is there any backward compatibility consideration and potential impact on downstream consumers for RetinaEndpoint? (I do not think so, but have you considered/tested that?)

Does this mean the custom resource RetinaEndpoint or the internal RetinaEndpoint struct? The CR is unchanged and we do not foresee issues with the internal type.

shall we make the default zone "unknown" a configurable value? and maybe add logging for missing zones?

Configurable value: "unknown" is not a configurable value for other fields, see for example here. Logging: Added some debug level logs for empty zone fields.

rickardsjp avatar Nov 26 '25 14:11 rickardsjp