opentelemetry-collector-contrib
Service name based routing for trace collector span metrics
Description: Adds support for service-name-based routing in the loadbalancing exporter. Instead of every collector host receiving spans for the same service (and each producing its own span metrics), this feature establishes a one-to-one mapping between a service and a collector host. Routing stays traceID-based by default when nothing is configured; to enable service-based routing, add the following to the loadbalancing exporter config:
Closes #12652
routing_key: "service"
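For reference, a minimal sketch of a loadbalancing exporter configuration with service-based routing is shown below; the backend hostnames and OTLP settings are placeholders.

```yaml
exporters:
  loadbalancing:
    # Route spans by service name instead of the default traceID.
    routing_key: "service"
    protocol:
      otlp:
        # The options of the regular OTLP exporter apply here.
        timeout: 1s
    resolver:
      static:
        # Placeholder tier-2 collector endpoints.
        hostnames:
          - collector-1:4317
          - collector-2:4317
```

With `routing_key: "service"`, all spans that share a service name are consistently sent to the same backend, so only one downstream collector emits span metrics for that service.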
Testing: Added unit tests covering the different routing types, e.g. service-based routing vs. the default traceID-based routing.
@jpkrohling Thanks a lot for the review. Made the suggested changes and pushed them.
@albertteoh Thanks a lot for your review, it was really helpful. Updated the PR to address your comments. Thanks in advance!
@jpkrohling @albertteoh Thanks again. Addressed the review comments
Are you able to answer @jpkrohling's question: https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/12421#discussion_r925995997? I'm quite curious about how and why service-name routing is used.
@albertteoh @jpkrohling We are currently using the spanmetrics processor to generate RED metrics and ingest them into Prometheus. These metrics are scoped to a service+operation context, so whether we generate RED metrics on a single-tier collector (with multiple instances) or on two tiers behind a load-balancing exporter, each collector instance would generate metrics for the same service+operation name, causing label collisions in Prometheus. To avoid this, we would like to be able to do collector 1 -> service-based load balancing -> collector 2, where the spanmetrics processor generates the metrics. Does this clarify?
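To make the two-tier setup concrete, a rough sketch of a tier-2 collector (the tier that runs the spanmetrics processor behind the service-based loadbalancing exporter) might look like the config below. The receiver, exporter, and endpoint choices are placeholders, and the exact spanmetrics wiring should follow the spanmetrics processor README.

```yaml
# Tier-2 collector sketch: each instance receives all spans for a given set
# of services, so RED metrics for a service come from exactly one instance.
receivers:
  otlp:
    protocols:
      grpc:

processors:
  spanmetrics:
    # Name of the exporter the generated metrics are handed to.
    metrics_exporter: prometheus
  batch:

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"  # placeholder scrape endpoint
  logging:                    # placeholder traces backend

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [spanmetrics, batch]
      exporters: [logging]
    metrics:
      receivers: [otlp]
      exporters: [prometheus]
```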
Yes, I guess it does make sense. Can you add this to the readme? I'm sure people will find it useful to understand why this exists in the first place.
Thanks @vjsamuel for the clarification. @jpkrohling Updated the README.
Looks good to me. I'll wait for @albertteoh's feedback before moving forward with this.
Does this clarify?
Yup, makes sense now, thanks @vjsamuel. I wonder if Prometheus federation would help in this case (I have zero experience with it, but have just heard it mentioned a few times).
I wonder if Prometheus federation
At smaller scale, maybe. But if we have several hundreds of services, then depending on the granularity at which the processor generates the RED metrics, it will become harder and harder for Prometheus federation to handle.
@albertteoh or @jpkrohling Can you please help merge the PR?
This would be very useful to me. Hopefully we see it in soon.
@jpkrohling Thanks. I modified the README. Can you please check and let me know if anything specific still needs to be added?
@jpkrohling Could you kindly help move this forward? Are there more approvals required, if I may ask?
Linting is failing:
Error: trace_exporter_test.go:189:18: Error return value of `p.Shutdown` is not checked (errcheck)
defer p.Shutdown(context.Background())
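One common way to satisfy errcheck here (a sketch assuming a testify-based test, with `p` and `t` coming from the surrounding test) is to check the error inside the deferred function:

```go
// Instead of: defer p.Shutdown(context.Background())
defer func() {
	// Fail the test if shutdown returns an error, so errcheck is satisfied.
	require.NoError(t, p.Shutdown(context.Background()))
}()
```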
- :white_check_mark: login: aishyandapalli
The committers listed above are authorized under a signed CLA.
@jpkrohling Thanks for verifying. Fixed the lint issue. It looks like all the builds are passing now.