Provide an OpenTelemetry scaler
Proposal
OpenTelemetry allows applications/vendors to push metrics to a collector or to integrate their own exporters in the app.
KEDA should provide an OpenTelemetry scaler which is used as an exporter so we can pull metrics and scale accordingly.
Scaler Source
OpenTelemetry Metrics
Scaling Mechanics
Scale based on returned metrics.
Authentication Source
TBD
Anything else?
OpenTelemetry Metrics are still in beta but are expected to go GA by the end of the year.
Go SDK: https://github.com/open-telemetry/opentelemetry-go
It's a really good improvement, I will have a look to see if I can help with this topic 🙂
Awesome, thank you!
@mknet3 , ping me if you need help ;)
just to confirm, I'm on it and I will help with this scaler
Great, thanks!
Hi @tomkerkhove, I have had a look at this issue and I would like to clarify some things. AFAIK the goal of this issue is to provide a scaler based on metrics exposed by an exporter configured in the collector. This exporter will expose metrics in a KEDA format to be read by the scaler. Quick question: does the exporter already exist, or is there a plan to develop it? (I suppose it will be in opentelemetry-collector-contrib.) This question is to figure out what the format of the exposed data will be, so we can pull it in the scaler.
That would be part of the investigation but I think we'll need to build our own exporter to get the metrics in; or use the gRPC OTEL exporter / HTTP OTEL exporter as a starting point to push it to KEDA.
I'd prefer the latter approach to get started as we don't have a preference on the metric format, so OTEL is fine.
@mknet3 prefers to keep it free for the moment because it's his first task with golang
Working on this
Before we go all in, might be good to post a proposal here @SushmithaVReddy to avoid having to redo things but think relying on OTEL exporter is best
@tomkerkhove , sure. I'll put a proposal here before we start the implementation.
Quick doubt: is the idea here to scale based on the metrics obtained from the data types in go.opentelemetry.io/otel/exporters/otlp/otlpmetrics?
@tomkerkhove Will KEDA be acting as a collector that gets metrics data from an exporter? Is the idea to create metrics using https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/api.md#instrument and observe through the HPA to scale accordingly? I'm slightly confused by the terms exporter and collector w.r.t. KEDA. A plausible solution looks like one where the user has an exporter that exports metrics, and KEDA connects to this exporter and gets metrics (acting as a collector?) to make a scaling decision based on the metrics mentioned in the scaled object.
The idea is to use the OTEL exporter (https://github.com/open-telemetry/opentelemetry-collector/blob/main/exporter/otlpexporter/README.md) from which KEDA fetches metrics to make scaling decision on.
This is similar to how we integrate with Prometheus where we pull the metrics from Prometheus and move on, however, here it's in OTEL format coming from the expected OTEL exporter that end-users have to add to their OTEL collector (so not up to KEDA)
From an end-user perspective, they should give us:
- URI of the OTEL endpoint to talk to on the collector (but they add the following to their collector: https://github.com/open-telemetry/opentelemetry-collector/blob/main/exporter/otlpexporter/README.md#getting-started)
- Optional parameter to use gRPC or HTTP (but we can just start with gRPC for now as well)
Hope that helps?
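To make that setup concrete, here is a rough sketch of what end-users would add to their own collector configuration, based on the otlpexporter README linked above. The endpoint value is an assumption for illustration, not an actual KEDA address:

```yaml
# OTEL collector config (sketch): receive metrics over OTLP and
# forward them over OTLP/gRPC to an endpoint KEDA can read from.
receivers:
  otlp:
    protocols:
      grpc:
exporters:
  otlp:
    endpoint: keda-otel-endpoint:4317  # assumed address, for illustration
    tls:
      insecure: true
service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [otlp]
```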
This helps Tom. Thanks!
@tomkerkhove any thoughts on the scaled object here(ref below). The idea is to use OTEL (https://pkg.go.dev/go.opentelemetry.io/otel) and connect to the endpoint mentioned in the scaledobject and pull the metric value and compare to the threshold to scale.
```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: opentelemetry-scaledobject
  namespace: keda
  labels:
    deploymentName: dummy
spec:
  maxReplicaCount: 12
  scaleTargetRef:
    name: dummy
  triggers:
    - type: opentelemetry
      metadata:
        exporter: http://otel-collector:4317
        metrics:
          - metricName: http_requests_total
            threshold: '100'
      authenticationRef:
        name: authdata
```
I was also wondering about scenarios where users want to pull multiple metrics from their application and scale based on conditions over those metrics, e.g. as below:
```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: opentelemetry-scaledobject
  namespace: keda
  labels:
    deploymentName: dummy
spec:
  maxReplicaCount: 12
  scaleTargetRef:
    name: dummy
  triggers:
    - type: opentelemetry
      metadata:
        exporter: http://otel-collector:4317
        metrics:
          - metricName: http_requests_total
            threshold: '100'
            operator: greaterthan
          - metricName: http_timeouts
            threshold: '5'
            operator: lesserthan
        query: http_requests_total and http_timeouts
      authenticationRef:
        name: authdata
```
Any ideas on what the scope of the scaler we'll be building should be in terms of multiple metrics?
It's ok for me to use that package since that's the official SDK - Thanks for checking.
I don't see the difference between both proposals other than one vs multiple metrics though? Can you elaborate on it?
In terms of supporting multiple metrics - I'd argue that given we support multiple triggers it might be more aligned with other scalers to only support 1 metric per trigger to keep a consistent approach in KEDA. The only consideration I would have here is performance but I think we can manage that in the implementation. Thoughts @zroubalik @JorTurFer?
Based on that we'll need to review the YAML spec, but in general I think it's ok; however, if we use multiple levels then I would use exporter.url instead of exporter, given we might need auth in the future or similar settings.
Yes @tomkerkhove, the proposals point out multiple-metrics usage as you understood. I agree with being consistent with the other scalers we have, but I'm concerned about how much value our scaling will add if it can only scale on a single metric, given OpenTelemetry is mostly used to emit a lot of metrics.
nitpick: If we have one metric per scaled object and a user wants to scale based on multiple metrics and goes ahead and creates that many scaled objects, I wonder how we handle concurrent scenarios where multiple metrics will result in scaling (over-scaling, because the scaled-up instances could have been reused?).
It would be nice to also make the protocol configurable given OTEL supports both http and gRPC
Ha, but we have this covered already today.
Customers should only create 1 SO per scale target (for which we will provide validation soon). However, 1 SO can have 1 or more triggers and starts scaling as soon as one of them meets the criteria. You can learn more about that in our concepts.
Right, does the below one make sense?
```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: opentelemetry-scaledobject
  namespace: keda
  labels:
    deploymentName: dummy
spec:
  maxReplicaCount: 12
  scaleTargetRef:
    name: dummy
  triggers:
    - type: opentelemetry
      metadata:
        exporter:
          protocol: grpc
          url: http://otel-collector:4317
        metric:
          name: http_requests_total
          threshold: '100'
      authenticationRef:
        name: authdata
    - type: opentelemetry
      metadata:
        exporter:
          protocol: grpc
          url: http://otel-collector:4317
        metric:
          name: http_errors
          threshold: '10'
      authenticationRef:
        name: authdata
```
Yeah, this is correct, you can define multiple triggers per SO. Just one thing: is the metric.name related to otel?
metric.name will be used to pull the metric. It should match the name the user has in their instrumented application.
Okay, and let's make the trigger metadata flat, to be in sync with other scalers. Something like this; feel free to rename/update the fields to follow OTEL conventions:
```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: opentelemetry-scaledobject
  namespace: keda
  labels:
    deploymentName: dummy
spec:
  maxReplicaCount: 12
  scaleTargetRef:
    name: dummy
  triggers:
    - type: opentelemetry
      metadata:
        protocol: grpc
        exporter: http://otel-collector:4317
        metric: http_requests_total
        threshold: '100'
      authenticationRef:
        name: authdata
```
Sounds good to me.
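To make the agreed flat metadata concrete, here is a rough Go sketch of how a scaler might parse and validate it. The struct, function name, defaults, and the activity check are illustrative assumptions, not KEDA's actual scaler implementation:

```go
package main

import (
	"fmt"
	"strconv"
)

// otelMetadata mirrors the flat trigger metadata agreed above.
// Field names here are illustrative, not KEDA's real scaler code.
type otelMetadata struct {
	protocol  string
	exporter  string
	metric    string
	threshold float64
}

// parseOtelMetadata validates the raw metadata map from a ScaledObject trigger.
func parseOtelMetadata(raw map[string]string) (*otelMetadata, error) {
	m := &otelMetadata{protocol: "grpc"} // default to gRPC as discussed
	if p, ok := raw["protocol"]; ok {
		if p != "grpc" && p != "http" {
			return nil, fmt.Errorf("unsupported protocol %q", p)
		}
		m.protocol = p
	}
	exp, ok := raw["exporter"]
	if !ok || exp == "" {
		return nil, fmt.Errorf("exporter endpoint is required")
	}
	m.exporter = exp
	metric, ok := raw["metric"]
	if !ok || metric == "" {
		return nil, fmt.Errorf("metric name is required")
	}
	m.metric = metric
	t, err := strconv.ParseFloat(raw["threshold"], 64)
	if err != nil {
		return nil, fmt.Errorf("invalid threshold: %w", err)
	}
	m.threshold = t
	return m, nil
}

func main() {
	meta, err := parseOtelMetadata(map[string]string{
		"protocol":  "grpc",
		"exporter":  "http://otel-collector:4317",
		"metric":    "http_requests_total",
		"threshold": "100",
	})
	if err != nil {
		panic(err)
	}
	// The scaler would report the trigger as active when the value
	// pulled from the collector meets or exceeds the threshold.
	pulledValue := 250.0
	fmt.Println(meta.metric, pulledValue >= meta.threshold)
}
```

This keeps validation in one place, mirroring how other KEDA scalers turn the trigger metadata map into a typed config before talking to the external system.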
Maybe I misunderstand something about this conversation, but given all pods in a replicaset will contain an otel collector, which one would the keda autoscaler talk to in order to make the decisions?
Also, how would you apply aggregates across metric labels?
KEDA will not manage the OTEL collector and is something you'd need to run separately next to KEDA/in your cluster.
Does that clarify it?
Sorry maybe my question was not clear enough.
If you have 10 pods, all of which have otel sidecars running, which one will KEDA talk to? If just one, it won't have enough information on which to base scaling decisions. If it talks to all of them, then how will it generate aggregates of the data across all?
There is no sidecar involved, there will be a separate deployment that KEDA integrates with through a Kubernetes service. End-users will have to bring their own OpenTelemetry Collector: https://opentelemetry.io/docs/collector/deployment/
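For illustration, such a standalone collector deployment could be exposed to KEDA via an ordinary Kubernetes Service; the names, labels, and port below are assumptions matching the earlier examples, not a prescribed setup:

```yaml
# Service (sketch) fronting the user's OTEL collector Deployment,
# giving KEDA a stable address like http://otel-collector:4317.
apiVersion: v1
kind: Service
metadata:
  name: otel-collector
  namespace: keda
spec:
  selector:
    app: otel-collector  # assumed label on the collector Deployment
  ports:
    - name: otlp-grpc
      port: 4317
      targetPort: 4317
```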
Any update on this @SushmithaVReddy ?
The priorities of @SushmithaVReddy have changed and she no longer has time to complete the task, so I'm unassigning her.