OpenTelemetry deployment or API integration

Open rootfs opened this issue 2 years ago • 21 comments

For deployment integration, evaluate the architecture of the metrics -> telemetry adapter.

For API integration, evaluate the scalability of the telemetry client in Kepler.

rootfs avatar Apr 27 '23 12:04 rootfs

For API migration, we may need to double-check whether OpenTelemetry supports all of the metric types Kepler uses today. I found that the Summary metric type is marked as legacy in OpenTelemetry, with no migration guide.

SamYuan1990 avatar May 14 '23 08:05 SamYuan1990

Meeting 30: implement an OTel API client in Kepler and emit telemetry directly. Hopefully there is a way to convert metrics to telemetry. @husky-parul will take the first try.

The previous discussion is here: https://github.com/sustainable-computing-io/kepler/issues/97

rootfs avatar May 30 '23 11:05 rootfs

I just verified that we can export OpenTelemetry metrics and then, by using the OpenTelemetry Collector, also expose those metrics to Prometheus.

Therefore, if the user has the OpenTelemetry Collector deployed in the cluster, Kepler does not need to export Prometheus metrics.

So we need to make it configurable and avoid duplications. That is, if OpenTelemetry metrics are enabled, we should disable Prometheus metrics and vice-versa.
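
A minimal sketch of what such a toggle could look like, using only the Go standard library. The `ExporterConfig` struct, its field names, and `SelectExporter` are hypothetical and are not taken from Kepler's code; the only point is that enabling both exporters at once is rejected rather than silently duplicating metrics.

```go
package main

import (
	"errors"
	"fmt"
)

// ExporterConfig is a hypothetical settings struct for this sketch;
// Kepler's real configuration keys may differ.
type ExporterConfig struct {
	EnablePrometheus bool
	EnableOTel       bool
}

// SelectExporter returns the single exporter to start. It rejects
// configurations that would emit the same metrics twice, and
// configurations that would emit nothing at all.
func SelectExporter(c ExporterConfig) (string, error) {
	switch {
	case c.EnablePrometheus && c.EnableOTel:
		return "", errors.New("enable only one of Prometheus or OpenTelemetry metrics")
	case c.EnableOTel:
		return "otel", nil
	case c.EnablePrometheus:
		return "prometheus", nil
	default:
		return "", errors.New("no metrics exporter enabled")
	}
}

func main() {
	name, err := SelectExporter(ExporterConfig{EnableOTel: true})
	fmt.Println(name, err)
}
```

Failing fast on an ambiguous configuration is one possible design; another would be to let one exporter take precedence and log a warning.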

marceloamaral avatar May 31 '23 04:05 marceloamaral

Recap

As a step toward migrating from Prometheus metrics to OpenTelemetry metrics, which allows vendor- and tool-agnostic observability, I did an initial POC: instrumenting an exporter with the OTel SDK, collecting metrics with the OTel Collector, and building a dashboard in Grafana (poc example).

Before starting the migration, I looked into the Kepler code to identify the metric types in use. So far I see that Kepler uses only Counters and Gauges:
https://github.com/sustainable-computing-io/kepler/blob/main/pkg/collector/prometheus_process_collector.go#L30

The OTel SDK supports the synchronous Counter and the asynchronous GaugeObserver. They highlight one point about GaugeObserver:

For GaugeObserver timeseries, backends usually display the last value and don't allow to sum different timeseries together.

It should not affect our implementation though. @rootfs @marceloamaral @sunya-ch @SamYuan1990 @bertysentry

husky-parul avatar Jul 16 '23 22:07 husky-parul

This is awesome @husky-parul! WRT metric types, make sure to use Gauge only for metrics that are usually not summable (additive), like temperature, ratios, etc. For other metrics that move "up and down", like measured electrical power, you should use UpDownCounter. See OpenTelemetry Supplementary Guidelines about this.
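
A rough sketch of that rule of thumb, written as plain Go with no OTel dependency. The `MetricTraits` type and `ChooseInstrument` function are hypothetical helpers made up for illustration; only the instrument names (Gauge, Counter, UpDownCounter) come from OpenTelemetry.

```go
package main

import "fmt"

// MetricTraits describes a measurement. The type and its fields are
// hypothetical and exist only for this sketch.
type MetricTraits struct {
	Additive  bool // summing across series is meaningful (e.g. power in watts)
	Monotonic bool // the value only ever increases (e.g. cumulative joules)
}

// ChooseInstrument maps the traits to an OpenTelemetry instrument kind:
// non-additive values become a Gauge, additive values that only grow
// become a Counter, and additive values that move up and down become
// an UpDownCounter.
func ChooseInstrument(t MetricTraits) string {
	switch {
	case !t.Additive:
		return "Gauge"
	case t.Monotonic:
		return "Counter"
	default:
		return "UpDownCounter"
	}
}

func main() {
	fmt.Println("temperature:", ChooseInstrument(MetricTraits{Additive: false}))
	fmt.Println("power:", ChooseInstrument(MetricTraits{Additive: true}))
	fmt.Println("energy:", ChooseInstrument(MetricTraits{Additive: true, Monotonic: true}))
}
```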

bertysentry avatar Jul 17 '23 08:07 bertysentry

I am proposing the following. @sustainable-computing-io/maintainer please TAL. Let me know if you have any questions.

design

Components

Instrumentation: Kepler is instrumented with the OTEL SDK to collect metrics.

OTEL Collector: The OTEL collector receives the exported metrics data from the instrumented applications. The collector acts as an intermediary component that processes and routes the telemetry data to the appropriate destinations. For Kepler, we are going to support the OpenTelemetry protocol (OTLP) to receive data from the instrumented applications.

Exporters: The OTEL collector will use OTEL exporters to send metrics data to backends. We currently use Prometheus as the backend, but other options include InfluxDB and Elasticsearch. We will use the OTEL Prometheus Exporter with Grafana. These exporters convert the collected metrics into a format that Grafana can understand and consume.

Data Storage: The exported metrics data is stored in Prometheus.

Grafana Data Source: Grafana will be configured to connect to the Prometheus data storage backend where the metrics data is stored. The connection is established through the Prometheus data source within Grafana.

Visualization in Grafana: Grafana can query the metrics data from the storage backend and create visualizations based on the collected metrics.

husky-parul avatar Jul 17 '23 13:07 husky-parul

When using the Prometheus exporter, I recommend enabling the normalization of metric names with this flag: --feature-gates=pkg.translator.prometheus.NormalizeName. Otel metric names will be normalized as described here
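
To give a feel for the kind of renaming involved, here is a simplified approximation in Go. It is not the Collector's actual code: the unit table is a small illustrative subset, the real rules have more cases and may differ between versions, and the OTel metric name `kepler.container.energy` is a made-up example.

```go
package main

import (
	"fmt"
	"strings"
)

// unitSuffix holds a few illustrative unit translations. This is an
// assumption for the sketch; the Collector's real table is larger.
var unitSuffix = map[string]string{
	"s":  "seconds",
	"By": "bytes",
	"J":  "joules",
	"W":  "watts",
}

// normalizeName approximates Prometheus-style normalization:
// dots become underscores, a unit word is appended unless the name
// already contains it, and monotonic counters get a _total suffix.
func normalizeName(name, unit string, monotonicCounter bool) string {
	out := strings.ReplaceAll(name, ".", "_")
	if suffix, ok := unitSuffix[unit]; ok && !strings.Contains(out, suffix) {
		out += "_" + suffix
	}
	if monotonicCounter && !strings.HasSuffix(out, "_total") {
		out += "_total"
	}
	return out
}

func main() {
	// Hypothetical OTel metric: a monotonic energy counter in joules.
	fmt.Println(normalizeName("kepler.container.energy", "J", true))
}
```

Under these assumptions, the hypothetical counter above comes out with both a `_joules` and a `_total` suffix, which matches the usual Prometheus naming style.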

bertysentry avatar Jul 18 '23 23:07 bertysentry

@husky-parul this looks great! Look forward to this happening!

rootfs avatar Jul 19 '23 01:07 rootfs

Looks good to me! Thanks for working on this!

marceloamaral avatar Jul 21 '23 05:07 marceloamaral

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Sep 19 '23 05:09 stale[bot]

I don't think this is stale. This issue should get more attention as OTel is quickly becoming the de-facto standard to export telemetry everywhere.

brunobat avatar Sep 19 '23 07:09 brunobat

It is not stale. I am working on this and a demo/PR is WIP.

husky-parul avatar Sep 19 '23 07:09 husky-parul

Also, good news: a future version of Prometheus will be capable of ingesting OTel metrics, and the mechanism for translating OTel metrics to Prometheus metrics is the one I mentioned earlier.

bertysentry avatar Sep 19 '23 09:09 bertysentry

thank you @bertysentry for the info! We are going to make this happen in the next release milestone. Stay tuned!

rootfs avatar Sep 21 '23 12:09 rootfs

@husky-parul the proposal in https://github.com/sustainable-computing-io/kepler/issues/659#issuecomment-1638153081 is great. Do you already have some updates?

@rootfs Just out of curiosity, is there a timeline for the next milestone?

frzifus avatar Sep 29 '23 14:09 frzifus

Thanks for sharing the information.

Just out of curiosity, is there a timeline for the next milestone?

@frzifus OTel integration will be part of our next release, which is 0.7. Our releases take place every 6 months, so it will happen in Q1 of 2024.

husky-parul avatar Oct 02 '23 09:10 husky-parul

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Dec 01 '23 10:12 stale[bot]

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Jan 30 '24 18:01 stale[bot]

Are we done with this ticket? @rootfs

SamYuan1990 avatar Feb 11 '24 02:02 SamYuan1990

Is there any documentation on using OTel to collect metrics from Kepler? Thanks

gyliu513 avatar Mar 05 '24 20:03 gyliu513

https://github.com/husky-parul/otel-observability

Please try this. We haven't merged this doc into the Kepler website yet. Thanks

husky-parul avatar Mar 07 '24 07:03 husky-parul