Kubeflow integration with AWS managed Prometheus, Grafana and CW

Open goswamig opened this issue 2 years ago • 3 comments

Is your feature request related to a problem? Please describe. Logging and monitoring are two important aspects of an MLOps platform. The AWS distribution of Kubeflow should integrate with AWS managed Prometheus, Grafana, and CloudWatch (CW).

Describe the solution you'd like. I believe we should order this by priority:

  1. Logging: CW and Prometheus integration
  2. Monitoring: Grafana integration.

As a part of this issue, I would like to explore the viability of the option as well.

goswamig avatar Mar 04 '22 19:03 goswamig

Perhaps this could be offered to the user as a config option at installation time. All AWS services should be exposed as config options that the user defines at the start of installation, with an automated script that then performs the installation. This would be super cool and would speed up installation and improve the user experience.

I did the Prometheus and Grafana integration separately, using the steps below:

  1. Download the monitoring-core.yaml and monitoring-metrics-prometheus.yaml files from the Knative 0.18 release (https://github.com/knative/serving/releases/tag/v0.18.0)
  2. kubectl apply -f monitoring-core.yaml
  3. kubectl apply -f monitoring-metrics-prometheus.yaml
  4. kubectl apply -f grafana-virtual-service.yaml
  5. kubectl apply -f grafana-config-map.yaml
  6. kubectl apply -f models-web-app-aut-policy.yaml
  7. kubectl rollout restart deployment/grafana -n knative-monitoring

Log in to Kubeflow and check whether metrics are populating under Models.
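For anyone who wants to script this, steps 2 to 7 above could be chained roughly as follows. This is only a sketch: it assumes all five YAML files are already in the current directory and that `kubectl` points at the right cluster. The `DRY_RUN` switch and the `run` and `install_monitoring` helper names are made up for this example; by default the script only prints the commands it would run.

```shell
#!/bin/sh
# Sketch: apply the monitoring manifests from the steps above, in order.
# Assumes the YAML files from steps 1-6 are in the current directory.
set -eu

# DRY_RUN is a hypothetical switch for this sketch:
# 1 (default) prints each command, 0 actually runs it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

install_monitoring() {
  # Manifest order follows steps 2-6 above.
  for manifest in \
    monitoring-core.yaml \
    monitoring-metrics-prometheus.yaml \
    grafana-virtual-service.yaml \
    grafana-config-map.yaml \
    models-web-app-aut-policy.yaml
  do
    run kubectl apply -f "$manifest"
  done
  # Step 7: restart Grafana after the manifests are applied.
  run kubectl rollout restart deployment/grafana -n knative-monitoring
}

install_monitoring
```

Run it once as-is to review the printed commands, then with `DRY_RUN=0` to apply them for real.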

Harikantipudi avatar Mar 12 '22 11:03 Harikantipudi

@Harikantipudi thanks for adding it. Is this only needed for serving or do we need something similar for notebook, tensorboard etc.?

goswamig avatar Mar 31 '22 23:03 goswamig

@goswamig, the above setup is mainly for monitoring resource usage such as CPU and memory after you deploy a model for inference serving, as shown in the Models UI in Kubeflow. However, it can be extended to performance metrics and other monitoring aspects too.

Harikantipudi avatar Apr 01 '22 13:04 Harikantipudi

@goswamig Hello, is this issue still WIP?

AlexandreBrown avatar Feb 08 '23 04:02 AlexandreBrown

@AlexandreBrown This feature is complete and available in v1.6.1-aws-b1.0.0. https://awslabs.github.io/kubeflow-manifests/docs/add-ons/prometheus/guide/ https://awslabs.github.io/kubeflow-manifests/docs/add-ons/cloudwatch/guide/

ananth102 avatar Feb 08 '23 18:02 ananth102