Expose or proxy internal IPs of workers for Prometheus monitoring

Open adbreind opened this issue 5 years ago • 2 comments

In some deployments, like Fargate container deploys, it can be impossible for an external Prometheus monitoring host to directly reach workers on their published, yet private/internal network, IPs.

It would be useful if the scheduler or a process alongside the scheduler, perhaps using the "sidecar container pattern," could proxy requests to the workers from public IP/ports.
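As a rough illustration of the sidecar idea, here is a minimal sketch of a proxy that forwards `/<worker-name>/metrics` to a worker's internal address. It is not an existing Dask feature. The `WORKERS` mapping, the worker name, the internal address, and the proxy port 9100 are all hypothetical placeholders; a real sidecar would need to discover workers dynamically and restrict who can reach it.

```python
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from urllib.parse import urlsplit
from urllib.request import urlopen

# Hypothetical static mapping of worker names to internal host:port addresses.
WORKERS = {"worker-0": "10.0.0.11:8787"}


def resolve_upstream(path, workers):
    """Map '/<worker-name>/metrics' to the worker's internal metrics URL, or None."""
    parts = urlsplit(path).path.strip("/").split("/")
    if len(parts) != 2 or parts[1] != "metrics":
        return None
    address = workers.get(parts[0])
    if address is None:
        return None
    return f"http://{address}/metrics"


class MetricsProxy(BaseHTTPRequestHandler):
    def do_GET(self):
        upstream = resolve_upstream(self.path, WORKERS)
        if upstream is None:
            self.send_error(404, "unknown worker or path")
            return
        try:
            # Fetch the metrics from the worker over the internal network.
            with urlopen(upstream, timeout=5) as resp:
                body = resp.read()
                content_type = resp.headers.get("Content-Type", "text/plain")
        except OSError:  # URLError and socket timeouts are OSError subclasses
            self.send_error(502, "worker unreachable")
            return
        self.send_response(200)
        self.send_header("Content-Type", content_type)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=9100):
    """Run the proxy on all interfaces until interrupted."""
    ThreadingHTTPServer(("0.0.0.0", port), MetricsProxy).serve_forever()
```

Calling `serve()` next to the scheduler would let an external Prometheus scrape `http://<scheduler-host>:9100/worker-0/metrics` instead of contacting the worker directly.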

adbreind avatar Oct 09 '20 19:10 adbreind

I ran into the same issue today. Dask workers on GCP do not listen on the external IP, so it is not easy to discover them with Prometheus's default discovery job. I am looking into different ways to force the workers to listen on 0.0.0.0.

dbalabka avatar Jul 04 '25 18:07 dbalabka

@adbreind, the solution is to configure the workers to listen on all interfaces. Here is a solution for GCP:

from dask_cloudprovider.gcp import GCPCluster

cluster = GCPCluster(
    worker_options={
        # Bind the worker's HTTP server (dashboard and /metrics) to all interfaces
        "dashboard_address": "0.0.0.0:8787",
    },
)

Now the worker listens on all interfaces, including the instance's public IP, so Prometheus can scrape the /metrics endpoint directly from the instance. Be careful not to expose the port to the whole internet, and make sure the firewall rules are set up properly.
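Before pointing Prometheus at a worker, it can help to confirm from the monitoring host that the endpoint is reachable. The following is a minimal sketch using only the standard library. The helper names `fetch_worker_metrics` and `metric_names` are made up for this example, and port 8787 is assumed to match the `dashboard_address` set above.

```python
from urllib.request import urlopen


def metric_names(exposition_text):
    """Return the sorted metric names found in Prometheus text-format output."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and '# HELP' / '# TYPE' comment lines.
        if not line or line.startswith("#"):
            continue
        # A sample line looks like 'name{labels} value' or 'name value'.
        names.add(line.split("{", 1)[0].split(" ", 1)[0])
    return sorted(names)


def fetch_worker_metrics(host, port=8787, timeout=5):
    """Fetch the raw /metrics payload from a worker (raises if unreachable)."""
    with urlopen(f"http://{host}:{port}/metrics", timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

For example, `metric_names(fetch_worker_metrics("<worker-public-ip>"))` should return a non-empty list if the firewall and worker settings are correct. A timeout or connection error usually points to a firewall rule or to the worker not listening on that interface.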

Here is a possible Prometheus configuration:

global:
  scrape_interval: 5s

scrape_configs:
  - job_name: "dask-gce"
    metrics_path: "/metrics"           # Default is /metrics, but explicit here
    gce_sd_configs:
      - project: "your-gcp-project-id"      # Your GCP project
        zone: "us-central1-a"          # Your GCE zone
        filter: 'labels.container_vm = "dask-cloudprovider"'
                                        # Only instances carrying dask-cloudprovider's default label
        port: 8787                     # Dask’s HTTP status port
        refresh_interval: "5s"      # Refresh every 5 seconds

    relabel_configs:
      - source_labels: [__meta_gce_public_ip]
        regex: "(.+)"
        target_label: "__address__"
        replacement: "${1}:8787"      # Assemble <public_ip>:8787 for scraping

Similarly, this can be applied to AWS.

dbalabka avatar Jul 07 '25 16:07 dbalabka