cloud_controller_ng

Bulk API for process stats

Open stephanme opened this issue 1 year ago • 1 comments

Issue

There is no bulk API to fetch the stats of multiple processes at once.

Context

Running cf a in a space that contains ~800 apps takes ~5 min. One reason is that a separate CF API request (GET /v3/processes/<process-guid>/stats) is executed for every app process to fetch its stats.
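To make the cost concrete, the request pattern described above can be sketched as follows. This is a simplified model, not the CLI's actual code: stats_request_paths is a hypothetical helper, the process GUIDs are placeholders, and one process per app is an assumption.

```python
def stats_request_paths(process_guids):
    """Return one stats request path per process (the current, non-bulk pattern)."""
    return [f"/v3/processes/{guid}/stats" for guid in process_guids]


# Assumption for illustration: ~800 apps with one process each.
process_guids = [f"process-guid-{i}" for i in range(800)]
paths = stats_request_paths(process_guids)

# Every entry is a separate HTTP round trip to the Cloud Controller.
print(len(paths))
print(paths[0])
```

The number of round trips grows linearly with the number of processes in the space, which is the part a bulk endpoint would remove.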

There are additional reasons for the long execution time like https://github.com/cloudfoundry/cli/issues/2733.

Possible Fix

Implement a list endpoint to fetch process stats for multiple processes, e.g.

  • GET /v3/processes/stats with query params similar to GET /v3/processes, or
  • add an include=stats parameter to GET /v3/processes
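The two variants could look like this from a client's point of view. Neither endpoint exists today, so this is only a sketch: the function names and api_base are made up, and space_guids is used as the filter on the assumption that the new endpoint would accept the same filters as GET /v3/processes.

```python
from urllib.parse import urlencode


def bulk_stats_url(api_base, space_guids):
    """Variant 1 (hypothetical): a dedicated list endpoint for process stats."""
    query = urlencode({"space_guids": ",".join(space_guids)}, safe=",")
    return f"{api_base}/v3/processes/stats?{query}"


def processes_with_stats_url(api_base, space_guids):
    """Variant 2 (hypothetical): the existing list endpoint plus include=stats."""
    params = {"space_guids": ",".join(space_guids), "include": "stats"}
    return f"{api_base}/v3/processes?{urlencode(params, safe=',')}"


api_base = "https://api.example.org"  # placeholder, not a real deployment
print(bulk_stats_url(api_base, ["space-a", "space-b"]))
print(processes_with_stats_url(api_base, ["space-a"]))
```

Either way the client sends one request per page of results instead of one request per process.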

For good performance, the Cloud Controller should ideally be able to fetch the process stats from log-cache using a bulk API as well.

stephanme, Jan 11 '24 16:01

Usage of stats endpoint in CF CLI

  • cf apps command uses the state to display how many instances are running
  • cf app <app_name> displays various information like state and metrics of instances

Current stats endpoint

The current stats endpoint does the following:

  • Fetches the DesiredLRP from Diego for the given process_guid. Only the metric tags (used to filter LogCache envelopes by process_guid) and the isolation segment are taken from it. I think this call can be removed, as this info is also available in the CCDB.
  • Fetches metrics from LogCache using the GET API.
  • Fetches ActualLRPs from Diego, which contain the state, network information, etc.
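The last two steps amount to a join on the instance index. The sketch below only illustrates that idea and is not the Cloud Controller implementation: build_instance_stats is a hypothetical helper, and the input dictionaries are simplified stand-ins for the Diego and LogCache responses.

```python
def build_instance_stats(actual_lrps, metrics_by_index):
    """Combine Diego instance info with LogCache metrics, keyed by instance index."""
    stats = []
    for lrp in sorted(actual_lrps, key=lambda item: item["index"]):
        entry = {"index": lrp["index"], "state": lrp["state"], "host": lrp["host"]}
        # Metrics may be missing, e.g. for an instance that is not running.
        entry["usage"] = metrics_by_index.get(lrp["index"], {})
        stats.append(entry)
    return stats


# Simplified, made-up inputs for one process with two instances.
actual_lrps = [
    {"index": 1, "state": "CRASHED", "host": "10.0.16.19"},
    {"index": 0, "state": "RUNNING", "host": "10.0.16.18"},
]
metrics_by_index = {0: {"cpu": 0.1, "mem": 1024}}

instance_stats = build_instance_stats(actual_lrps, metrics_by_index)
for entry in instance_stats:
    print(entry)
```

A state-only endpoint would skip the LogCache half of this join and return just the Diego and CCDB fields.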

Current limitations of downstream services

Diego

Currently we can only fetch ActualLRPs for one process, not for several.

LogCache

  • Using the GET API we can fetch metrics for all processes and their instances for one app
  • The PromQL API supports fetching one metric for many apps. We have 8 metrics though, so it would result in 8 requests to LogCache. Calls to this API are far more expensive than calls to the Diego API.
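The trade-off between the two LogCache APIs can be written down as simple arithmetic. The helper and the strategy names below are hypothetical, and the model counts requests only; it deliberately ignores that a PromQL call costs more than a GET call.

```python
def logcache_request_count(strategy, num_apps, num_metrics=8):
    """Number of LogCache requests needed to cover all apps (request count only)."""
    if strategy == "get_per_app":
        # GET API: one request returns all processes and instances of one app.
        return num_apps
    if strategy == "promql_per_metric":
        # PromQL API: one request returns one metric for many apps.
        return num_metrics
    raise ValueError(f"unknown strategy: {strategy}")


print(logcache_request_count("get_per_app", num_apps=800))
print(logcache_request_count("promql_per_metric", num_apps=800))
```

For a space with ~800 apps, PromQL needs far fewer requests, but each one is heavier, so the request count alone does not settle which approach is cheaper.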

Possible improvements

Bulk endpoint to fetch state of instances for a whole space

To improve the performance of the cf apps command we could offer a bulk endpoint, which supports fetching the state (and the other info from Diego) for all instances in one space.

GET /v3/processes/state?space_guids=<space_guid>,...

  • The user can specify space_guids as a query parameter
  • This could be extended in the future with app_guids and process_guids query parameters if needed
  • Response contains info from Diego and CCDB, but not LogCache
  • Prerequisite is a bulk endpoint for ActualLRPs in Diego

Bulk endpoint for stats for all instances of an app

This endpoint would mostly be a convenience, since apps usually have only one or two processes (in my experience).

GET /v3/apps/:guid/processes/stats

  • By binding this endpoint to one app, we do not need a bulk endpoint for metrics from LogCache, as we can already fetch metrics for all processes for an app.
  • Endpoint would show exactly the same data as the current stats endpoint, but including the process_guid.
  • Prerequisite is also a bulk endpoint for ActualLRPs

svkrieger, Jun 03 '25 14:06

The Diego folks have already implemented the endpoint for fetching ActualLRPs for several process_guids, which resolves our prerequisite, so I'd like to get a discussion going about the state endpoint. I'd prefer to offer fetching the state (and all other data from the ActualLRP) for only ONE space, not several. This would limit the load we put on CC and Diego. Also, the only current use case is the cf CLI, which needs the state of all processes in a whole space for the cf apps command.

I see two options where we could put such an endpoint:

Option 1: /v3/processes/state?space_guid=...

[+] Could be extended in the future with query parameters app_guids and process_guids

[-] Query parameter space_guid (singular) does not exist yet in the API

Option 2: /v3/spaces/:space_guid/processes/state

[+] Very clear that state can only be fetched for a whole space

[-] We should also serve /v3/spaces/:space_guid/processes then, as that does not exist yet

Payload

For the payload I'd suggest we offer data similar to the current stats endpoint, but without the fields provided by LogCache and with two added GUIDs (process, app).

{
  "resources": [
    {
      "type": "web",
      "process_guid": "some-process-guid",              # NEW
      "app_guid": "some-app-guid",                      # NEW
      "index": 0,
      "instance_guid": "77e31085-3d74-4518-70d7-d07f",
      "state": "RUNNING",
      "routable": true,
      "host": "10.0.16.18",
      "instance_internal_ip": "10.255.235.67",
      "uptime": 92329,
      "fds_quota": 16384,
      "isolation_segment": null,
      "details": null,
      "instance_ports": [
        {
          "external": 0,
          "internal": 8080,
          "external_tls_proxy_port": 61000,
          "internal_tls_proxy_port": 61001
        },
        {
          "external": 0,
          "internal": 8080,
          "external_tls_proxy_port": null,
          "internal_tls_proxy_port": 61443
        },
        {
          "external": 0,
          "internal": 2222,
          "external_tls_proxy_port": 61001,
          "internal_tls_proxy_port": 61002
        }
      ]
    }
  ]
}
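With a payload of this shape, a client such as the cf CLI could derive the running-instance counts for cf apps in a single pass. The sketch below assumes the proposed field names (app_guid, type, state); running_instances_by_process and the sample data are made up for illustration.

```python
from collections import defaultdict


def running_instances_by_process(resources):
    """Count running and total instances per (app_guid, process type)."""
    counts = defaultdict(lambda: {"running": 0, "total": 0})
    for instance in resources:
        key = (instance["app_guid"], instance["type"])
        counts[key]["total"] += 1
        if instance["state"] == "RUNNING":
            counts[key]["running"] += 1
    return dict(counts)


# Trimmed-down sample response; only the fields used above are included.
response = {
    "resources": [
        {"app_guid": "app-1", "type": "web", "index": 0, "state": "RUNNING"},
        {"app_guid": "app-1", "type": "web", "index": 1, "state": "CRASHED"},
        {"app_guid": "app-2", "type": "worker", "index": 0, "state": "RUNNING"},
    ]
}

summary = running_instances_by_process(response["resources"])
for (app_guid, process_type), count in summary.items():
    print(f"{app_guid} {process_type}: {count['running']}/{count['total']}")
```

Including app_guid and process_guid in each resource is what makes this grouping possible without additional lookups.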

svkrieger, Jul 17 '25 11:07