sdk-python icon indicating copy to clipboard operation
sdk-python copied to clipboard

[Feature Request] Allow custom metric buckets

Open TheCodeWrangler opened this issue 10 months ago • 4 comments

Currently the default binning for activity metrics are top out at 60 seconds. This limits my observability into activities which might take a long time or even take more than 60 seconds on P99 metrics.

I would like to have a method of specifying a desired binning strategy on activity metrics.

TheCodeWrangler avatar Feb 26 '25 18:02 TheCodeWrangler

Yes, now that https://github.com/temporalio/sdk-core/pull/844 is available, we can do this here too. I have changed the title.

cretz avatar Feb 26 '25 19:02 cretz

Any estimate on a timeline? Is this something I could help speed along with a PR?

As best I can tell (though I am not very familiar with rust) the sdk-core binding accepts histogram_bucket_overrides

So possibly just python updates in runtime.py

@dataclass(frozen=True)
class PrometheusConfig:
    """Configuration for Prometheus metrics endpoint."""

    bind_address: str
    counters_total_suffix: bool = False
    unit_suffix: bool = False
    durations_as_seconds: bool = False
    histogram_bucket_overrides: dict[str, list[float]] | None = None # New property

    def _to_bridge_config(self) -> temporalio.bridge.runtime.PrometheusConfig:
        return temporalio.bridge.runtime.PrometheusConfig(
            bind_address=self.bind_address,
            counters_total_suffix=self.counters_total_suffix,
            unit_suffix=self.unit_suffix,
            durations_as_seconds=self.durations_as_seconds,
            histogram_bucket_overrides=self.histogram_bucket_overrides or {}, # New arg
        )

I am pretty naive on what might be involved here though.

TheCodeWrangler avatar Feb 27 '25 13:02 TheCodeWrangler

There is no estimate on timeline, but it is on the near-term backlog since it was recently implemented on our Core side. Yes what you have there is mostly correct and we definitely welcome PRs if you want to tackle it (will want a test to confirm it works for built-in and user-defined histograms too, which is usually just as simple as making an HTTP call to the Prometheus endpoint and checking contents).

cretz avatar Feb 27 '25 14:02 cretz

I took a stab in a PR. https://github.com/temporalio/sdk-python/pull/781

This was branched off of 1.10.0 tag because I was not able to get uv build to succeed on the main branch, but was able to get the poetry build working on the tag. This limited me from being able to compile and test cahnges on the main.

Was able to get a test to work and confirm that custom histogram binning worked for a custom metric BUT was not seeing it have an effect on the workflow end-to-end latency. This has me a bit nervoous that the update to workflow or activity binning (which is all i actually care about at the moment) will require a change to the core-sdk.

TheCodeWrangler avatar Mar 03 '25 16:03 TheCodeWrangler

Done in #781, thanks @TheCodeWrangler!

cretz avatar Apr 21 '25 15:04 cretz