prom_ex icon indicating copy to clipboard operation
prom_ex copied to clipboard

[FEATURE] Delayed startup for polled/manual metrics

Open LostKobrakai opened this issue 2 years ago • 5 comments

Is your feature request related to a problem? Please describe. PromEx does start all metrics collection together, but manual or polled metrics might need processes started later in the supervision tree before being able to execute successfully. There's the :manual_metrics_start_delay config, but this neither supports polled metrics nor is is it really bullet prove given startup may not take a fixed amount of time.

Describe the solution you would like to see A way to have polled and the initial manual metric fetching be delayed without a timeout.

How would you expect this feature to work Both polled and manual metrics have a group name. It would be interesting to just configure starting them like this for the supervision tree.

children = [
  {MyApp.PromEx, polled: :delay, manual: :delay},
  …,
  …,
  {MyApp.PromEx.StartMetrics, polled: ["name_a", "name_b"], manual: :all},
  …,
  {MyApp.PromEx.StartMetrics, polled: ["name_c"]}
]

Besides the plain problem statement, this also allows granular startup based on individual metric groups' dependencies, over the "one fixed deley" of :manual_metrics_start_delay.

Additional context

Edit: This approach would also allow clean shutdowns if metric collection is stopped before dependencies of it are stopped.

LostKobrakai avatar Feb 10 '22 12:02 LostKobrakai

Totally missed this issue. Sorry about that!

I am not sure that I follow 100% based on the example. But it sounds like you would defer those groups from starting and instead you would manually activate them based on the StartMetrics child processes?

akoutmos avatar Feb 25 '22 04:02 akoutmos

Yeah, that's the idea. We have quite a few periodic or manual metrics, which require our repo to be started, but of course we don't want to miss telemetry of the repo itself.

LostKobrakai avatar Feb 25 '22 08:02 LostKobrakai

For the missed metrics problem, I would suggest adding PromEx as the first item in your application.ex supervision tree. IIRC Ecto emits an init event that PromEx attaches to. Phoenix on the other hand does not have any init events (I should probably make a PR for that to Phoenix now that I mention it), and PromEx jumps through some hoops to get the config metrics from Phoenix.

That aside, I can totally see this being added to PromEx as it is a useful feature. Is this something that you would be interested in tackling?

akoutmos avatar Feb 25 '22 15:02 akoutmos

If you put prom_ex first in your application.ex, but have a polling metric that relies on the Ecto repo being available, prom_ex will run the polling code immediately, resulting in a runtime error like this:

[error] Error when calling MFA defined by measurement: MyApp.SalesOrders.PromExPlugin :execute_sales_orders_metrics []
Class=:error
Reason=%RuntimeError{
  message: "could not lookup Ecto repo MyApp.Repo because it was not started or it does not exist"
}

This can be mitigated by checking if the repo is running in your polling callback, or switching the order in application.ex, but that has side-effects.

mikl avatar Aug 16 '22 19:08 mikl

Should someone want simply skip the first metrics run (to avoid the error), you can add something like this to your metrics callback:

  if Enum.member?(Ecto.Repo.all_running(), MyApp.Repo) do
      :telemetry.execute(
        @event_name,
        %{count: Function.to_get_count()},
        %{}
      )
    end

mikl avatar Aug 16 '22 19:08 mikl