lxd icon indicating copy to clipboard operation
lxd copied to clipboard

API request metrics for disaster recovery

Open hamistao opened this issue 1 year ago • 15 comments

This introduces API rates metrics for Disaster Recovery, the spec related to this should be published once this is merged.

Here is a raw output example of /1.0/metrics with the new metrics: https://pastebin.canonical.com/p/wvhcnK7tq6/

hamistao avatar Jul 26 '24 05:07 hamistao

Heads up @mionaalex - the "Documentation" label was applied to this issue.

github-actions[bot] avatar Jul 26 '24 05:07 github-actions[bot]

@tomponline Please take a look at this when you can, I would appreciate some comments on the overall structure of the solution so far.

hamistao avatar Jul 26 '24 05:07 hamistao

This is uncomplete, to be done: 1.Handle metric value updates on asynchronous operations 2.Fine tune entity_type assignment 3.Fix GH tests; 4.Manual tests 5.Docs

You can convert this to a checklist in GH so we can see progress as you tick them off.

tomponline avatar Jul 26 '24 07:07 tomponline

@mseralessandri @tomponline This is ready for a full review. The categorization of the endpoints on entity types looks like this, those marked with * are endpoints I categorized on my own as they were not spefically mentioned during the discussions, if you think I should leave those separated or on other category let me know and I will make the changes:

TypeInstance for /{version}/containers, /{version}/virtual-machines and /{version}/instances endpoints.
TypeNetwork for /{version}/network-zones, /{version}/network-allocations, /{version}/network-acls and /{version}/networks endpoints.
TypeStoragePool for /{version}/storage-pools and /{version}/storage-volumes endpoints.
TypeIdentity for /{version}/auth and /{version}/certificates* endpoints.
TypeImage for /{version}/images endpoints.
TypeNode for /{version}/cluster endpoints.
TypeProject for /{version}/projects endpoints.
TypeProfile for /{version}/profiles endpoints.
TypeWarning for /{version}/warnings endpoints.
TypeOperation for /{version}/operations endpoints.
TypeServer for /{version}/events, /{version}/metrics, /, /1.0/, /1.0/events, /1.0/internal and /{version}/resources endpoints.*

@tomponline There is one caveat that should be mentioned. I am using the 400 status on operations to derive if the request result is a server error. But perhaps the 400 status is too broad and may also include some types of client errors (e.g. trying to add a block device to a container). I am not sure if I should handle that differently, maybe perform a more intricate analysis of the operation instead of just checking the status code.

hamistao avatar Jul 30 '24 21:07 hamistao

@hamistao as discussed in our 1:1 lets go with storing a "record request result" callback function in the request's context (added by the middleware), that the operation can then detect the presence of and call it when the operation is completed.

This callback function needs to ensure it can only be executed once (perhaps using sync.Once) and subsequent calls to it will be no-ops.

tomponline avatar Aug 02 '24 07:08 tomponline

@tomponline We can proceed this way if you'd like, but I am afraid this approach still leaves room for the operation to finish before setting the callback function since the start of the operation in triggered inside the handler and therefore before the middleware can set this callback function (the middleware would need the status code to determine wether the callback function should be set so could only be performed after the handler).

To work around this I came up with two different approaches:

  1. Adding the callback function inside OperationResponse instead of in the middleware.

  2. Moving the operation start from Render to the middleware.

As for avoid waiting for Operations inside handlers or not providing the request to the operation if doing so, it is possible but this is useful for event logging so maybe it would be nice to keep this behavior as is.

We can have a quick call to discuss all this if you'd like.

hamistao avatar Aug 02 '24 14:08 hamistao

We can have a quick call to discuss all this if you'd like.

Yes we can, as I don't follow this:

but I am afraid this approach still leaves room for the operation to finish before setting the callback function since the start of the operation in triggered inside the handler and therefore before the middleware can set this callback function (the middleware would need the status code to determine wether the callback function should be set so could only be performed after the handler).

The context var should be set before the API handler function is started (the one that may or may not start an operation) in the middleware that is executed first in order to increment the "in progress" counter.

tomponline avatar Aug 02 '24 14:08 tomponline

Setting the callback function before the main handler is possible but could run into the problem of being called too early by an operation that is created and completed inside the main handler.

hamistao avatar Aug 02 '24 14:08 hamistao

Setting the callback function before the main handler is possible but could run into the problem of being called too early by an operation that is created and completed inside the main handler.

Thats literally the whole point of what we are trying to enable :)

We already discussed that if the operations in the API handler itself shouldnt mark the request as done when they finish they shouldnt be passed the request var.

tomponline avatar Aug 02 '24 14:08 tomponline

Oh sorry, I thought we hadn't yet decided on what to do with the request var, if that is the case then the callback approach should work just fine indeed. In any case, if and when you have a few minutes please ping me so we can briefly discuss this.

hamistao avatar Aug 02 '24 14:08 hamistao

No problem, yes please can you propose a new day/time for our next 1:1

tomponline avatar Aug 02 '24 15:08 tomponline

@tomponline Just pushed the discussed changes, feel free to take a look when you can.

hamistao avatar Aug 09 '24 03:08 hamistao

@ru-fu Please take a look at the doc changes here when you can. Thanks!

hamistao avatar Aug 16 '24 03:08 hamistao

@tomponline This is ready for another look!

hamistao avatar Aug 23 '24 08:08 hamistao

@mseralessandri please can you look at the doc changes here and let us know if this aligns with your expectations? Thanks

tomponline avatar Aug 27 '24 10:08 tomponline

@tomponline This is ready for another look.

The changes requested by @markylaing were all addressed as well.

@mseralessandri @mionaalex Can also review the doc changes if needed.

hamistao avatar Sep 05 '24 05:09 hamistao

@tomponline The requested changes were made.

hamistao avatar Sep 06 '24 07:09 hamistao