lxd icon indicating copy to clipboard operation
lxd copied to clipboard

Avoid race between operation and events listener

Open masnax opened this issue 1 year ago • 7 comments

Fixes intermittent issues with the clustering_image_refresh tests (potentially among others as well).

If we implicitly create the events listener after the operation has fired, then there's a chance the operation will have completed and emitted the corresponding event before we even begin listening.

Waiting for an operation can thus potentially block indefinitely, if we first encounter the operation in a running state, but then it completes before we set up the listener.

Instead, we shouldn't try to create an events listener automatically when setting up the wait process as we have no way of determining the current state of an operation yet.

If we want a listener with our operation, we should make sure to set it up before making the operation request.

masnax avatar Feb 13 '24 17:02 masnax

Hm, too optimistic I guess. Drafting for now.

masnax avatar Feb 13 '24 17:02 masnax

Okay, so the test failures were due to remotes that were unauthorized to interact with the events API. Previously they all set up their events listeners after the operation events had all been emitted already, so there was no error, but now that the events listener is set up beforehand in every case, they're hitting an unauthorized error on GET /1.0/events.

In particular the remote_usage tests were failing to copy an image to a remote, because that remote wasn't authorized to talk back to the events API. I've worked around it by just removing the events listener from the CreateImage client function.

I can make all of the client functions skip the event listener and that should solve the problem, but to be honest I'm not too familiar with the events API so I don't exactly know why it's used with these operations in the first place. If there's a good reason to keep them around, then we can try to do something fancy and detect in the client if the action involves a remote, and if so check if that remote is also trusted by us, and only then include the events listener.

masnax avatar Feb 14 '24 02:02 masnax

The /1.0/events API is being used as a catch-all in the client so that only one websocket is used at a time. I've encountered issues with it when working with authorization as well. For example, a user with permission to exec into an instance but without permission to view all events in a project is unable to do so via the client. The addition of the skipListener argument was intended to fix this by instead using /1.0/operations/wait but unfortunately we have found that long-polling can be very inconsistent (especially when there is an intermediate load-balancer).

Tom and I came up with the idea of adding websocket support to the operation wait endpoint to counteract this (https://warthogs.atlassian.net/browse/LXD-726?atlOrigin=eyJpIjoiMTM1ZjY4NTgyY2MxNDAzOGIwMjk2MTQzODYzZGZkODgiLCJwIjoiaiJ9). It will need more discussion before we implement anything though.

markylaing avatar Feb 14 '24 09:02 markylaing

It sounds to me like these connections should actually be trusted by the remote then. Maybe we need an access handler on GET /1.0/events that allows untrusted requests to collect only operation type events, since GET /1.0/operations/id/wait is already untrusted.

masnax avatar Feb 14 '24 18:02 masnax

So untrusted operation requests instead require a secret passed as a query parameter here: https://github.com/canonical/lxd/blob/0487e51cde8fdac810a6a830ca4e382005b1dab7/lxd/operations.go#L907-L924

Making GET /1.0/events untrusted by default then would require us to pass in the secret as well (likely as a query parameter) and throw out non-operation type events and any operation events that don't also have the secret in their metadata.

So this would mean that the events API would only serve exactly one operation to an untrusted remote.

masnax avatar Feb 15 '24 02:02 masnax

@masnax are you going to land the cherry-pick from Incus separately?

tomponline avatar Feb 20 '24 12:02 tomponline

@masnax shall we close this for now?

tomponline avatar Feb 21 '24 10:02 tomponline

@tomponline As we had discussed a few weeks ago, I've also tested to make sure this doesn't interfere with the VM agent. It seems to work fine for launching / execing into VMs.

masnax avatar Mar 18 '24 19:03 masnax

@masnax should this be moved back into draft status?

tomponline avatar Mar 28 '24 10:03 tomponline

Moved back to draft status as not active.

tomponline avatar Apr 05 '24 07:04 tomponline

@masnax shall we close this until you are able to work on it again? To get the open PR list down.

tomponline avatar Apr 11 '24 09:04 tomponline