lxd
lxd copied to clipboard
Avoid race between operation and events listener
Fixes intermittent issues with the clustering_image_refresh
tests (potentially among others as well).
If we implicitly create the events listener after the operation has fired, then there's a chance the operation will have completed and emitted the corresponding event before we even begin listening.
Waiting for an operation can thus potentially block indefinitely, if we first encounter the operation in a running state, but then it completes before we set up the listener.
Instead, we shouldn't try to create an events listener automatically when setting up the wait process as we have no way of determining the current state of an operation yet.
If we want a listener with our operation, we should make sure to set it up before making the operation request.
Hm, too optimistic I guess. Drafting for now.
Okay, so the test failures were due to remotes that were unauthorized to interact with the events API. Previously they all set up their events listeners after the operation events had all been emitted already, so there was no error, but now that the events listener is set up beforehand in every case, they're hitting an unauthorized
error on GET /1.0/events
.
In particular the remote_usage
tests were failing to copy an image to a remote, because that remote wasn't authorized to talk back to the events API. I've worked around it by just removing the events listener from the CreateImage
client function.
I can make all of the client functions skip the event listener and that should solve the problem, but to be honest I'm not too familiar with the events API so I don't exactly know why it's used with these operations in the first place. If there's a good reason to keep them around, then we can try to do something fancy and detect in the client if the action involves a remote, and if so check if that remote is also trusted by us, and only then include the events listener.
The /1.0/events
API is being used as a catch-all in the client so that only one websocket is used at a time. I've encountered issues with it when working with authorization as well. For example, a user with permission to exec into an instance but without permission to view all events in a project is unable to do so via the client. The addition of the skipListener
argument was intended to fix this by instead using /1.0/operations/wait
but unfortunately we have found that long-polling can be very inconsistent (especially when there is an intermediate load-balancer).
Tom and I came up with the idea of adding websocket support to the operation wait endpoint to counteract this (https://warthogs.atlassian.net/browse/LXD-726?atlOrigin=eyJpIjoiMTM1ZjY4NTgyY2MxNDAzOGIwMjk2MTQzODYzZGZkODgiLCJwIjoiaiJ9). It will need more discussion before we implement anything though.
It sounds to me like these connections should actually be trusted by the remote then. Maybe we need an access handler on GET /1.0/events
that allows untrusted requests to collect only operation
type events, since GET /1.0/operations/id/wait
is already untrusted.
So untrusted operation requests instead require a secret passed as a query parameter here: https://github.com/canonical/lxd/blob/0487e51cde8fdac810a6a830ca4e382005b1dab7/lxd/operations.go#L907-L924
Making GET /1.0/events
untrusted by default then would require us to pass in the secret as well (likely as a query parameter) and throw out non-operation type events and any operation events that don't also have the secret in their metadata.
So this would mean that the events API would only serve exactly one operation to an untrusted remote.
@masnax are you going to land the cherry-pick from Incus separately?
@masnax shall we close this for now?
@tomponline As we had discussed a few weeks ago, I've also tested to make sure this doesn't interfere with the VM agent. It seems to work fine for launching / execing into VMs.
@masnax should this be moved back into draft status?
Moved back to draft status as not active.
@masnax shall we close this until you are able to work on it again? To get the open PR list down.