lxd
lxd copied to clipboard
Race between untrusted operations and events websocket listener
An issue causing intermittent test failures in the clustering image refresh tests was fixed by https://github.com/canonical/lxd/pull/12754/commits/e7b3c92f2c5ea0b9de9668f06a17a3fa77b239c0
The problem was a race where an operation would complete after setting up the events websocket listener, but before the listener is added to the event handler, and so we would block while waiting for the operation complete event. The fix works by reducing the time between the creation of the listener and adding it to the handler.
While this works, I think it is a bit obscure and overly couples the client with the order of operations on the server. It made more sense to me to just ensure we set up the client and all its listeners before we make any request to create an operation, which those listeners are meant to track. While testing this over here: https://github.com/canonical/lxd/pull/12885 I found a separate race that is connected to the one solved above, resulting in this error: https://github.com/canonical/lxd/actions/runs/7920353180/job/21623203927?pr=12885#step:13:15409
This happens if we run lxc image copy localhost:${fingerprint} lxd2: --mode=push
to copy an image across 2 remotes.
Internally, we make the following POST /1.0/images/{fingerprint}/export
to localhost
here. From this endpoint we make a CreateImage
request to lxd2
without a client keypair. The request is untrusted since it's run from the server, but when we add lxd2
as a remote, it only trusts the client, not the server.
The race comes in here when CreateImage
performs the request which spawns an operation on lxd2
. Around this time, it also sets up an event listener which requires trust. If the operation completes before we set up the listener then everything works fine, but if the listener manages to start up first, then the request will fail with an unauthorized
error.
I think a simple solution in this case would be to just ensure that we don't spawn an events listener if the client hasn't been assigned a keypair. We can infer from this that the request we are trying to make is going to be untrusted, so we shouldn't try to separately connect to a trusted endpoint.
I think a simple solution in this case would be to just ensure that we don't spawn an events listener if the client hasn't been assigned a keypair. We can infer from this that the request we are trying to make is going to be untrusted, so we shouldn't try to separately connect to a trusted endpoint.
This makes sense. Although I am wondering why the image copy
succeeds if the two servers dont trust each other and there is no temporary key?
I think a simple solution in this case would be to just ensure that we don't spawn an events listener if the client hasn't been assigned a keypair. We can infer from this that the request we are trying to make is going to be untrusted, so we shouldn't try to separately connect to a trusted endpoint.
This makes sense. Although I am wondering why the
image copy
succeeds if the two servers dont trust each other and there is no temporary key?
There is a secret set in the request header or operation metadata which is manually verified in the corresponding images
and operations
endpoints, if that's what you mean by using a temporary key.
I looked into handling the secret in the events endpoints as well, but I couldn't figure out an elegant solution for verifying it. Best I thought of is to intercept any events going through the websocket and only propagate the events which contain an operation that has the secret in its metadata. That seems overly complex to me.
@masnax ok in that case im not following what you mean by:
I think a simple solution in this case would be to just ensure that we don't spawn an events listener if the client hasn't been assigned a keypair
@masnax ok in that case im not following what you mean by:
I think a simple solution in this case would be to just ensure that we don't spawn an events listener if the client hasn't been assigned a keypair
The client function which spawns the racy event listener is CreateImage. It sets up an http.Request with the secret in its header. Performing the request spawns an operation.
The request that spawns the operation bypasses the the normal auth process of comparing the request peer certs to the locally trusted certificates, and instead verifies the secret, which was sent to that server beforehand.
CreateImage
returns an operation struct. When we call operation.Wait
, we create an events websocket listener which will listen for any type of event (logging, operations, etc). This listener hits the /1.0/events
endpoint which only supports the traditional auth method.
So what I'm suggesting is to not leave spawning the listener until after the operation is created and we call operation.Wait
, but instead create it before we make the request that spawns the operation. We do this under the constraint that the underlying ProtocolLXD
actually has a client keypair assigned to it, and otherwise set skipListener
to true.
@masnax ok lets have a look at a PR that does this when you get a chance. Thanks
Filtering events by what the user can see, ala https://github.com/lxc/incus/issues/292 would be helpful here.
@markylaing says
Ultimately, filtering events by what the user is able to view would solve the PR. Perhaps we can add an operation_secret query parameter and allow untrusted requests on the events endpoint.