Azure NodeAttestor does not work with VM scale sets

Open azdagron opened this issue 4 years ago • 7 comments

The system-assigned identity is the same for all VMs in a scale set, so all agents in the scale set get the same identity.

azdagron avatar Sep 14 '21 14:09 azdagron

There is a similar problem with node pools.

azdagron avatar Sep 14 '21 14:09 azdagron

To provide more details: the SPIFFE ID for Azure MSI has the following format:

spiffe://<trust domain>/spire/agent/azure_msi/<tenant_id>/<principal_id>

The problem is that, in our test case, the principal_id corresponds to the agent pool, so all the nodes in the node pool end up with the same principal_id. Only the first node can claim this identity, so all subsequent nodes are rejected by the SPIRE server:

time="2021-09-11T17:17:29Z" level=error msg="Agent crashed" error="failed to get SVID: error getting attestation response from SPIRE server: rpc error: code = Internal desc = failed to attest: azure-msi: MSI token has already been used to attest an agent"
On the server:
time="2021-09-11T17:17:27Z" level=error msg="Failed to attest" caller-addr="xxx.xxx.xxx.xxx:57104" error="rpc error: code = Unknown desc = azure-msi: MSI token has already been used to attest an agent" method=AttestAgent node_attestor_type=azure_msi service=agent.v1.Agent subsystem_name=api

As a result, we can only run the agents successfully when nodepool_size = 1.
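To make the collision concrete, here is a minimal Go sketch of the behavior described above. It is not SPIRE's actual code: `agentID`, `node`, and `attestAll` are hypothetical names, and the "claim once" rule is modeled with a plain map. It only assumes the SPIFFE ID format quoted earlier.

```go
package main

import "fmt"

// agentID builds an agent SPIFFE ID in the format quoted in this thread:
// spiffe://<trust domain>/spire/agent/azure_msi/<tenant_id>/<principal_id>
func agentID(trustDomain, tenantID, principalID string) string {
	return fmt.Sprintf("spiffe://%s/spire/agent/azure_msi/%s/%s",
		trustDomain, tenantID, principalID)
}

// node is a hypothetical stand-in for a VM in a node pool.
type node struct {
	Name        string
	TenantID    string
	PrincipalID string
}

// attestAll models the server-side rule that an agent ID can be claimed
// only once: the first node to present an ID is accepted, and every later
// node presenting the same ID is rejected.
func attestAll(trustDomain string, nodes []node) (accepted, rejected []string) {
	claimed := map[string]bool{}
	for _, n := range nodes {
		id := agentID(trustDomain, n.TenantID, n.PrincipalID)
		if claimed[id] {
			rejected = append(rejected, n.Name)
			continue
		}
		claimed[id] = true
		accepted = append(accepted, n.Name)
	}
	return accepted, rejected
}
```

With three nodes that share one principal_id, only the first is accepted, which matches what we see with any node pool larger than one.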

mrsabath avatar Sep 14 '21 16:09 mrsabath

To provide a little background: SPIRE Agents need to be uniquely identifiable, and today we generally support only one agent per node. We originally chose the MSI Principal ID to provide uniqueness for this purpose since in our testing (and perhaps at the time), each Principal ID was scoped to an individual node.

We chose to use MSI, in general, as the basis for Azure node attestation because it was the only viable option at the time. It's certainly not ideal, and IIRC it's even an opt-in feature on a per-node basis?

At any rate, there definitely seems to be some confusion here because we have assumed that this can uniquely identify a node; however, Azure sees things differently. In my opinion, this particular problem is a layering violation - the MSI is attempting to be application-scoped when it is in fact node-scoped.

Is there an alternative to MSI for Azure node attestation?

evan2645 avatar Sep 20 '21 23:09 evan2645

There may be support for a new token type in Azure that could address this problem. If anyone has an Azure environment where they'd be able to research this, that would be helpful.

rturner3 avatar May 08 '23 19:05 rturner3

This issue is stale because it has been open for 365 days with no activity.

github-actions[bot] avatar May 07 '24 22:05 github-actions[bot]

This issue was closed because it has been inactive for 30 days since being marked as stale.

github-actions[bot] avatar Jun 07 '24 22:06 github-actions[bot]

Hey @azdagron! I'm interested in adding this support and getting Azure attestation to work for scale set VMs. I've done some investigation so I'd like to get this conversation going and get your thoughts on this.

There are actually a couple of issues here:

  • The VM Azure resource type we query on the SPIRE server side is different from the scale set VM type (we currently use Microsoft.Compute/virtualMachines, whereas scale set VMs have the type microsoft.compute/virtualmachinescalesets/virtualmachines).
  • As you pointed out, Azure identities are assigned to scale sets, which means all nodes/VMs within a scale set get the same identity, or even identities (with user-assigned identities you can actually assign multiple identities to the scale set).

So here is what I propose at a high level:

  1. On the agent side, we get the agent to send not only a token but also the metadata belonging to the VM.
  2. On the server side, we validate the incoming token and extract the principal ID (which we already do). We then query Azure for scale set VMs that are assigned that identity AND have the resource name or ID that was sent as part of the VM metadata (we will have this information about the VM in azure.MSIAttestationData as a result of the change in step 1). This should return either 0 or 1 VM; if we get 1, we attest the agent as we do today.

In step 2, we would also use the VM name (or probably the resource ID) when forming the SPIFFE ID, which ensures that each VM attests only once.
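A rough Go sketch of the server-side matching, to make the idea easier to discuss. Everything here is an assumption for illustration: `vmRecord`, `findScaleSetVM`, and `scaleSetAgentID` are hypothetical names, the inventory slice stands in for whatever the Azure query would return, and the SPIFFE ID layout with a trailing VM name is just one possible shape, not an agreed format.

```go
package main

import (
	"fmt"
	"strings"
)

// Resource type of scale set VMs, as mentioned in this thread.
const scaleSetVMType = "microsoft.compute/virtualmachinescalesets/virtualmachines"

// vmRecord is a hypothetical view of one VM returned by an Azure query.
type vmRecord struct {
	Type         string
	Name         string
	PrincipalIDs []string // a scale set can carry several identities
}

// findScaleSetVM returns the single scale set VM that both carries the
// principal ID extracted from the token and has the name sent by the agent.
// Anything other than exactly one match is treated as a failure.
func findScaleSetVM(inventory []vmRecord, principalID, vmName string) (vmRecord, error) {
	var matches []vmRecord
	for _, vm := range inventory {
		// Compare case-insensitively, since the casing of the type varies.
		if !strings.EqualFold(vm.Type, scaleSetVMType) || vm.Name != vmName {
			continue
		}
		for _, p := range vm.PrincipalIDs {
			if p == principalID {
				matches = append(matches, vm)
				break
			}
		}
	}
	if len(matches) != 1 {
		return vmRecord{}, fmt.Errorf("expected exactly 1 matching VM, got %d", len(matches))
	}
	return matches[0], nil
}

// scaleSetAgentID appends the VM name so that each VM gets its own ID
// (hypothetical layout).
func scaleSetAgentID(trustDomain, tenantID, principalID, vmName string) string {
	return fmt.Sprintf("spiffe://%s/spire/agent/azure_msi/%s/%s/%s",
		trustDomain, tenantID, principalID, vmName)
}
```

Two VMs in the same scale set would then share a principal ID but still end up with distinct agent IDs, and a VM name that is not in the scale set would fail the lookup.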

There is also the question of whether the token belongs to a system-assigned identity or a user-assigned one. That matters because the way you query Azure differs for resources with system-assigned vs. user-assigned identities. But maybe we leave this aside for now and focus on agreeing on a reasonable path for scale set VMs to be attestable at all.

cc: @evan2645 @rturner3

Also @mrsabath it's been a while since you made the comment above, but I'm curious if you were ever able to get around this issue somehow.

moe-omar avatar Oct 04 '24 23:10 moe-omar

This issue is stale because it has been open for 365 days with no activity.

github-actions[bot] avatar Oct 06 '25 22:10 github-actions[bot]

This issue was closed because it has been inactive for 30 days since being marked as stale.

github-actions[bot] avatar Nov 05 '25 22:11 github-actions[bot]