buildx multi-node builder should not fall back to detected platform if node with manual platform errors

Description

Repro steps:

Create a driver with this spec:

{"Name":"remote","Driver":"remote","Nodes":[{"Name":"remote0","Endpoint":"tcp://localhost:1234","Platforms":[{"architecture":"arm64","os":"linux"}],"Flags":null,"DriverOpts":null,"Files":null}]}

Send a request to the builder for a different platform:

docker buildx build -t out --platform linux/amd64 .

Expected result:

I would expect the build to fail, because no available builder matches the requested platform.

Actual result:

buildx sends the request to the ARM builder.

Additional info:

The full problem is more complicated and involves how the Node fallback logic works when one of the builders is unavailable. Basically, we're seeing cases where, if no platform is specified and one of the builders is unavailable, the request lands on an unpredictable Node with an unpredictable Platform.

I think it would be fine if this behavior was opt-in, i.e., there was a driver opt like:

{"Name":"remote","DriverOpts":{"DefaultPlatform": "linux/amd64", "StrictPlatformMatching": true}}

Nov 15 '23 16:11 nicks

> when one of the builders is unavailable.

But in your example you only have a single-node builder. I would expect this to fail if you have a multi-node builder, the platform for a node is manually set to linux/amd64, and that node is unavailable. In that case no fallback to the arm node should happen, and the build should fail because the amd64 node is down.

Nov 15 '23 17:11 tonistiigi

In all my tests, it silently falls back. I'm not sure I understand all the codepaths. The problem might be this function?

https://github.com/docker/buildx/blob/cb378866587090995fba030bc75a9ca2fb4d8d26/build/build.go#L121

If some of the nodes are in an error state, but there's at least one working Node, then it filters out the bad nodes and "succeeds".
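To make the suspected behavior concrete, here is a minimal sketch of that kind of filter. It is not the code from build.go: the `Node` type, `filterAvailable`, and `errUnreachable` are hypothetical names invented for illustration, and platforms are plain strings for brevity.

```go
package main

import (
	"errors"
	"fmt"
)

// Node is a simplified, hypothetical stand-in for a builder node.
type Node struct {
	Name      string
	Platforms []string // manually configured platforms, e.g. "linux/arm64"
	Err       error    // non-nil if the node could not be reached
}

// errUnreachable is a placeholder error for a node that is down.
var errUnreachable = errors.New("connection refused")

// filterAvailable drops every node that is in an error state and only
// reports an error when no node is left. This mirrors the behavior
// described above: one healthy node is enough to "succeed".
func filterAvailable(nodes []Node) ([]Node, error) {
	var healthy []Node
	var firstErr error
	for _, n := range nodes {
		if n.Err != nil {
			if firstErr == nil {
				firstErr = n.Err
			}
			continue
		}
		healthy = append(healthy, n)
	}
	if len(healthy) == 0 {
		if firstErr != nil {
			return nil, firstErr
		}
		return nil, errors.New("no nodes configured")
	}
	return healthy, nil
}

func main() {
	nodes := []Node{
		{Name: "amd0", Platforms: []string{"linux/amd64"}, Err: errUnreachable},
		{Name: "arm0", Platforms: []string{"linux/arm64"}},
	}
	healthy, err := filterAvailable(nodes)
	// The amd64 node's error is silently discarded; only arm0 remains.
	fmt.Println(len(healthy), healthy[0].Name, err)
}
```

With a filter like this, the error on the amd64 node never reaches the caller, so later platform selection only sees the arm64 node.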

Nov 15 '23 21:11 nicks

That does not look correct indeed. If the node has a manual platform set, then I think that node should always be used, without any fallbacks.

But the problem is that if there is no --platform and the first node is down, then we can't know whether the "manual platform" is the native (default) platform for the node. I guess we could just assume that if any manual platform is set, then the first one set is the native one.

Nov 15 '23 21:11 tonistiigi

Do you agree that this issue should just be fixed by updating the node resolution logic, without any new keys like the "StrictPlatformMatching" in your first comment?

Nov 15 '23 21:11 tonistiigi

Ya that works for me

Nov 15 '23 23:11 nicks

Proposed changes for clarity:

Multi-node cluster where a node has a manual platform set, that node is in an error state, and the user sets a --platform that matches the manual platform
- Old rule:
  - Fall back to the next active node that supports the platform.
- New rule:
  - Fall back to the next active node that also has the same manual platform. Do not fall back to a detected platform. Error if there is no node to fall back to.

Multi-node cluster where the first node has a manual platform set and the user does not set any --platform
- Old rule:
  - The default platform is the native platform of the first node that has not errored.
- New rule:
  - The default platform is always the first manual platform of the first node, even if that node is in an error state.
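The two new rules can be sketched roughly as follows. This is one possible reading of the proposal, not the buildx implementation: `Node`, `resolve`, `defaultPlatform`, and `errUnreachable` are hypothetical names, platforms are plain strings, and treating the first detected platform as the native one is an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

// Node is a simplified, hypothetical stand-in for a builder node.
type Node struct {
	Name     string
	Manual   []string // platforms set manually by the user
	Detected []string // platforms reported by the node (assumed: first is native)
	Err      error    // non-nil if the node is in an error state
}

// errUnreachable is a placeholder error for a node that is down.
var errUnreachable = errors.New("connection refused")

func contains(list []string, p string) bool {
	for _, v := range list {
		if v == p {
			return true
		}
	}
	return false
}

// resolve sketches the first new rule. If any node lists the platform
// manually, only nodes with that manual platform are candidates, and
// detected platforms are never used as a fallback.
func resolve(nodes []Node, platform string) (string, error) {
	manualMatch := false
	for _, n := range nodes {
		if contains(n.Manual, platform) {
			manualMatch = true
			if n.Err == nil {
				return n.Name, nil
			}
		}
	}
	if manualMatch {
		return "", fmt.Errorf("no available node with manual platform %s", platform)
	}
	// No node claims the platform manually, so detected platforms may be used.
	for _, n := range nodes {
		if n.Err == nil && contains(n.Detected, platform) {
			return n.Name, nil
		}
	}
	return "", fmt.Errorf("no node supports platform %s", platform)
}

// defaultPlatform sketches the second new rule. The first manual platform
// of the first node wins even if that node is in an error state.
func defaultPlatform(nodes []Node) (string, bool) {
	if len(nodes) > 0 && len(nodes[0].Manual) > 0 {
		return nodes[0].Manual[0], true
	}
	// Old rule as the fallback: native platform of the first healthy node.
	for _, n := range nodes {
		if n.Err == nil && len(n.Detected) > 0 {
			return n.Detected[0], true
		}
	}
	return "", false
}

func main() {
	nodes := []Node{
		{Name: "amd0", Manual: []string{"linux/amd64"}, Err: errUnreachable},
		{Name: "arm0", Manual: []string{"linux/arm64"}, Detected: []string{"linux/arm64", "linux/amd64"}},
	}
	_, err := resolve(nodes, "linux/amd64")
	fmt.Println("resolve amd64 error:", err)
	p, _ := defaultPlatform(nodes)
	fmt.Println("default platform:", p)
}
```

In this example the amd64 request fails instead of landing on arm0 through its detected (emulated) amd64 support, and the default platform stays linux/amd64 although amd0 is down.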

Nov 17 '23 00:11 tonistiigi

@nicks - did we still want to get this addressed?

Jul 11 '24 16:07 thompson-shaun