multi-node builder should not fallback to detected platform if node with manual platform errors
Description
Repro steps:
- Create a driver with this spec:
{"Name":"remote","Driver":"remote","Nodes":[{"Name":"remote0","Endpoint":"tcp://localhost:1234","Platforms":[{"architecture":"arm64","os":"linux"}],"Flags":null,"DriverOpts":null,"Files":null}]}
- Send a request to the builder for a different platform
docker buildx build -t out --platform linux/amd64 .
Expected result:
I would expect the build to fail, because there's no available builder that matches the requested platform
Actual result:
buildx sends the request to the ARM builder
Additional info:
The full problem is more complicated, and involves how the Node fallback logic works when one of the builders is unavailable. Basically, we're seeing cases where, if no platform is specified and one of the builders is unavailable, the request ends up landing on an arbitrary Node with an arbitrary Platform.
I think it would be fine if this behavior was opt-in, i.e., there was a driver opt like:
{"Name":"remote","DriverOpts":{"DefaultPlatform": "linux/amd64", "StrictPlatformMatching": true}}
Regarding "when one of the builders is unavailable": in your example you only have a single-node builder. I would expect this to fail if you have a multi-node builder, the platform for a node is manually set to linux/amd64, and that node is unavailable. In that case no fallback to the arm node should happen, and the build should fail because the amd64 node is down.
In all my tests, it silently falls back. I'm not sure I understand all the codepaths. The problem might be this function?
https://github.com/docker/buildx/blob/cb378866587090995fba030bc75a9ca2fb4d8d26/build/build.go#L121
If some of the nodes are in an error state, but there's at least one working Node, then it filters out the bad nodes and "succeeds".
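To make the suspected behavior concrete, here is a minimal sketch of that kind of filtering. It is not the buildx code at the link above; the `node` type, `errUnreachable`, and `filterAvailable` are hypothetical names made up for illustration.

```go
package main

import "errors"

// node is a hypothetical, simplified stand-in for a builder node.
type node struct {
	Name      string
	Platforms []string // manually configured platforms, e.g. "linux/arm64"
	Err       error    // non-nil when the node could not be reached
}

// errUnreachable is a hypothetical error used to mark a node as down.
var errUnreachable = errors.New("node unreachable")

// filterAvailable drops errored nodes and only fails when every node has
// errored. This is the "silent fallback" described above: a single healthy
// node is enough for the call to succeed, whatever platform it has.
func filterAvailable(nodes []node) ([]node, error) {
	var healthy []node
	var firstErr error
	for _, n := range nodes {
		if n.Err != nil {
			if firstErr == nil {
				firstErr = n.Err
			}
			continue
		}
		healthy = append(healthy, n)
	}
	if len(healthy) == 0 {
		if firstErr == nil {
			firstErr = errors.New("no nodes configured")
		}
		return nil, firstErr
	}
	return healthy, nil
}
```

With an unreachable amd64 node and a healthy arm64 node, `filterAvailable` returns just the arm64 node and no error, so the caller never learns that the amd64 node was down.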
That does not look correct indeed. If the node has a manual platform set, then I think that node should always be used, without any fallbacks.
But the problem is that if there is no --platform and the first node is down, then we don't know whether the "manual platform" is the native (default) platform for the node. I guess we could just assume that if any manual platform is set, then the first one set is the native one.
Do you agree that this issue should just be fixed by updating the node resolution logic, without any new keys like the "StrictPlatformMatching" in your first comment?
Ya that works for me
Proposed changes for clarity:
- Multi-node cluster where node has set a manual platform, node is in error state and user sets --platform that matches manual platform
  - Old rule:
    - Fallback to the next active node that supports platform
  - New rule:
    - Fallback to the next active node that also has same manual platform. Do not fallback to detected platform. Error if no node to fallback.
- Multi-node cluster where first node has set a manual platform, user does not set any --platform
  - Old rule:
    - Default platform is the native platform of the first node that has not errored
  - New rule:
    - Default platform is always the first manual platform of first node, even if that node is in error state
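The two new rules could be sketched roughly as follows. This is only an illustration under assumptions, not the buildx implementation: `clusterNode`, `errNodeDown`, `resolveNode`, and `defaultPlatform` are hypothetical names, platforms are plain strings, and the handling of platforms that no node lists manually (falling back to detected platforms) is assumed rather than specified above.

```go
package main

import (
	"errors"
	"fmt"
)

// clusterNode is a hypothetical, simplified view of a builder node.
type clusterNode struct {
	Name              string
	ManualPlatforms   []string // set by the user in the builder config
	DetectedPlatforms []string // reported by the node itself when reachable
	Err               error    // non-nil when the node is in an error state
}

// errNodeDown is a hypothetical error used to mark a node as unavailable.
var errNodeDown = errors.New("node unreachable")

func contains(list []string, p string) bool {
	for _, v := range list {
		if v == p {
			return true
		}
	}
	return false
}

// resolveNode applies the first new rule: if any node lists the platform
// manually, only healthy nodes with that same manual platform qualify, and
// the call errors instead of falling back to a detected platform.
func resolveNode(nodes []clusterNode, platform string) (*clusterNode, error) {
	manuallyClaimed := false
	for i := range nodes {
		if contains(nodes[i].ManualPlatforms, platform) {
			manuallyClaimed = true
			if nodes[i].Err == nil {
				return &nodes[i], nil
			}
		}
	}
	if manuallyClaimed {
		return nil, fmt.Errorf("no available node with manual platform %s", platform)
	}
	// Assumption: platforms nobody set manually still resolve via detection.
	for i := range nodes {
		if nodes[i].Err == nil && contains(nodes[i].DetectedPlatforms, platform) {
			return &nodes[i], nil
		}
	}
	return nil, fmt.Errorf("no node supports platform %s", platform)
}

// defaultPlatform applies the second new rule: the first manual platform of
// the first node wins, even if that node is currently in an error state.
func defaultPlatform(nodes []clusterNode) (string, bool) {
	if len(nodes) > 0 && len(nodes[0].ManualPlatforms) > 0 {
		return nodes[0].ManualPlatforms[0], true
	}
	// Otherwise keep the old rule: native platform of the first healthy node.
	for _, n := range nodes {
		if n.Err == nil && len(n.DetectedPlatforms) > 0 {
			return n.DetectedPlatforms[0], true
		}
	}
	return "", false
}
```

With an unavailable node whose manual platform is linux/amd64 and a healthy arm64 node that also detects linux/amd64 (for example through emulation), `resolveNode(nodes, "linux/amd64")` returns an error rather than the arm64 node, and `defaultPlatform(nodes)` still reports linux/amd64.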
@nicks - did we still want to get this addressed?