refactor(facade): use model config service in provisioner facade
Move agent/provisioner facade over to using model config and info services, instead of pulling model config/info from state.
In `service.go`, I've added new interfaces `ModelConfigService` and `ModelInfoService`, and moved the existing service interfaces (`ControllerConfigService`, `StoragePoolGetter`, `NetworkService`) into this file. All accesses of model config now go through the model config service, and model information (such as model type, UUID and name) is retrieved via the model info service.
The tests now set controller/model config keys via the service factory provided by the ApiServerSuite, instead of using the ControllerConfigAttrs and ControllerModelConfigAttrs primitives.
Drive-bys:
- In `provisioner.go`, remove the duplicate `context` import and rename `stdcontext` to `context`.
- Be safer when casting the value in `Config.SSLHostnameVerification` to avoid panics.
## Checklist
- [x] Code style: imports ordered, good names, simple structure, etc
- ~~[ ] Comments saying why design decisions were made~~
- [ ] Go unit tests, with comments saying what you're testing
- ~~[ ] Integration tests, with comments saying what you're testing~~
- ~~[ ] doc.go added or updated in changed packages~~
## QA steps
TODO
## Links
Jira card: JUJU-6056
This facade is used in the containerbroker and provisioner (container init) workers. I'd QA it by adding some container machines and deploying into them, something like:

```shell
juju add-machine lxd
juju deploy --to lxd:0
```
You'll also have to update the facade's tests; they are failing (at least) with a nil pointer dereference, since you are using the model config service without injecting it in the tests.
There are a few other things to touch up in this file as we go along.
`provisioner.go`:

```go
func (api *ProvisionerAPI) ModelUUID(ctx stdcontext.Context) params.StringResult {
	return params.StringResult{Result: api.st.ModelUUID()}
}
```
Can we please change this over to use the model info service (https://github.com/juju/juju/blob/main/domain/model/service/modelservice.go).
I see a potential complication that this facade is used in multiple places.
For example, the ContainerManagerConfig method is used by the provisioner worker and the containerbroker worker. Ideally we'd want to avoid copy-pasting this method - so would we make one worker depend on the other, or put this code in a "common" package that both could import, or something else?
The ContainerConfig method is used by the LXD container broker. Is it (or will it be) possible to supply the model/controller config services to this broker?
And I think `ProvisioningInfo` still depends too much on state to be moved into a worker.
The strategy is to push logic from the facades into the service layer so that to the greatest extent possible, facades only deal with:
- auth
- encoding/decoding wire types
- dispatch to the services
With this strategy there is no issue with copy/pasting or duplication, because both sites ultimately invoke the same service method, with the API facade being... a facade.
You won't be able to use services from the container broker directly, as it resides on leaf agents.
I see that in this case, the service methods are called inside the facade methods, rather than by the agent directly, so what I said about duplicated effort isn't really a concern right now. I will review this today.
Add me as a reviewer on GitHub, and to the Jira card. The Jira card should be linked in the description, not just noted.
In debug-log, I see the lines:

```
machine-0: 13:09:17 ERROR juju.worker.dependency "broker-tracker" manifold worker returned unexpected error: no container types determined
machine-0: 13:09:17 ERROR juju.worker.dependency "lxd-container-provisioner" manifold worker returned unexpected error: container types not yet available
```
Both of these error messages can be traced back to the `SupportedContainers` method; it seems I've broken that somehow.
EDIT: the facade method appears to be working fine. However, the `machine.SupportedContainers` value is not being set in state, as it should be.
EDIT2: the `lxd-container-provisioner` line is also observed on main; what's new with this patch is the broker-tracker shutting down:

```
machine-1: 13:56:16 DEBUG juju.worker.dependency "broker-tracker" manifold worker stopped: no container types determined
```
stack trace:

```
github.com/juju/juju/internal/worker/containerbroker.NewTracker:112: no container types determined
github.com/juju/juju/cmd/jujud-controller/agent/machine.IAASManifolds.Manifold.func7:82:
github.com/juju/juju/internal/worker/fortress.Occupy:63:
github.com/juju/juju/agent/engine.Housing.Decorate.occupyStart.func1:95:
...
machine-1: 13:56:16 ERROR juju.worker.dependency "broker-tracker" manifold worker returned unexpected error: no container types determined
```
Possibly this is what's sinking the machine.
I don't think that explains the loss of the agent. Remember, this works on HEAD.
> I don't think that explains the loss of the agent. Remember, this works on HEAD.
So I have spent a lot of time diagnosing this. TLDR: The root of the problem is that we are not auto-configuring "container-networking-method" in model config.
The long story: The EC2 instance loses network connectivity. Diagnosing this was a right pain: I had to add the machine first, connect to it over ssh, set a password for the ubuntu user, then connect via the EC2 serial console and log in with that password.
Why didn't it have network connectivity after the lxd container began provisioning?
The reason the EC2 instance lost connectivity is that a bridge interface was being created with a random MAC address (replacing the original interface). This meant the VPC DHCP server was unable to respond to the unknown MAC address and allocate addresses to it.
A solution here was to patch `internal/network/netplan/netplan.go` to use the MAC address of the device matched by the original interface.
Juju was never meant to be creating a bridge interface on EC2 (because AWS doesn't, at least out of the box, support DHCP for this). This led me to the `container-networking-method` model config value as the root cause of this failure; setting it to `local` after creating the model solves the issue.
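The manual workaround is a one-liner against the model once it exists. `container-networking-method` is an existing model-config key; the `local` value makes containers use a local bridge on the machine rather than bridging the host NIC:

```shell
# Avoid bridging the host NIC on EC2: have containers use a local bridge.
juju model-config container-networking-method=local
```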
As for bridge networking, it probably makes sense to do something to the effect of this patch, as other environments might have similar issues.
`internal/network/netplan/netplan.go` patch:

```diff
 // BridgeEthernetById takes a deviceId and creates a bridge with this device
 // using this devices config
 func (np *Netplan) BridgeEthernetById(deviceId string, bridgeName string) (err error) {
 	ethernet, ok := np.Network.Ethernets[deviceId]
 	if !ok {
 		return errors.NotFoundf("ethernet device with id %q for bridge %q", deviceId, bridgeName)
 	}
 	shouldCreate, err := np.shouldCreateBridge(deviceId, bridgeName)
 	if !shouldCreate {
 		// err may be nil, but we shouldn't continue creating
 		return errors.Trace(err)
 	}
-	np.createBridgeFromInterface(bridgeName, deviceId, &ethernet.Interface)
+	// If the ethernet device was being matched on mac but did not specify a mac address for
+	// the interface, we need to copy the matched mac address to the bridge interface macaddress
+	// field, otherwise on AWS the bridge interface will not get addresses from dhcp.
+	ethInf := ethernet.Interface
+	macaddressSet := false
+	if mac, ok := ethernet.Match["macaddress"]; ok && ethInf.MACAddress == "" {
+		macaddressSet = true
+		ethInf.MACAddress = mac
+	}
+	np.createBridgeFromInterface(bridgeName, deviceId, &ethInf)
+	if macaddressSet {
+		ethInf.MACAddress = ""
+	}
+	ethernet.Interface = ethInf
 	np.Network.Ethernets[deviceId] = ethernet
 	return nil
 }
```
Before the patch:

```
ubuntu@ip-172-31-54-202:/etc/netplan$ ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: enp39s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc mq master br-enp39s0 state UP group default qlen 1000
    link/ether 02:11:27:5e:dd:a7 brd ff:ff:ff:ff:ff:ff
3: br-enp39s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc noqueue state UP group default qlen 1000
    link/ether 52:a0:79:ef:19:39 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::50a0:79ff:feef:1939/64 scope link
       valid_lft forever preferred_lft forever
```
After the patch:

```
ubuntu@ip-172-31-54-202:/etc/netplan$ ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: enp39s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc mq master br-enp39s0 state UP group default qlen 1000
    link/ether 02:11:27:5e:dd:a7 brd ff:ff:ff:ff:ff:ff
3: br-enp39s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc noqueue state UP group default qlen 1000
    link/ether 02:11:27:5e:dd:a7 brd ff:ff:ff:ff:ff:ff
    inet 172.31.54.202/20 metric 100 brd 172.31.63.255 scope global dynamic br-enp39s0
       valid_lft 3597sec preferred_lft 3597sec
    inet6 fe80::50a0:79ff:feef:1939/64 scope link
       valid_lft forever preferred_lft forever
```
@manadart perhaps we force `container-networking-method` to `local` until we get a chance to reimplement the auto configuration?

> @manadart perhaps we force `container-networking-method` to `local` until we get a chance to reimplement the auto configuration?
I will look at this during the current pulse.
@manadart we have already re-implemented auto config for container networking. Check out #17574.
/merge