
refactor(facade): use model config service in provisioner facade

Open barrettj12 opened this issue 1 year ago • 8 comments

Move agent/provisioner facade over to using model config and info services, instead of pulling model config/info from state.

In service.go, I've added new interfaces ModelConfigService and ModelInfoService, and also moved the existing services into this file (ControllerConfigService, StoragePoolGetter, NetworkService). All accesses of model config now go through the model config service, and model information (such as model type, UUID, and name) is retrieved via the model info service.

The tests now set controller/model config keys via the service factory provided by the ApiServerSuite, instead of using the ControllerConfigAttrs and ControllerModelConfigAttrs primitives.

Drive-bys:

  • In provisioner.go, remove the duplicate context import and rename stdcontext to context.
  • Be safer when casting the value in Config.SSLHostnameVerification to avoid panics.

Checklist

  • [x] Code style: imports ordered, good names, simple structure, etc
  • ~[ ] Comments saying why design decisions were made~
  • [ ] Go unit tests, with comments saying what you're testing
  • ~[ ] Integration tests, with comments saying what you're testing~
  • ~[ ] doc.go added or updated in changed packages~

QA steps

TODO

Links

Jira card: JUJU-6056

barrettj12 avatar May 24 '24 10:05 barrettj12

This facade is used in the containerbroker and provisioner (container init) workers. I'd QA it by adding some container machines and deploying into them, something like:

juju add-machine lxd
juju deploy --to lxd:0

You also have to update the facade's tests: they are failing (at least) with a nil pointer dereference, since you are using the model config service but not injecting it in the tests.

nvinuesa avatar May 27 '24 08:05 nvinuesa

There are a few other things to touch up in this file as we go along.

provisioner.go

func (api *ProvisionerAPI) ModelUUID(ctx stdcontext.Context) params.StringResult {
	return params.StringResult{Result: api.st.ModelUUID()}
}

Can we please change this over to use the model info service (https://github.com/juju/juju/blob/main/domain/model/service/modelservice.go).

tlm avatar May 28 '24 00:05 tlm

I see a potential complication that this facade is used in multiple places.

For example, the ContainerManagerConfig method is used by the provisioner worker and the containerbroker worker. Ideally we'd want to avoid copy-pasting this method - so would we make one worker depend on the other, or put this code in a "common" package that both could import, or something else?

The ContainerConfig method is used by the LXD container broker. Is it (or will it be) possible to supply the model/controller config services to this broker?

And I think ProvisioningInfo still depends too much on state to be moved into a worker.

barrettj12 avatar Jun 07 '24 04:06 barrettj12

The strategy is to push logic from the facades into the service layer so that to the greatest extent possible, facades only deal with:

  • auth
  • encoding/decoding wire types
  • dispatch to the services

With this strategy there is no issue with copy/pasting or duplication, because both sites ultimately invoke the same service method, with the API facade being... a facade.

You won't be able to use services from the container broker directly, as it resides on leaf agents.

I see that in this case, the service methods are called inside the facade methods, rather than by the agent directly, so what I said about duplicated effort isn't really a concern right now. I will review this today.

Add me as a reviewer on GitHub, and to the Jira card. The Jira card should be linked in the description, not just noted.

manadart avatar Jun 07 '24 04:06 manadart

In debug-log, I see the lines:

machine-0: 13:09:17 ERROR juju.worker.dependency "broker-tracker" manifold worker returned unexpected error: no container types determined
machine-0: 13:09:17 ERROR juju.worker.dependency "lxd-container-provisioner" manifold worker returned unexpected error: container types not yet available

Both of these error messages can be traced back to the SupportedContainers method, it seems I've broken that somehow.

EDIT: the facade method appears to be working fine. However, the machine.SupportedContainers value is not being set in state, as it should be.

EDIT2: the lxd-container-provisioner line is also observed on main, however what's new with this patch is the broker-tracker shutting down:

machine-1: 13:56:16 DEBUG juju.worker.dependency "broker-tracker" manifold worker stopped: no container types determined
stack trace:
github.com/juju/juju/internal/worker/containerbroker.NewTracker:112: no container types determined
github.com/juju/juju/cmd/jujud-controller/agent/machine.IAASManifolds.Manifold.func7:82: 
github.com/juju/juju/internal/worker/fortress.Occupy:63: 
github.com/juju/juju/agent/engine.Housing.Decorate.occupyStart.func1:95: 
...
machine-1: 13:56:16 ERROR juju.worker.dependency "broker-tracker" manifold worker returned unexpected error: no container types determined

Possibly this is what's sinking the machine.

barrettj12 avatar Jun 10 '24 01:06 barrettj12

I don't think that explains the loss of the agent. Remember, this works on HEAD.

manadart avatar Jun 10 '24 13:06 manadart

> I don't think that explains the loss of the agent. Remember, this works on HEAD.

So I have spent a lot of time diagnosing this. TLDR: The root of the problem is that we are not auto-configuring "container-networking-method" in model config.

The long story: the EC2 instance loses network connectivity. Diagnosing this was a right pain: I had to add the machine first, connect to it over ssh, set a password for the ubuntu user, then connect via the EC2 serial console and log in with that password.

Why didn't it have network connectivity after the lxd container began provisioning? The EC2 instance lost connectivity because a bridge interface was created with a random MAC address (replacing the original interface); the VPC DHCP server was unable to respond to an unknown MAC address and allocate addresses to it. A solution here was to patch internal/network/netplan/netplan.go to give the bridge the MAC address of the device matched by the original interface.

Juju was never meant to be creating a bridge interface on EC2 (because AWS doesn't, at least out of the box, support DHCP for this). This led me to the container-networking-method model config value as the root cause of this failure; setting it to local after creating the model solves the issue.

As for bridge networking, it probably makes sense to do something to the effect of this patch, as other environments might have similar issues.

internal/network/netplan/netplan.go patch:

// BridgeEthernetById takes a deviceId and creates a bridge with this device
// using this device's config
func (np *Netplan) BridgeEthernetById(deviceId string, bridgeName string) (err error) {
	ethernet, ok := np.Network.Ethernets[deviceId]
	if !ok {
		return errors.NotFoundf("ethernet device with id %q for bridge %q", deviceId, bridgeName)
	}
	shouldCreate, err := np.shouldCreateBridge(deviceId, bridgeName)
	if !shouldCreate {
		// err may be nil, but we shouldn't continue creating
		return errors.Trace(err)
	}
-	np.createBridgeFromInterface(bridgeName, deviceId, &ethernet.Interface)
+	// If the ethernet device was being matched on MAC but did not specify a MAC
+	// address for the interface, we need to copy the matched MAC address to the
+	// bridge interface's macaddress field; otherwise on AWS the bridge interface
+	// will not get addresses from DHCP.
+	ethInf := ethernet.Interface
+	macaddressSet := false
+	if mac, ok := ethernet.Match["macaddress"]; ok && ethInf.MACAddress == "" {
+		macaddressSet = true
+		ethInf.MACAddress = mac
+	}
+	np.createBridgeFromInterface(bridgeName, deviceId, &ethInf)
+	if macaddressSet {
+		ethInf.MACAddress = ""
+	}
+	ethernet.Interface = ethInf
	np.Network.Ethernets[deviceId] = ethernet
	return nil
}

Before the patch:

ubuntu@ip-172-31-54-202:/etc/netplan$ ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: enp39s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc mq master br-enp39s0 state UP group default qlen 1000
    link/ether 02:11:27:5e:dd:a7 brd ff:ff:ff:ff:ff:ff
3: br-enp39s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc noqueue state UP group default qlen 1000
    link/ether 52:a0:79:ef:19:39 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::50a0:79ff:feef:1939/64 scope link 
       valid_lft forever preferred_lft forever

After the patch:

ubuntu@ip-172-31-54-202:/etc/netplan$ ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: enp39s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc mq master br-enp39s0 state UP group default qlen 1000
    link/ether 02:11:27:5e:dd:a7 brd ff:ff:ff:ff:ff:ff
3: br-enp39s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc noqueue state UP group default qlen 1000
    link/ether 02:11:27:5e:dd:a7 brd ff:ff:ff:ff:ff:ff
    inet 172.31.54.202/20 metric 100 brd 172.31.63.255 scope global dynamic br-enp39s0
       valid_lft 3597sec preferred_lft 3597sec
    inet6 fe80::50a0:79ff:feef:1939/64 scope link 
       valid_lft forever preferred_lft forever

hpidcock avatar Jun 11 '24 06:06 hpidcock

@manadart perhaps we force container-networking-method to local until we get a chance to reimplement the auto configuration?

hpidcock avatar Jun 11 '24 06:06 hpidcock

> @manadart perhaps we force container-networking-method to local until we get a chance to reimplement the auto configuration?

I will look at this during the current pulse.

manadart avatar Jul 15 '24 14:07 manadart

@manadart we have already re-implemented auto config for container networking. Check out #17574.

tlm avatar Jul 15 '24 22:07 tlm

/merge

barrettj12 avatar Aug 01 '24 20:08 barrettj12