spike: UDS Core LFAI infrastructure bundle
LFAI delivery requires a production-ready infrastructure bundle that bootstraps an RKE2 cluster and sets up the networking, auth, CRDs, and policies necessary for the rest of the LFAI application layers to be deployed.
- [x] How do I prepare an air-gapped Ubuntu 20.04 OS for RKE2 installation (STIG, networking, etc.)?
- [x] How do I consistently experiment with this sort of system-level configuration?
- [x] How do I bootstrap a local RKE2 cluster in an air-gapped Ubuntu 22.04 OS?
- [ ] How do I properly install and configure all UDS Core + Istio components into the bootstrapped cluster?
- [x] How do I create a comprehensive bundle with UDS tasks to make the above repeatable? (a minimal task sketch follows this list)
- [x] How should I create a CI workflow to automate the UDS bundling and related testing?
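As a rough sketch of the repeatability piece, assuming the standard uds-cli task-runner schema (task names, bundle file name, and flags here are illustrative placeholders, not the final layout):

```yaml
# tasks.yaml (sketch): wraps bundle creation and deployment so the steps above are repeatable
tasks:
  - name: create-bundle
    description: Create the UDS RKE2 infrastructure bundle from uds-bundle.yaml
    actions:
      - cmd: uds create . --confirm

  - name: deploy-bundle
    description: Deploy the bundle onto the bootstrapped RKE2 cluster
    actions:
      - cmd: uds deploy uds-bundle-uds-rke2-amd64-0.1.0.tar.zst --confirm
```

Locally or in CI this would then just be `uds run create-bundle` followed by `uds run deploy-bundle`.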
Some Defense Unicorns-related resources:
- https://github.com/defenseunicorns/zarf-package-rke2-init
- https://github.com/defenseunicorns/uds-rke2-image-builder
- https://github.com/defenseunicorns/uds-prod-infrastructure
- https://github.com/defenseunicorns/uds-core
Asked for help and clarifications in the thread for this UDS RKE2 resource: https://github.com/defenseunicorns/uds-rke2-image-builder
Resources for GPU passthrough on linux/amd64 (Intel and NVIDIA) with virt-manager + QEMU. Required for a clean Ubuntu Server VM setup during local development and testing.
- https://github.com/bryansteiner/gpu-passthrough-tutorial
- https://youtu.be/KVDUs019IB8?si=QI5OqyeiuhkoRVL-
All work and findings will be consolidated here: https://github.com/justinthelaw/uds-rke2-sandbox
Transfer to an official defenseunicorns organization repository will be done upon completion of the spike and the follow-on feature.
The working branch that needs to be merged and tested in order for this spike to be closed: https://github.com/justinthelaw/uds-rke2-sandbox/tree/rke2-os-configuration-stig
Related draft PR: https://github.com/justinthelaw/uds-rke2-sandbox/pull/2
Potential solution for zarf-init-related issues, if applicable: https://github.com/defenseunicorns/uds-capability-rook-ceph
RKE2 offers a local provider for the storage class, but this is not suitable for our purposes, and a multi-node cluster will break the zarf init. We need to specify a storage class in order to init the Zarf registry: the zarf init package expects a storage class, which it uses to create a persistent volume. The basic flow would be something like this (see the sketch after the list):
- Install rke2.
- Utilize the custom zarf init package with rook-ceph to zarf init the rke2 cluster.
- This creates the storage provider, which can then be used to populate the Zarf registry.
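A minimal sketch of that flow, assuming a custom init package that bundles the rook-ceph capability and a storage class named `rook-ceph-block` (both names are assumptions here):

```bash
# 1. Install and start RKE2 (air-gapped artifacts assumed to already be staged on the node).
systemctl enable --now rke2-server
export KUBECONFIG=/etc/rancher/rke2/rke2.yaml

# 2. Run the custom Zarf init package (the one carrying rook-ceph), telling Zarf which
#    storage class to use for the registry's persistent volume.
zarf init --storage-class rook-ceph-block --confirm
```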
If it is a storage class issue, we should be seeing some weirdness in the zarf pod descriptions or events from the namespaces (PVs failing to bind, pods waiting for claims, or errors that the storage class doesn't exist).
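A few commands that should surface that kind of weirdness if it really is a storage class problem:

```bash
# Unbound PVCs and the storage class they are requesting
kubectl get pvc -n zarf
kubectl describe pvc -n zarf

# Does the expected storage class exist, and which one is the default?
kubectl get storageclass

# Recent events in the zarf namespace (failed binds, missing storage class, etc.)
kubectl get events -n zarf --sort-by=.lastTimestamp
```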
Ahh okay @gscallon, I think I was running into this error just now on uds zarf init. The PVC was not attaching properly. Thanks for providing some more context!
We ran into an issue with containerd/k3s.
If you run zarf init at all, this starts a k3s service which can block containerd from starting properly.
More info on containerd configs here:
- https://www.nocentino.com/posts/2021-12-27-installing-and-configuring-containerd-as-a-kubernetes-container-runtime/
- https://github.com/containerd/cri/blob/master/docs/config.md
Utilizing the custom config provided by other unicorns caused some issues...
We were able to get RKE2 spun up, but the provided config is not working.
Here is some information:
Kernel Version: 5.4.0-176-generic
OS Image: Ubuntu 20.04.6 LTS
Operating System: linux
Architecture: amd64
Container Runtime Version: containerd://1.7.11-k3s2
root@jlaw-server:/home/jlaw# k describe po/zarf-docker-registry-59c964db5c-prgbs -n zarf
NOTE Using config file
Name: zarf-docker-registry-59c964db5c-prgbs
Namespace: zarf
Priority: 2000001000
Priority Class Name: system-node-critical
Service Account: default
Node: jlaw-server/192.168.122.77
Start Time: Mon, 08 Apr 2024 20:10:31 +0000
Labels: app=docker-registry
pod-template-hash=59c964db5c
release=zarf-docker-registry
zarf.dev/agent=ignore
Annotations: checksum/secret: 8ee990088bcf5417ed5717aa58b14678c374e29170461bb099d719ecc1a320ad
cni.projectcalico.org/containerID: 453c8f8f76353eda719b791d6abf4f59f66a49828a9b5261aae854b2372834dc
cni.projectcalico.org/podIP: 10.42.0.10/32
cni.projectcalico.org/podIPs: 10.42.0.10/32
Status: Pending
IP: 10.42.0.10
IPs:
IP: 10.42.0.10
Controlled By: ReplicaSet/zarf-docker-registry-59c964db5c
Containers:
docker-registry:
Container ID:
Image: 127.0.0.1:31109/library/registry:2.8.3
Image ID:
Port: 5000/TCP
Host Port: 0/TCP
Command:
/bin/registry
serve
/etc/docker/registry/config.yml
State: Waiting
Reason: ImagePullBackOff
Ready: False
Restart Count: 0
Limits:
cpu: 3
memory: 2Gi
Requests:
cpu: 100m
memory: 256Mi
Liveness: http-get http://:5000/ delay=0s timeout=1s period=10s #success=1 #failure=3
Readiness: http-get http://:5000/ delay=0s timeout=1s period=10s #success=1 #failure=3
Environment:
REGISTRY_AUTH: htpasswd
REGISTRY_AUTH_HTPASSWD_REALM: Registry Realm
REGISTRY_AUTH_HTPASSWD_PATH: /etc/docker/registry/htpasswd
REGISTRY_STORAGE_DELETE_ENABLED: true
REGISTRY_STORAGE_FILESYSTEM_ROOTDIRECTORY: /var/lib/registry
Mounts:
/etc/docker/registry from config (rw)
/var/lib/registry/ from data (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-cc24w (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
config:
Type: Secret (a volume populated by a Secret)
SecretName: zarf-docker-registry-secret
Optional: false
data:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: 20Gi
kube-api-access-cc24w:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 64s default-scheduler Successfully assigned zarf/zarf-docker-registry-59c964db5c-prgbs to jlaw-server
Warning Failed 30s (x2 over 54s) kubelet Failed to pull image "127.0.0.1:31109/library/registry:2.8.3": failed to pull and unpack image "127.0.0.1:31109/library/registry:2.8.3": failed to resolve reference "127.0.0.1:31109/library/registry:2.8.3": failed to do request: Head "https://127.0.0.1:31109/v2/library/registry/manifests/2.8.3": net/http: TLS handshake timeout
Warning Failed 30s (x2 over 54s) kubelet Error: ErrImagePull
Normal BackOff 15s (x2 over 53s) kubelet Back-off pulling image "127.0.0.1:31109/library/registry:2.8.3"
Warning Failed 15s (x2 over 53s) kubelet Error: ImagePullBackOff
Normal Pulling 3s (x3 over 64s) kubelet Pulling image "127.0.0.1:31109/library/registry:2.8.3"
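The `TLS handshake timeout` against `https://127.0.0.1:31109` suggests containerd is defaulting to TLS for the NodePort registry. One avenue to try (an assumption, not a confirmed fix) is telling RKE2's containerd to treat that endpoint as plain HTTP via `/etc/rancher/rke2/registries.yaml`, then restarting `rke2-server` so it regenerates its containerd config:

```yaml
# /etc/rancher/rke2/registries.yaml (sketch; the port comes from the error above)
mirrors:
  "127.0.0.1:31109":
    endpoint:
      - "http://127.0.0.1:31109"
configs:
  "127.0.0.1:31109":
    tls:
      insecure_skip_verify: true
```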
In the startup script, we should do the following (a rough sketch follows the list):
- [ ] check whether k3s is running / taking up a socket / etc. Do a fresh install as needed
- [ ] check whether containerd is running. Do a fresh install as needed
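A rough sketch of those checks (lighter-weight than a full reinstall; assumes the stock `k3s` unit name and RKE2's embedded containerd):

```bash
#!/usr/bin/env bash
set -euo pipefail

# A leftover k3s service can hold the sockets/ports that containerd needs,
# so stop and disable it before bringing up RKE2.
if systemctl is-active --quiet k3s; then
  echo "k3s is running; disabling it so it cannot block containerd"
  systemctl disable --now k3s
fi

# RKE2 manages its own embedded containerd; if the server service is not
# healthy, restart it (or fall back to a fresh install).
if ! systemctl is-active --quiet rke2-server; then
  echo "rke2-server is not active; restarting"
  systemctl restart rke2-server
fi
```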
Made progress and items are coming up for the rook-ceph cluster:
```
root@jlaw-server:~# k --namespace rook-ceph get cephcluster
NOTE Using config file
NAME DATADIRHOSTPATH MONCOUNT AGE PHASE MESSAGE HEALTH EXTERNAL FSID
rook-ceph /var/lib/rook 3 7m31s Progressing failed to perform validation before cluster creation: cannot start 3 mons on 1 node(s) when allowMultiplePerNode is false
```
We were able to get around the above issue by editing the upstream manifest in order to reduce the number of mons from 3 to 1.
This likely isn't a great long-term fix, and we should consider the following options:
- [ ] edit the upstream manifest for our deployment to allow multiple mons to run on one node, or else make an HA version of this deployment that deploys to multiple nodes (see the sketch after this list)
- [ ] edit the number of mgrs, or else run them in a way that they don't conflict with each other (blocking each other on the same ports)
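For the single-node case, the relevant knobs live on the CephCluster resource; a minimal sketch (field names are from the Rook CRD, values are what we would likely start with):

```yaml
# CephCluster spec excerpt (single-node sketch)
spec:
  mon:
    count: 1                    # one mon instead of the default 3 on a single node
    allowMultiplePerNode: true  # or keep count > 1 and allow co-located mons
  mgr:
    count: 1                    # avoid two mgrs competing for the same host ports
```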
QUESTIONS TO ANSWER:
- what are the mons and mgrs? what's the best practice for setup? what will we need resource-wise?
- what does this deployment look like for HA?
We ran into issues with the blockpools that were being created in the cluster by rook-ceph.
Ceph research / info:
- OSD - Object Storage Daemon - stores data and handles data replication, recovery, etc.
- mon - Ceph Monitor - maintains maps of the cluster state. They're also responsible for managing authentication between daemons and clients.
- mgr - Ceph Managers - a Ceph Manager daemon is responsible for keeping track of runtime metrics and the current state of the Ceph cluster (including things like storage utilization, current performance metrics, system load, etc.)
- MDS - Ceph Metadata Server - stores metadata for the Ceph file system
What are pools? Pools are logical partitions used to store objects.
Best practices (a quick way to check these on a live cluster follows this list):
- have a ceph manager for each monitor (although not necessary)
- at least 3 monitors are normally required for redundancy and high availability
- at least 2 managers are normally required for high availability
- at least 3 OSDs are normally required for redundancy and high availability
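Once the cluster is up, the actual mon/mgr/OSD counts and overall health can be checked from the Rook toolbox (assuming the toolbox deployment is enabled in our package):

```bash
# Overall cluster health, mon quorum, mgr status, and OSD count
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph status

# Pool list and how data is distributed across OSDs
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd pool ls
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd tree
```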
Items for updating the Zarf package used for the sandbox env:
- [ ] add rook-ceph capability Zarf package
- [ ] modify the rook-ceph capability charts to support a single-node setup (extra: leave the possibility open for an alternative HA configuration?). This includes the following changes:
- [ ] reduce the number of monitors
EDIT (@justinthelaw): I created a GH issue on the original rook-ceph zarf-init capability that we are using.
Rook-Ceph init + MinIO is progressing, but the PVC is not being created, possibly due to some Ceph cluster warnings (diagnostic commands follow the status below):
```yaml
status:
  ceph:
    capacity: {}
    details:
      MDS_SLOW_METADATA_IO:
        message: 1 MDSs report slow metadata IOs
        severity: HEALTH_WARN
      PG_AVAILABILITY:
        message: 'Reduced data availability: 60 pgs inactive'
        severity: HEALTH_WARN
      TOO_FEW_OSDS:
        message: OSD count 0 < osd_pool_default_size 1
        severity: HEALTH_WARN
    fsid: a937d71c-eb6f-44ff-921e-80be147d4341
    health: HEALTH_WARN
    lastChanged: "2024-04-23T18:38:01Z"
    lastChecked: "2024-04-23T20:21:48Z"
    previousHealth: HEALTH_OK
```
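`TOO_FEW_OSDS` with `OSD count 0` means no OSDs ever came up, which would explain the PVC never binding. Some commands to narrow that down:

```bash
# Did the OSD prepare jobs run, and what did they decide about the available disks?
kubectl -n rook-ceph get pods -l app=rook-ceph-osd-prepare
kubectl -n rook-ceph logs -l app=rook-ceph-osd-prepare --tail=100

# The operator log usually states why a device was skipped (in use, partitioned, too small, etc.)
kubectl -n rook-ceph logs deploy/rook-ceph-operator | grep -i osd | tail -n 50
```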
Automation of the entire deployment has been modularized for easier continued development and more robust air-gapping procedures in the future.
UDS Core tasks are now being used to automate workflows and local development, and remote task execution for creating/deploying UDS Core works properly.
Remaining question(s) to ask:
- Do we even need MinIO if we have Rook-Ceph?
- What does subdomain mapping look like on the RKE2 server node's host?
New WIP PR to track for completion of this ticket: https://github.com/justinthelaw/uds-rke2/pull/7
New WIP PR to come (@gscallon) for the Longhorn + MinIO version of the Zarf Init.
- [ ] Rook-Ceph Zarf Init working for single-node
- [x] Skeletons for Longhorn or Local Path Provisioner, with MinIO, Zarf Init
- [x] Refactored Zarf packages for air-gapping
- [x] Refactored repo and charts structure for better alignment with UDS, Zarf, and Helm patterns
- [x] Added more docs for general usage, storage layer bundle "flavors", and VM creation
After discussion with Jon Andrew (Defense Unicorns, HNCD), we have gone full-ahead with Local Path Provisioner: https://github.com/justinthelaw/uds-rke2/pull/18
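For reference, the handoff with Local Path Provisioner stays simple once the provisioner is deployed; a minimal sketch, assuming the storage class keeps its upstream name `local-path`:

```bash
# Mark local-path as the default storage class once the provisioner is deployed
kubectl annotate storageclass local-path \
  storageclass.kubernetes.io/is-default-class="true" --overwrite

# Point the Zarf registry's persistent volume at it during init
zarf init --storage-class local-path --confirm
```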
Additionally, the release, test, and general workflow processes have been significantly improved with release-please and UDS tasks (uds-cli).
New issue created: https://github.com/justinthelaw/uds-rke2/issues/23
This closes out the base UDS RKE2 infrastructure. UDS bundling, overrides, and *.uds.dev (etc.) ingress are being worked through the above issue.
Release of uds-rke2 0.4.2 (https://github.com/justinthelaw/uds-rke2/releases) completes this spike in its entirety.