
spike: UDS Core LFAI infrastructure bundle

Open justinthelaw opened this issue 1 year ago • 19 comments

LFAI delivery requires a production-ready infrastructure bundle that bootstraps an RKE2 cluster and sets up the networking, auth, CRDs, and policies necessary for the rest of the LFAI application layers to be deployed.

  • [x] How do I prepare an air-gapped Ubuntu 20.04 OS for RKE2 installation (STIG, networking, etc.)?
    • [x] How do I consistently experiment with this sort of system-level configuration?
  • [x] How do I bootstrap a local RKE2 cluster in an air-gapped Ubuntu 22.04 OS?
  • [ ] How do I properly install and configure all UDS Core + Istio components into the bootstrapped cluster?
  • [x] How do I create a comprehensive bundle with UDS tasks to make the above repeatable? (a rough CLI sketch follows this list)
    • [x] How should I create a CI workflow to automate the UDS bundling and related testing?
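
Not part of the original checklist, but as a rough, hypothetical sketch of what the "repeatable via UDS tasks" flow could look like from the CLI (the task name is a placeholder, not something defined in this spike):

  # Build the bundle from a local uds-bundle.yaml definition.
  uds create . --confirm

  # Deploy the resulting bundle archive into the bootstrapped RKE2 cluster.
  uds deploy uds-bundle-*.tar.zst --confirm

  # Run a named task from tasks.yaml; "deploy-infrastructure" is illustrative only.
  uds run deploy-infrastructure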

justinthelaw avatar Mar 25 '24 20:03 justinthelaw

Some Defense Unicorns related resources:

  • https://github.com/defenseunicorns/zarf-package-rke2-init
  • https://github.com/defenseunicorns/uds-rke2-image-builder
  • https://github.com/defenseunicorns/uds-prod-infrastructure
  • https://github.com/defenseunicorns/uds-core

justinthelaw avatar Mar 25 '24 21:03 justinthelaw

Asked for help and clarification via this thread on the UDS RKE2 resource: https://github.com/defenseunicorns/uds-rke2-image-builder

justinthelaw avatar Mar 26 '24 15:03 justinthelaw

Resources on GPU passthrough for linux/amd64 (Intel and NVIDIA) using virt-manager + QEMU. Required for a clean Ubuntu server VM setup during local development and testing. A quick host-side IOMMU check follows the links below.

  • https://github.com/bryansteiner/gpu-passthrough-tutorial
  • https://youtu.be/KVDUs019IB8?si=QI5OqyeiuhkoRVL-
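
A minimal host-side sanity check before attempting passthrough in virt-manager + QEMU (this sketch assumes an Intel CPU; AMD hosts would check amd_iommu instead):

  # Confirm IOMMU was enabled on the kernel command line (e.g. intel_iommu=on).
  grep -i iommu /proc/cmdline

  # Check that the kernel actually initialized the IOMMU/DMAR subsystem.
  sudo dmesg | grep -i -e DMAR -e IOMMU

  # List IOMMU groups; the GPU and its audio function should sit in their own group.
  for g in /sys/kernel/iommu_groups/*/devices/*; do
    echo "group $(basename "$(dirname "$(dirname "$g")")"): $(lspci -nns "$(basename "$g")")"
  done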

justinthelaw avatar Mar 27 '24 15:03 justinthelaw

All work and findings will be consolidated here: https://github.com/justinthelaw/uds-rke2-sandbox

Transfer to an official defenseunicorns organization repository will be done upon completion of the spike and the follow-on feature.

justinthelaw avatar Apr 01 '24 18:04 justinthelaw

The working branch that needs to be merged and tested in order for this spike to be closed: https://github.com/justinthelaw/uds-rke2-sandbox/tree/rke2-os-configuration-stig

Related draft PR: https://github.com/justinthelaw/uds-rke2-sandbox/pull/2

justinthelaw avatar Apr 01 '24 22:04 justinthelaw

https://github.com/defenseunicorns/uds-capability-rook-ceph

gscallon avatar Apr 04 '24 17:04 gscallon

Potential solution for zarf-init-related issues, if applicable:

rke2 offers a local provider for the storage class, but this is not suitable for our purposes: a multi-node cluster will break the zarf init. We need to specify a storage class in order to init the zarf registry, because the zarf init package expects a storage class that it uses to create a persistent volume. The basic flow would be something like this (a rough command sketch follows the list):

  1. Install rke2.
  2. Utilize the custom zarf init package with rook-ceph to zarf init the rke2 cluster.
  3. This creates the storage provider, which can be used in the population of the zarf registry
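
A rough command sketch of that flow; the component and storage class names here are assumptions based on the rook-ceph capability, not verified values:

  # 1. Install RKE2 on the host (an air-gapped install would use the tarball method instead).
  curl -sfL https://get.rke2.io | sudo sh -
  sudo systemctl enable --now rke2-server

  # 2. Run the custom zarf init package that bundles rook-ceph, pointing the seed registry
  #    at the Ceph-backed storage class so its persistent volume can actually bind.
  zarf init --components=rook-ceph --storage-class=ceph-block --confirm

  # 3. Confirm the registry PVC bound against the new storage class.
  kubectl -n zarf get pvc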

If it is a storage class issue, we should be seeing some weirdness in the zarf pod descriptions or events from the namespaces (PVs failing to bind, pods waiting for claims, or errors that the storage class doesn't exist).

gscallon avatar Apr 04 '24 17:04 gscallon

Ahh okay @gscallon I think I was running into this error just now on uds zarf init. The PVC was not attaching properly. Thanks for providing some more context!

justinthelaw avatar Apr 04 '24 17:04 justinthelaw

We ran into an issue with containerd/k3s.

If you run zarf init at all, this starts a k3s service which can block containerd from starting properly.

More info on containerd configs here:

  • https://www.nocentino.com/posts/2021-12-27-installing-and-configuring-containerd-as-a-kubernetes-container-runtime/
  • https://github.com/containerd/cri/blob/master/docs/config.md
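
A quick diagnostic sketch for that situation (service and socket names assume default k3s and RKE2 installs):

  # See whether a stray k3s service is running alongside RKE2.
  systemctl status k3s rke2-server --no-pager

  # Check which processes own containerd sockets on the host.
  sudo ss -xlpn | grep -E 'containerd|k3s'

  # RKE2 runs its own embedded containerd; its socket lives under /run/k3s/containerd.
  ls -l /run/k3s/containerd/containerd.sock /run/containerd/containerd.sock 2>/dev/null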

gscallon avatar Apr 08 '24 18:04 gscallon

Utilizing the custom config provided by other unicorns caused some issues...

We were able to get RKE2 spun up, but the provided config is not working.

Here is some information:

  Kernel Version:             5.4.0-176-generic
  OS Image:                   Ubuntu 20.04.6 LTS
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  containerd://1.7.11-k3s2

gscallon avatar Apr 08 '24 20:04 gscallon

root@jlaw-server:/home/jlaw# k describe po/zarf-docker-registry-59c964db5c-prgbs -n zarf

 NOTE  Using config file
Name:                 zarf-docker-registry-59c964db5c-prgbs
Namespace:            zarf
Priority:             2000001000
Priority Class Name:  system-node-critical
Service Account:      default
Node:                 jlaw-server/192.168.122.77
Start Time:           Mon, 08 Apr 2024 20:10:31 +0000
Labels:               app=docker-registry
                      pod-template-hash=59c964db5c
                      release=zarf-docker-registry
                      zarf.dev/agent=ignore
Annotations:          checksum/secret: 8ee990088bcf5417ed5717aa58b14678c374e29170461bb099d719ecc1a320ad
                      cni.projectcalico.org/containerID: 453c8f8f76353eda719b791d6abf4f59f66a49828a9b5261aae854b2372834dc
                      cni.projectcalico.org/podIP: 10.42.0.10/32
                      cni.projectcalico.org/podIPs: 10.42.0.10/32
Status:               Pending
IP:                   10.42.0.10
IPs:
  IP:           10.42.0.10
Controlled By:  ReplicaSet/zarf-docker-registry-59c964db5c
Containers:
  docker-registry:
    Container ID:
    Image:         127.0.0.1:31109/library/registry:2.8.3
    Image ID:
    Port:          5000/TCP
    Host Port:     0/TCP
    Command:
      /bin/registry
      serve
      /etc/docker/registry/config.yml
    State:          Waiting
      Reason:       ImagePullBackOff
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     3
      memory:  2Gi
    Requests:
      cpu:      100m
      memory:   256Mi
    Liveness:   http-get http://:5000/ delay=0s timeout=1s period=10s #success=1 #failure=3
    Readiness:  http-get http://:5000/ delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:
      REGISTRY_AUTH:                              htpasswd
      REGISTRY_AUTH_HTPASSWD_REALM:               Registry Realm
      REGISTRY_AUTH_HTPASSWD_PATH:                /etc/docker/registry/htpasswd
      REGISTRY_STORAGE_DELETE_ENABLED:            true
      REGISTRY_STORAGE_FILESYSTEM_ROOTDIRECTORY:  /var/lib/registry
    Mounts:
      /etc/docker/registry from config (rw)
      /var/lib/registry/ from data (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-cc24w (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  config:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  zarf-docker-registry-secret
    Optional:    false
  data:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  20Gi
  kube-api-access-cc24w:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                From               Message
  ----     ------     ----               ----               -------
  Normal   Scheduled  64s                default-scheduler  Successfully assigned zarf/zarf-docker-registry-59c964db5c-prgbs to jlaw-server
  Warning  Failed     30s (x2 over 54s)  kubelet            Failed to pull image "127.0.0.1:31109/library/registry:2.8.3": failed to pull and unpack image "127.0.0.1:31109/library/registry:2.8.3": failed to resolve reference "127.0.0.1:31109/library/registry:2.8.3": failed to do request: Head "https://127.0.0.1:31109/v2/library/registry/manifests/2.8.3": net/http: TLS handshake timeout
  Warning  Failed     30s (x2 over 54s)  kubelet            Error: ErrImagePull
  Normal   BackOff    15s (x2 over 53s)  kubelet            Back-off pulling image "127.0.0.1:31109/library/registry:2.8.3"
  Warning  Failed     15s (x2 over 53s)  kubelet            Error: ImagePullBackOff
  Normal   Pulling    3s (x3 over 64s)   kubelet            Pulling image "127.0.0.1:31109/library/registry:2.8.3"
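
One possible reading of the TLS handshake timeout above (not confirmed here) is containerd negotiating HTTPS against the plain-HTTP Zarf NodePort registry. If that turns out to be the cause, a hedged sketch of an RKE2 registries.yaml override:

  # Force a plain-HTTP endpoint for the Zarf registry NodePort.
  sudo mkdir -p /etc/rancher/rke2
  printf '%s\n' \
    'mirrors:' \
    '  "127.0.0.1:31109":' \
    '    endpoint:' \
    '      - "http://127.0.0.1:31109"' \
    | sudo tee /etc/rancher/rke2/registries.yaml

  # RKE2's embedded containerd only re-renders its registry config on service restart.
  sudo systemctl restart rke2-server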

gscallon avatar Apr 08 '24 20:04 gscallon

In the startup script, we should do the following (a rough sketch of these checks follows the list):

  • [ ] check whether k3s is running / taking up a socket / etc. Do a fresh install as needed
  • [ ] check whether containerd is running. Do a fresh install as needed
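
A rough sketch of those startup-script guards (the uninstall path assumes the standard k3s install script was used):

  #!/usr/bin/env bash
  set -euo pipefail

  # If a stray k3s service exists (e.g. from an earlier zarf init), stop and remove it so it
  # cannot hold the container runtime socket that RKE2 expects to own.
  if systemctl is-active --quiet k3s; then
    echo "k3s is active; stopping and uninstalling before RKE2 setup"
    sudo systemctl stop k3s
    if [ -x /usr/local/bin/k3s-uninstall.sh ]; then
      sudo /usr/local/bin/k3s-uninstall.sh
    fi
  fi

  # If containerd is not healthy, (re)install and start it before continuing.
  if ! systemctl is-active --quiet containerd; then
    echo "containerd is not running; reinstalling"
    sudo apt-get update && sudo apt-get install --reinstall -y containerd
    sudo systemctl enable --now containerd
  fi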

gscallon avatar Apr 08 '24 20:04 gscallon

Made progress and items are coming up for the rook-ceph cluster:

root@jlaw-server:~# k --namespace rook-ceph get cephcluster 

NOTE  Using config file
NAME        DATADIRHOSTPATH   MONCOUNT   AGE     PHASE         MESSAGE                                                                                                                     HEALTH   EXTERNAL   FSID
rook-ceph   /var/lib/rook     3          7m31s   Progressing   failed to perform validation before cluster creation: cannot start 3 mons on 1 node(s) when allowMultiplePerNode is false  

gscallon avatar Apr 08 '24 20:04 gscallon

We were able to get around the above issue by editing the upstream manifest in order to reduce the number of mons from 3 to 1.

This likely isn't a great long-term fix, and we should consider the following options (a patch sketch follows this list):

  • [ ] edit the upstream manifest for our deployment to allow multiple mons to run on one node, or else make an HA version of this deployment that spreads across multiple nodes
  • [ ] edit the number of mgrs, or else run them in a way that they don't conflict with each other (blocking each other on the same ports)
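
As a sketch of the first option against the live cluster (the spec fields are from the upstream CephCluster CRD; the right chart override path in the capability package isn't confirmed here):

  # Option A: single mon on the single node.
  kubectl -n rook-ceph patch cephcluster rook-ceph --type merge \
    -p '{"spec":{"mon":{"count":1,"allowMultiplePerNode":false}}}'

  # Option B: keep 3 mons but allow them to share the one node.
  kubectl -n rook-ceph patch cephcluster rook-ceph --type merge \
    -p '{"spec":{"mon":{"count":3,"allowMultiplePerNode":true}}}'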

QUESTIONS TO ANSWER:

  • what are the mons and mgrs? what's the best practice for setup? what will we need resource-wise?
  • what does this deployment look like for HA?

gscallon avatar Apr 08 '24 21:04 gscallon

We ran into issues with the blockpools that were being created in the cluster by rook-ceph.

Ceph research / info:

Definitions:

  • OSD - object store daemon - stores data, handles data replication, recovery, etc.
  • mon - Ceph Monitor - maintains maps of the cluster state. They're also responsible for managing authentication between daemons and clients.
  • mgr - Ceph Managers - a Ceph Manager daemon is responsible for keeping track of runtime metrics and the current state of the Ceph cluster (including things like storage utilization, current performance metrics, system load, etc.)
  • MDS - Ceph Metadata Server - stores metadata for the Ceph file system

What are pools? - pools are logical partitions that are used to store objects

Best practices:

  • have a ceph manager for each monitor (although not necessary)
  • at least 3 monitors are normally required for redundancy and high availability
  • at least 2 managers are normally required for high availability
  • at least 3 OSDs are normally required for redundancy and high availability

gscallon avatar Apr 09 '24 15:04 gscallon

Items for updating the Zarf package used for the sandbox env:

  • [ ] add rook-ceph capability Zarf package
  • [ ] modify rook-ceph capability charts in order to support single-node setup. (extra - leave possibility open for alternative HA configuration?). This includes the following changes:
    • [ ] reduce the number of monitors

EDIT (@justinthelaw): I created a GH issue on the original rook-ceph zarf-init capability that we are using.

gscallon avatar Apr 09 '24 16:04 gscallon

Rook-Ceph init + MinIO is progressing, but the PVC is not being created, possibly due to some Ceph cluster warnings (see the short diagnostic sketch after the status output below):

status:                                                         
  ceph:                                                         
    capacity: {}                                                
    details:                                                    
      MDS_SLOW_METADATA_IO:                                     
        message: 1 MDSs report slow metadata IOs                
        severity: HEALTH_WARN                                   
      PG_AVAILABILITY:                                          
        message: 'Reduced data availability: 60 pgs inactive'   
        severity: HEALTH_WARN                                                          
      TOO_FEW_OSDS:                                                                    
        message: OSD count 0 < osd_pool_default_size 1                                 
        severity: HEALTH_WARN                                                          
    fsid: a937d71c-eb6f-44ff-921e-80be147d4341                                         
    health: HEALTH_WARN                                                                
    lastChanged: "2024-04-23T18:38:01Z"                                                
    lastChecked: "2024-04-23T20:21:48Z"                                                
    previousHealth: HEALTH_OK   
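
The TOO_FEW_OSDS warning above means no OSDs came up at all, so the OSD prepare jobs are the first place to look; a short diagnostic sketch using the standard upstream Rook labels:

  # Check whether any OSDs exist and whether the prepare jobs ran on the node.
  kubectl -n rook-ceph get pods -l app=rook-ceph-osd
  kubectl -n rook-ceph get pods -l app=rook-ceph-osd-prepare

  # The prepare pod logs show why a device or path was skipped (no empty disk, filters, etc.).
  kubectl -n rook-ceph logs -l app=rook-ceph-osd-prepare --tail=100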

Automation of the entire deployment has been modularized for easier continued development and more robust air-gapping procedures in the future.

UDS Core tasks are now being used to automate workflows and local development, and remote task execution for creating/deploying UDS Core works properly.

Remaining question(s) to ask:

  1. Do we even need MinIO if we have Rook-Ceph?
  2. What does subdomain mapping look like on the RKE2 server node's host?

justinthelaw avatar Apr 23 '24 20:04 justinthelaw

New WIP PR to track for completion of this ticket: https://github.com/justinthelaw/uds-rke2/pull/7

New WIP PR to come (@gscallon) for the Longhorn + MinIO version of the Zarf Init.

  • [ ] Rook-Ceph Zarf Init working for single-node
  • [x] Skeletons for Longhorn or Local Path Provisioner, with MinIO, Zarf Init
  • [x] Refactored Zarf packages for air-gapping
  • [x] Refactored repo and charts structure for better alignment with UDS, Zarf, and Helm patterns
  • [x] Added more docs for general usage, storage layer bundle "flavors", and VM creation

justinthelaw avatar May 02 '24 16:05 justinthelaw

After discussion with Jon Andrew (Defense Unicorns, HNCD), we have gone ahead fully with Local Path Provisioner: https://github.com/justinthelaw/uds-rke2/pull/18
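
A minimal sketch of wiring that in so the earlier storage-class problems go away (local-path is the provisioner's standard storage class name; whether it ships as the default depends on the chart values used):

  # Confirm the provisioner's storage class exists, then mark it as the cluster default so
  # PVCs without an explicit storageClassName (e.g. the zarf registry's) can bind.
  kubectl get storageclass
  kubectl patch storageclass local-path -p \
    '{"metadata":{"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'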

Additionally, the release, test, and general workflow processes have been significantly improved with release-please and uds tasks (uds-cli).

justinthelaw avatar May 08 '24 22:05 justinthelaw

New issue created: https://github.com/justinthelaw/uds-rke2/issues/23

This closes out the base UDS RKE2 infrastructure. UDS bundling, overrides, and *.uds.dev (etc.) ingress are being worked through the above issue.

justinthelaw avatar May 28 '24 18:05 justinthelaw

Release of uds-rke2 0.4.2 (https://github.com/justinthelaw/uds-rke2/releases) completes this spike in its entirety.

justinthelaw avatar Jun 18 '24 14:06 justinthelaw