homelab-kube-cluster
Dan's Homelab Kubernetes Cluster - Operated through Kustomize & ArgoCD
Dan Manners' Homelab
Current status: BETA (but is highly stable)
This project aims to use industry-standard tooling and practices both to perform its functions and to serve as a repository that people can reference for their own learning and work.
🔍 Features
- [x] Easy to replicate GitOps
- [x] Modularity; make it easy to add/remove components
- [x] Hybrid Multi-Cloud
- [x] External DNS updates
- [x] Automagic cert management
- [x] In-Cluster Container Registry
- [ ] Monitoring and alerting 🚧
- [ ] Automated Backups 🚧
- [x] ~~Cluster SSO through GitHub~~ - Removed when switching from K3s to Talos
💡 Current Tech Stack
Name | Description |
---|---|
ArgoCD | GitOps for Kubernetes |
AWS | Cloud Provider |
Blocky | Fast and lightweight DNS proxy as ad-blocker |
Buildah | Container Building |
Cert-Manager | Certificate Manager |
Cilium | CNI utilizing eBPF for Observability and Security |
CloudNativePG | Kubernetes operator covering lifecycle of HA PostgreSQL Clusters |
CSI-Driver-NFS | Kubernetes NFS Driver for persistent storage |
Dex | Federated OIDC |
External-DNS | Configure and manage External DNS servers |
GitHub | Popular Code Management through Git |
Grafana | Metrics Visualization |
Helm | Kubernetes Package Management |
Jenkins | Open-Source Automation Server |
Kubernetes | Container Orchestration |
Kyverno | Kubernetes Native Policy Management |
Let's Encrypt | Free TLS certificates |
Maddy | Composable all-in-one mail server |
MetalLB | Kubernetes bare-metal Load Balancer |
Microsoft Azure | Cloud Provider |
Mozilla SOPS | Simple and flexible tool for managing secrets |
Podman | Container and Pod management |
Prometheus | Metrics and Data Collection |
Python | Python Programming Language |
Raspberry Pi | Baremetal ARM SoC Hardware! |
Reloader | Kubernetes controller that watches ConfigMaps and Secrets and reloads pods when they change |
SonarQube | Static code analysis |
Sonatype Nexus-OSS | Manage binaries and build artifacts |
Talos | Secure, immutable, and minimal Linux OS |
Tekton | Cloud-Native CI/CD |
Terraform | Open-Source Infrastructure-as-Code |
Terragrunt | Making Terraform DRY |
Ubuntu | Operating System |
Uptime Kuma | Fancy self-hosted system monitoring |
Vaultwarden | Unofficial Bitwarden compatible server written in Rust; formerly bitwarden_rs |
WikiJS | Open-Source Wiki/Documentation Service |
Removed Tech Stack
Several items were previously in my cluster but have been removed over time for one reason or another. Those items can be found below.
Name | Removal Reason | Description |
---|---|---|
Ansible | I don't need host provisioning anymore | Ad-hoc system configuration-as-code |
Amazon Linux 2 | I standardized on Talos OS | Operating System |
Flannel CNI | I migrated to Cilium for my CNI | Network Fabric for Containers |
K3s | I moved to Talos for Kubernetes | Lightweight Kubernetes |
Harbor | Kept crashing; logging was nearly useless. Might go back to it later, TBD. | Open Source Container and Helm Registry |
KubeLogin | This was more of a hassle than anything, but worked perfectly well | kubectl plugin for Kubernetes OpenID Connect authentication |
NGINX | This was no longer necessary with my edge TalosOS nodes | Open-Source Web Server and Reverse Proxy |
Proxmox | Moved to fully baremetal Kubernetes on Talos OS | Virtualization Platform |
QEMU Guest Agent | No longer necessary without virtualized on-prem infrastructure | Provides access to a system-level agent via standard QMP commands |
QNAP | After an OS upgrade, NFS broke; built a new Array utilizing ZFS | Storage Array Hardware and Networking |
Rocky Linux | I standardized on Talos OS | Open-Source Enterprise Linux; Spiritual successor to CentOS |
Turing Pi 1 | I can't run Talos OS on the Turing Pi CM3+ nodes | Raspberry Pi Compute Module Clustering |
Turing Pi 2 | The persistent storage options weren't enough with TalosOS | Raspberry Pi Compute Module Clustering |
Services Hosted
Name | Description | Path | Relevant Link |
---|---|---|---|
Excalidraw | Easy whiteboarding with excellent shortcuts! | manifests/workloads/excalidraw | GitHub - excalidraw/excalidraw |
Jenkins OSS | An older tool sir, but it checks out. | manifests/workloads/jenkins-oss | Website |
Kube-Prometheus-Stack | Easy to deploy Grafana, Prometheus rules, and the Prometheus Operator. | manifests/workloads/kube-prometheus-stack-grafana | GitHub - prometheus-community/helm-charts |
Memegen | The free and open source API to generate memes. | manifests/workloads/memegen | GitHub - jacebrowning/memegen |
Node-Feature-Discovery | Node feature discovery for Kubernetes | manifests/workloads/node-feature-discovery | GitHub - kubernetes-sigs/node-feature-discovery |
OpenFaaS | Serverless functions, made simple! | manifests/workloads/openfaas-ingress | Website |
SonarQube OSS | Code quality and code security | manifests/workloads/sonarqube-oss | Website |
Spiderfoot | Automated OSINT webcrawling | manifests/workloads/spiderfoot | Website |
Traefik | Cloud native application proxying; simplifying network complexity | manifests/bootstrapping/traefik | Website |
WikiJS | The most powerful and extensible open source Wiki software | manifests/workloads/wikijs | Website |
Deprecated Services
The services listed below once existed in the cluster, but have since been removed for one reason or another
Name | Deprecation Reason | Description | Path | Relevant Link |
---|---|---|---|---|
Luzifer - One Time Secret | No longer using it | One-Time-Secret sharing platform with a symmetric 256bit AES encryption in the browser. | manifests/workloads/luzifer-ots | GitHub - Luzifer/ots |
Non-Disclosure-Agreement | No longer using it | Flask app to obfuscate URLs and strings for obfuscated sharing of information. | manifests/workloads/non-disclosure-agreement | GitHub - danmaners/non-disclosure-agreement |
Open Policy Agent | Made obsolete by Kyverno | Policy-based control for cloud native environments | manifests/workloads/open-policy-agent | Website |
Proxmox | No longer using virtualization in my on-prem homelab | Compute, network, and storage in a single solution | N/A | Website |
Rancher Upgrade Controller | Removed from the cluster when I moved away from K3s | In ur Kubernetes, upgrading ur nodes | manifests/workloads/k3s-upgrade-controller | GitHub - rancher/system-upgrade-controller |
🔧 Hardware
Below is a list of the hardware (both physical and virtual) in use on this project
🖥 On-Prem Systems
Baremetal Talos Hosts
Count | System Type | CPU Type | CPU Cores | Memory |
---|---|---|---|---|
1 | Desktop | Intel Core i7-7700 | 4c8t | 64GiB |
1 | Desktop | AMD Ryzen 7 5800X | 8c16t | 64GiB |
1 | Desktop | Intel Celeron J4105 | 4c4t | 16GiB |
1 | Desktop | AMD Ryzen 5 3400G | 4c8t | 32GiB |
Cluster Boards
Count | System Type | CPU Type | CPU Cores | Memory |
---|---|---|---|---|
1 | DeskPi Super6C | 4x Raspberry Pi CM4 | 4c4t | 4x 8GiB |
~~1~~ | ~~Turing Pi 2~~ | ~~4x Raspberry Pi CM4~~ | ~~4c4t~~ | ~~4x 8GiB~~ |
~~1~~ | ~~Turing Pi 1~~ | ~~7x Raspberry Pi CM3+~~ | ~~4c4t~~ | ~~7x 1GiB~~ |
~~1~~ | ~~Turing Pi 1~~ | ~~3x Raspberry Pi CM3+~~ | ~~4c4t~~ | ~~3x 1GiB~~ |
Additional Compute
Count | System Type | CPU Type | CPU Cores | Memory |
---|---|---|---|---|
1 | Raspberry Pi 4 | Raspberry Pi CM4 | 4c4t | 4GiB |
Storage
Hardware | Drive Count | Memory | CPU |
---|---|---|---|
Custom Build | 3x 2.7TiB 7200RPM, 3x 3.6TiB 7200RPM, 2x 512GiB SSD, 2x 512GiB NVMe | 32GiB | Intel 12th-Gen 12600 |
~~QNAP TS-332X~~ | ~~3x M.2, 3x 3.5" 7200RPM~~ | ~~16GiB~~ | ~~Alpine AL-324~~ |
Networking
Hardware | SFP+ Ports | SFP Ports | 1Gb Eth Ports |
---|---|---|---|
Ubiquiti EdgeSwitch 24 Lite | 0 | 2 | 24 |
Ubiquiti EdgeSwitch 8 150W | 0 | 2 | 8 |
Mikrotik CRS305-1G-4S+ | 4 | 0 | 1 (PoE In) |
Cloud Hosted Resources
Name | Provider | Arch | Instance Size | CPU | Memory |
---|---|---|---|---|---|
talos-azure-vm01 | Azure | amd64 | Standard B2s | 2vCPU | 4GiB |
talos-aws-grav01 | AWS | arm64 | t4g.small | 2vCPU | 2GiB |
~~tpi-k3s-aws-edge~~ | ~~AWS~~ | ~~arm64~~ | ~~t4g.small~~ | ~~2vCPU~~ | ~~2GiB~~ |
~~tpi-k3s-aws-edge~~ | ~~AWS~~ | ~~amd64~~ | ~~t3.medium~~ | ~~2vCPU~~ | ~~4GiB~~ |
~~tpi-k3s-azure-edge~~ | ~~Azure~~ | ~~amd64~~ | ~~Standard B2s~~ | ~~2vCPU~~ | ~~4GiB~~ |
Deployment Order of Operations
While this section is very much a Work-in-Progress, I'd like to provide some relevant information on core services that must be deployed and in which order.
- Talos Linux
- Cilium CNI
- MetalLB
- Cert-Manager
- External-DNS
- Traefik
- ArgoCD - Part One
- ArgoCD - Part Two
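The ordering above can be sketched as a small shell helper that applies each in-cluster component in turn and stops at the first failure. This is only an illustration, not the repository's actual bootstrap process: it assumes each component can be applied as a plain Kustomize directory, and every directory name except `manifests/bootstrapping/traefik` is hypothetical. Talos Linux itself and both ArgoCD steps are left out because they are not simple `kubectl apply` operations.

```shell
# Apply bootstrap components in dependency order, stopping at the first failure.
# Assumption: one Kustomize directory per component under the base path.
# Only "traefik" is a directory name taken from this README; the rest are hypothetical.
bootstrap_in_order() {
  local base="${1:-manifests/bootstrapping}"
  local component
  for component in cilium metallb cert-manager external-dns traefik; do
    echo "Applying ${component}..."
    kubectl apply -k "${base}/${component}" || {
      echo "Failed on ${component}; later components were not applied." >&2
      return 1
    }
  done
}
```

Calling `bootstrap_in_order` with no arguments uses `manifests/bootstrapping` as the base path; pass a different path as the first argument if your layout differs. Components that need Helm rendering would need a different apply command.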
Identifying Problems, Troubleshooting Steps, and more
Below are a few things that may be beneficial to you when troubleshooting or getting things up and operational
Traffic is not getting from the edge (cloud) nodes to the on-prem cluster networking
You can validate whether your remote traffic is making it on site by using `dig` inside a netshoot container scheduled on the edge node:
kubectl run temp-troubleshooting \
--rm -it -n default \
--overrides='{"apiVersion":"v1","spec":{"nodeSelector":{"kubernetes.io/hostname":"talos-aws-grav01"}}}' \
--pod-running-timeout 3m \
--image=docker.io/nicolaka/netshoot:latest \
--command -- /bin/bash
Then, you can validate that you can reach CoreDNS or another pod/service IP from your remote node.
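As a rough sketch of that check, the helper below can be pasted into the netshoot shell. It queries a resolver with `dig` and reports whether any answer came back. The function name `check_dns` is made up for this example, and the resolver IP in the usage note is an assumption; look up the real ClusterIP with `kubectl get svc -n kube-system`.

```shell
# Run inside the netshoot pod. Succeeds only if the resolver returns an answer.
check_dns() {
  local server="$1"   # resolver IP to query, e.g. the CoreDNS Service ClusterIP
  local name="$2"     # record to look up
  local answer
  # +short prints only the answer data; +time and +tries keep failures fast.
  answer="$(dig +short +time=2 +tries=1 "@${server}" "${name}")"
  if [ -n "${answer}" ]; then
    echo "OK: ${name} via ${server} -> ${answer}"
  else
    echo "FAIL: no answer for ${name} from ${server}" >&2
    return 1
  fi
}
```

For example, `check_dns 10.96.0.10 kubernetes.default.svc.cluster.local`, where `10.96.0.10` is a commonly used default DNS Service address and may not match your cluster.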
If you can prove it is not working, you may want to restart all of Cilium:
kubectl rollout restart -n kube-system daemonset cilium
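A restart only helps once the new Cilium pods are actually ready, so it can be useful to wait for the rollout before re-testing. The wrapper below is a minimal sketch: the function name and the five-minute timeout are arbitrary choices for this example, and the namespace defaults to `kube-system` as in the command above.

```shell
# Restart the Cilium DaemonSet, then block until every pod is ready again
# (or the timeout expires, in which case the function returns non-zero).
restart_cilium_and_wait() {
  local ns="${1:-kube-system}"
  kubectl rollout restart -n "${ns}" daemonset cilium &&
    kubectl rollout status -n "${ns}" daemonset cilium --timeout=5m
}
```

Once `restart_cilium_and_wait` returns successfully, repeat the `dig` check from the edge node.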
To-Do Items
- Ensure that ALL services are tagged for the appropriate hardware (`arm64` or `amd64`) to ensure runtime success
  - Alternatively, ensure that all containers are built for multi-architecture.
- Ensure that ALL application and service subdirectories have READMEs explaining what they're doing and what someone else may need to modify for their own environment
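For the architecture item, one possible starting point is to compare what each node reports with what each workload requests. The sketch below uses the well-known `kubernetes.io/arch` node label. The function name is hypothetical, and it only inspects Deployments and their `nodeSelector`, so DaemonSets, StatefulSets, and workloads pinned through node affinity would need a similar check.

```shell
# Show node architectures next to each Deployment's node selector.
show_arch_coverage() {
  # Architecture reported by each node (well-known label kubernetes.io/arch).
  kubectl get nodes -L kubernetes.io/arch
  # Node selector of every Deployment; "<none>" means it may be scheduled
  # onto a node of any architecture.
  kubectl get deployments -A \
    -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,NODE_SELECTOR:.spec.template.spec.nodeSelector'
}
```

Any Deployment showing `<none>` either needs an architecture selector or a multi-architecture image.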
Gratitude and Thanks
This README redesign was inspired by several other homelab repos, individuals, and communities.
Individuals
Communities
The DevOps Lounge
K8s-at-Home
Without the inspiration and help of these individuals and communities, I don't think my own project would be nearly as far. Make sure to check out their projects as well!