
Dan's Homelab Kubernetes Cluster - Operated through Kustomize & ArgoCD

Dan Manners' Homelab

Current status: BETA (but highly stable)

This project aims to use industry-standard tooling and practices, both to perform its functions and to serve as a reference repository for other people's learning and work.

🔍 Features

  • [x] Easy to replicate GitOps (see the sketch after this list)
  • [x] Modularity; make it easy to add/remove components
  • [x] Hybrid Multi-Cloud
  • [x] External DNS updates
  • [x] Automagic cert management
  • [x] In-Cluster Container Registry
  • [ ] Monitoring and alerting 🚧
  • [ ] Automated Backups 🚧
  • [x] ~~Cluster SSO through GitHub~~ - Removed when switching from K3s to Talos
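To make the GitOps and modularity items above concrete: each component lives under its own manifests/ path, so adding or removing one is a single ArgoCD Application. Below is a minimal sketch using the argocd CLI; the repository URL, application name, and namespace are placeholders to adapt for your own fork.

# Sketch: register one Kustomize path from this repo as an ArgoCD Application.
# The repo URL and namespace are placeholders; adjust them for your environment.
argocd app create excalidraw \
  --repo https://github.com/<your-fork>/homelab-kube-cluster.git \
  --path manifests/workloads/excalidraw \
  --dest-server https://kubernetes.default.svc \
  --dest-namespace excalidraw \
  --sync-policy automated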

💡 Current Tech Stack

| Name | Description |
| --- | --- |
| ArgoCD | GitOps for Kubernetes |
| AWS | Cloud Provider |
| Blocky | Fast and lightweight DNS proxy as ad-blocker |
| Buildah | Container Building |
| Cert-Manager | Certificate Manager |
| Cilium | CNI utilizing eBPF for Observability and Security |
| CloudNativePG | Kubernetes operator covering the lifecycle of HA PostgreSQL Clusters |
| CSI-Driver-NFS | Kubernetes NFS Driver for persistent storage |
| Dex | Federated OIDC Provider |
| External-DNS | Configure and manage External DNS servers |
| GitHub | Popular Code Management through Git |
| Grafana | Metrics Visualization |
| Helm | Kubernetes Package Management |
| Jenkins | Open-Source Automation Server |
| Kubernetes | Container Orchestration |
| Kyverno | Kubernetes Native Policy Management |
| Let's Encrypt | Free TLS certificates |
| Maddy | Composable all-in-one mail server |
| MetalLB | Kubernetes bare-metal Load Balancer |
| Microsoft Azure | Cloud Provider |
| Mozilla SOPS | Simple and flexible tool for managing secrets |
| Podman | Container and Pod management |
| Prometheus | Metrics and Data Collection |
| Python | Programming Language |
| Raspberry Pi | Baremetal ARM SoC Hardware! |
| Reloader | Kubernetes controller that watches ConfigMaps and Secrets and reloads pods |
| SonarQube | Static code analysis |
| Sonatype Nexus-OSS | Manage binaries and build artifacts |
| Talos | Secure, immutable, and minimal Linux OS |
| Tekton | Cloud-Native CI/CD |
| Terraform | Open-Source Infrastructure-as-Code |
| Terragrunt | Making Terraform DRY |
| Ubuntu | Operating System |
| Uptime Kuma | Fancy self-hosted system monitoring |
| Vaultwarden | Unofficial Bitwarden-compatible server written in Rust; formerly bitwarden_rs |
| WikiJS | Open-Source Wiki/Documentation Service |

Removed Tech Stack

Several items have previously been in my cluster, but have been removed over time for one reason or another. Those items can be found below.

| Name | Removal Reason | Description |
| --- | --- | --- |
| Ansible | I don't need host provisioning anymore | Ad-hoc system configuration-as-code |
| Amazon Linux 2 | I standardized on Talos OS | Operating System |
| Flannel CNI | I migrated to Cilium for my CNI | Network Fabric for Containers |
| K3s | I moved to Talos for Kubernetes | Lightweight Kubernetes |
| Harbor | Kept crashing; logging was nearly useless. Might go back to it later, TBD. | Open Source Container and Helm Registry |
| KubeLogin | This was more of a hassle than anything, but worked perfectly well | kubectl plugin for Kubernetes OpenID Connect authentication |
| NGINX | This was no longer necessary with my edge Talos OS nodes | Open-Source Web Server and Reverse Proxy |
| Proxmox | Moved to fully baremetal Kubernetes on Talos OS | Virtualization Platform |
| QEMU Guest Agent | No longer necessary without virtualized on-prem infrastructure | Provides access to a system-level agent via standard QMP commands |
| QNAP | After an OS upgrade, NFS broke; built a new array utilizing ZFS | Storage Array Hardware and Networking |
| Rocky Linux | I standardized on Talos OS | Open-Source Enterprise Linux; Spiritual successor to CentOS |
| Turing Pi 1 | I can't run Talos OS on the Turing Pi CM3+ nodes | Raspberry Pi Compute Module Clustering |
| Turing Pi 2 | The persistent storage options weren't enough with Talos OS | Raspberry Pi Compute Module Clustering |

Services Hosted

| Name | Description | Path | Relevant Link |
| --- | --- | --- | --- |
| Excalidraw | Easy whiteboarding with excellent shortcuts! | manifests/workloads/excalidraw | GitHub - excalidraw/excalidraw |
| Jenkins OSS | An older tool, sir, but it checks out. | manifests/workloads/jenkins-oss | Website |
| Kube-Prometheus-Stack | Easy to deploy Grafana, Prometheus rules, and the Prometheus Operator. | manifests/workloads/kube-prometheus-stack-grafana | GitHub - prometheus-community/helm-charts |
| Memegen | The free and open source API to generate memes. | manifests/workloads/memegen | GitHub - jacebrowning/memegen |
| Node-Feature-Discovery | Node feature discovery for Kubernetes | manifests/workloads/node-feature-discovery | GitHub - kubernetes-sigs/node-feature-discovery |
| OpenFaaS | Serverless functions, made simple! | manifests/workloads/openfaas-ingress | Website |
| SonarQube OSS | Code quality and code security | manifests/workloads/sonarqube-oss | Website |
| Spiderfoot | Automated OSINT webcrawling | manifests/workloads/spiderfoot | Website |
| Traefik | Cloud native application proxying; simplifying network complexity | manifests/bootstrapping/traefik | Website |
| WikiJS | The most powerful and extensible open source Wiki software | manifests/workloads/wikijs | Website |
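ArgoCD normally reconciles these paths automatically, but each one can also be rendered or applied by hand. A minimal sketch, assuming the listed path contains a kustomization.yaml:

# Preview the rendered manifests for a single workload.
kubectl kustomize manifests/workloads/excalidraw

# Apply it directly with Kustomize (ArgoCD normally handles this).
kubectl apply -k manifests/workloads/excalidraw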

Deprecated Services

The services listed below once existed in the cluster, but have since been removed for one reason or another.

| Name | Deprecation Reason | Description | Path | Relevant Link |
| --- | --- | --- | --- | --- |
| Luzifer - One Time Secret | No longer using it | One-Time-Secret sharing platform with symmetric 256-bit AES encryption in the browser | manifests/workloads/luzifer-ots | GitHub - Luzifer/ots |
| Non-Disclosure-Agreement | No longer using it | Flask app to obfuscate URLs and strings for obfuscated sharing of information | manifests/workloads/non-disclosure-agreement | GitHub - danmanners/non-disclosure-agreement |
| Open Policy Agent | Made obsolete by Kyverno | Policy-based control for cloud native environments | manifests/workloads/open-policy-agent | Website |
| Proxmox | No longer using virtualization in my on-prem homelab | Compute, network, and storage in a single solution | N/A | Website |
| Rancher Upgrade Controller | Removed from the cluster when I moved away from K3s | In ur Kubernetes, upgrading ur nodes | manifests/workloads/k3s-upgrade-controller | GitHub - rancher/system-upgrade-controller |

🔧 Hardware

Below is a list of the hardware (both physical and virtual) in use for this project.

🖥 On-Prem Systems


Baremetal Talos Hosts

| Count | System Type | CPU Type | CPU Cores | Memory |
| --- | --- | --- | --- | --- |
| 1 | Desktop | Intel Core i7-7700 | 4c8t | 64GiB |
| 1 | Desktop | AMD Ryzen 7 5800X | 8c16t | 64GiB |
| 1 | Desktop | Intel Celeron J4105 | 4c4t | 16GiB |
| 1 | Desktop | AMD Ryzen 5 3400G | 4c8t | 32GiB |

Cluster Boards

| Count | System Type | CPU Type | CPU Cores | Memory |
| --- | --- | --- | --- | --- |
| 1 | DeskPi Super6C | 4x Raspberry Pi CM4 | 4c4t | 4x 8GiB |
| ~~1~~ | ~~Turing Pi 2~~ | ~~4x Raspberry Pi CM4~~ | ~~4c4t~~ | ~~4x 8GiB~~ |
| ~~1~~ | ~~Turing Pi 1~~ | ~~7x Raspberry Pi CM3+~~ | ~~4c4t~~ | ~~7x 1GiB~~ |
| ~~1~~ | ~~Turing Pi 1~~ | ~~3x Raspberry Pi CM3+~~ | ~~4c4t~~ | ~~3x 1GiB~~ |

Additional Compute

| Count | System Type | CPU Type | CPU Cores | Memory |
| --- | --- | --- | --- | --- |
| 1 | Raspberry Pi 4 | Raspberry Pi CM4 | 4c4t | 4GiB |

Storage

| Hardware | Drives | Memory | CPU |
| --- | --- | --- | --- |
| Custom Build | 3x 2.7TiB 7200RPM, 3x 3.6TiB 7200RPM, 2x 512GiB SSD, 2x 512GiB NVMe | 32GiB | Intel 12th-Gen 12600 |
| ~~QNAP TS-332X~~ | ~~3x M.2, 3x 3.5" 7200RPM~~ | ~~16GiB~~ | ~~Alpine AL-324~~ |

Networking

| Hardware | SFP+ Ports | SFP Ports | 1Gb Eth Ports |
| --- | --- | --- | --- |
| Ubiquiti EdgeSwitch 24 Lite | 0 | 2 | 24 |
| Ubiquiti EdgeSwitch 8 150W | 0 | 2 | 8 |
| Mikrotik CRS305-1G-4S+ | 4 | 0 | 1 (PoE In) |

Cloud Hosted Resources

| Name | Provider | Arch | Instance Size | CPU | Memory |
| --- | --- | --- | --- | --- | --- |
| talos-azure-vm01 | Azure | amd64 | Standard B2s | 2vCPU | 4GiB |
| talos-aws-grav01 | AWS | arm64 | t4g.small | 2vCPU | 2GiB |
| ~~tpi-k3s-aws-edge~~ | ~~AWS~~ | ~~arm64~~ | ~~t4g.small~~ | ~~2vCPU~~ | ~~2GiB~~ |
| ~~tpi-k3s-aws-edge~~ | ~~AWS~~ | ~~amd64~~ | ~~t3.medium~~ | ~~2vCPU~~ | ~~4GiB~~ |
| ~~tpi-k3s-azure-edge~~ | ~~Azure~~ | ~~amd64~~ | ~~Standard B2s~~ | ~~2vCPU~~ | ~~4GiB~~ |

Deployment Order of Operations

While this section is very much a work in progress, I'd like to provide some relevant information on which core services must be deployed and in which order.

  1. Talos Linux
  2. Cilium CNI
  3. MetalLB
  4. Cert-Manager
  5. External-DNS
  6. Traefik
  7. ArgoCD - Part One
  8. ArgoCD - Part Two
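As a rough illustration of that ordering, here is a minimal bootstrap sketch using the upstream Helm charts. Chart values are environment-specific (this repo drives them through Kustomize and ArgoCD), so treat these commands as an outline rather than the exact procedure.

# 1-2. After talosctl bootstrap, install Cilium first so the nodes become Ready.
helm repo add cilium https://helm.cilium.io/
helm install cilium cilium/cilium -n kube-system

# 3-4. Load balancing and certificates (values omitted; see the relevant directories).
helm repo add metallb https://metallb.github.io/metallb
helm install metallb metallb/metallb -n metallb-system --create-namespace
helm repo add jetstack https://charts.jetstack.io
helm install cert-manager jetstack/cert-manager -n cert-manager --create-namespace --set installCRDs=true

# 7-8. ArgoCD comes last; once it is up, it reconciles everything else from Git.
helm repo add argo https://argoproj.github.io/argo-helm
helm install argocd argo/argo-cd -n argocd --create-namespace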

Identifying Problems, Troubleshooting Steps, and more

Below are a few things that may be beneficial to you when troubleshooting or getting things up and operational.

Traffic is not getting from the edge (cloud) nodes to the on-prem cluster networking

You can validate whether or not your remote traffic is making it on-site by using dig inside the netshoot container:

kubectl run temp-troubleshooting \
  --rm -it -n default \
  --overrides='{"apiVersion":"v1","spec":{"nodeSelector":{"kubernetes.io/hostname":"talos-aws-grav01"}}}' \
  --pod-running-timeout 3m \
  --image=docker.io/nicolaka/netshoot:latest \
  --command -- /bin/bash

Then, you can validate that you can reach CoreDNS or another pod/service IP from your remote node.
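For example, from inside the netshoot pod you can query CoreDNS directly. The cluster DNS service IP varies by environment, so look it up first (it is typically exposed through the kube-dns Service):

# From your workstation: find the cluster DNS service IP.
kubectl get svc -n kube-system kube-dns -o jsonpath='{.spec.clusterIP}'

# From inside the netshoot pod: query CoreDNS directly (substitute the IP from above).
dig @10.96.0.10 kubernetes.default.svc.cluster.local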

If you can prove it is not working, you may want to restart all of Cilium:

kubectl rollout restart -n kube-system daemonset cilium
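After the restart, it can help to watch the rollout and check agent health before retesting; the cilium status command below runs inside the agent pods themselves:

# Wait for the Cilium DaemonSet to finish rolling out.
kubectl rollout status -n kube-system daemonset cilium

# Quick health check from one of the Cilium agent pods.
kubectl exec -n kube-system ds/cilium -- cilium status --brief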

To-Do Items

  • Ensure that ALL services are tagged for the appropriate hardware (arm64 or amd64) to ensure runtime success
    • Alternatively, ensure that all containers are built for multi-architecture (see the sketch below this list).
  • Ensure that ALL application and service subdirectories have READMEs explaining what they're doing and what someone else may need to modify for their own environment
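For the multi-architecture item above, here is a minimal sketch of building and pushing a multi-arch image with Podman (already part of the stack); the image and registry names are placeholders:

# Build a manifest list covering both node architectures in the cluster.
podman build --platform linux/amd64,linux/arm64 --manifest registry.example.com/myapp:latest .

# Push the manifest list (and both images) to the registry.
podman manifest push registry.example.com/myapp:latest docker://registry.example.com/myapp:latest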

Gratitude and Thanks

This README redesign was inspired by several other homelab repos, individuals, and communities.

Individuals


Communities


  • The DevOps Lounge (Discord)
  • K8s-at-Home (Discord)

Without the inspiration and help of these individuals and communities, I don't think my own project would be nearly as far along. Make sure to check out their projects as well!