K8s operator for per-user stateful workloads

Selkies - Stateful Workload Operator

Selkies is a platform built on GKE to orchestrate per-user stateful workloads.

Quick start

Assumptions

  • You are a member of a Google Cloud organization.
    • This is required for setup/scripts/create_oauth_client.sh to use gcloud alpha iap oauth-brand commands, because these implicitly operate on organization-internal brands. For more information, see this guide.
  • You are granted the Owner role in a project in that organization.
  • You have gcloud installed in your environment.
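
The assumptions above can be partly checked from the command line. The sketch below is illustrative and not part of the Selkies repository: require_cmd and has_org_ancestor are hypothetical helpers, and the second one assumes that gcloud projects get-ancestors prints one "ID TYPE" row per ancestor, which may differ between gcloud versions.

```shell
# Hypothetical preflight helpers; not part of the Selkies repository.

# Succeeds if the named command is on PATH; otherwise prints a message and fails.
require_cmd() {
  if ! command -v "$1" >/dev/null 2>&1; then
    echo "missing required command: $1" >&2
    return 1
  fi
}

# Reads "ID TYPE" rows on stdin (assumed output shape of
# `gcloud projects get-ancestors`) and succeeds only if at least one
# row has the type "organization".
has_org_ancestor() {
  awk '$2 == "organization" { found = 1 } END { exit found ? 0 : 1 }'
}
```

Once PROJECT_ID is set as in step 2, the two helpers could be combined like this:

    require_cmd gcloud && gcloud projects get-ancestors "${PROJECT_ID?}" | has_org_ancestor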

Steps

The steps below will create the infrastructure for the app launcher. You should deploy to a new project.
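
Because the steps assume a new project, it can help to check a candidate project ID before creating it. The following is a minimal sketch, not part of the Selkies repository: valid_project_id is a hypothetical helper that checks an ID against the Google Cloud project ID format (6 to 30 characters; lowercase letters, digits, and hyphens; starting with a letter and not ending with a hyphen).

```shell
# Hypothetical helper; not part of the Selkies repository.
# Succeeds if $1 looks like a valid Google Cloud project ID:
#   - 6 to 30 characters
#   - lowercase letters, digits, and hyphens only
#   - starts with a letter, does not end with a hyphen
valid_project_id() {
  # LC_ALL=C keeps [a-z] from matching uppercase letters in some locales.
  printf '%s\n' "$1" | LC_ALL=C grep -Eq '^[a-z][a-z0-9-]{4,28}[a-z0-9]$'
}
```

A possible use, assuming you want to create the project from the command line:

    valid_project_id "${PROJECT_ID?}" && gcloud projects create "${PROJECT_ID?}"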

  1. Clone the source repository:

    git clone -b master https://github.com/selkies-project/selkies.git
    cd selkies
    
  2. Configure gcloud (replace XXX and us-west1 with your project ID and preferred region):

    export PROJECT_ID=XXX
    export REGION=us-west1
    gcloud config set project ${PROJECT_ID?}
    gcloud config set compute/region ${REGION?}
    
  3. Enable the required GCP project services:

    gcloud services enable \
        --project ${PROJECT_ID?} \
        cloudresourcemanager.googleapis.com \
        compute.googleapis.com \
        container.googleapis.com \
        cloudbuild.googleapis.com \
        servicemanagement.googleapis.com \
        serviceusage.googleapis.com \
        stackdriver.googleapis.com \
        secretmanager.googleapis.com \
        iap.googleapis.com
    
  4. Grant the Cloud Build service account permissions on your project:

    PROJECT_NUMBER=$(
      gcloud projects describe ${PROJECT_ID?} \
        --format='value(projectNumber)'
    ) && \
      CLOUDBUILD_SA="${PROJECT_NUMBER?}@cloudbuild.gserviceaccount.com" && \
      gcloud projects add-iam-policy-binding ${PROJECT_ID?} \
        --member serviceAccount:${CLOUDBUILD_SA?} \
        --role roles/owner && \
      gcloud projects add-iam-policy-binding ${PROJECT_ID?} \
        --member serviceAccount:${CLOUDBUILD_SA?} \
        --role roles/iam.serviceAccountTokenCreator
    
  5. Deploy with Cloud Build:

    ACCOUNT=$(gcloud config get-value account) && \
      gcloud builds submit \
        --project=${PROJECT_ID?} \
        --substitutions=_USER=${ACCOUNT?},_REGION=${REGION?}
    
  6. Deploy the sample app:

    (cd examples/jupyter-notebook/ && \
      gcloud builds submit \
        --project=${PROJECT_ID?} \
        --substitutions=_REGION=${REGION?})
    
  7. Connect to the App Launcher web interface at the URL output below:

    echo "https://broker.endpoints.${PROJECT_ID?}.cloud.goog/"
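
The commands in the steps above write variables as ${PROJECT_ID?} rather than $PROJECT_ID. The trailing ? makes the shell stop with an error if the variable is unset, instead of silently expanding to an empty string (it does not catch a variable that is set but empty; ${VAR:?} covers that case too). The sketch below illustrates the guard together with the two values derived in steps 4 and 7. The function names cloudbuild_sa and broker_url and the variable DEMO_UNSET_VAR are hypothetical and not part of the Selkies repository.

```shell
# Hypothetical helpers; not part of the Selkies repository.

# Cloud Build service account email for a given project number (step 4).
cloudbuild_sa() {
  echo "${1?usage: cloudbuild_sa PROJECT_NUMBER}@cloudbuild.gserviceaccount.com"
}

# App Launcher URL for a given project ID (step 7).
broker_url() {
  echo "https://broker.endpoints.${1?usage: broker_url PROJECT_ID}.cloud.goog/"
}

# The guard in action: the subshell aborts because DEMO_UNSET_VAR is unset,
# so the else branch runs.
unset DEMO_UNSET_VAR
if ( : "${DEMO_UNSET_VAR?}" ) 2>/dev/null; then
  echo "guard did not fire"
else
  echo "guard fired: DEMO_UNSET_VAR is unset"
fi
```

Running the guard in a subshell keeps the demonstration from terminating the calling script; in the steps themselves the abort is the desired behavior.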
    

Troubleshooting

  • If the initial Cloud Build run fails with the message Step #2 - "create-oauth-client": ERROR: (gcloud.alpha.iap.oauth-brands.list) INVALID_ARGUMENT: Request contains an invalid argument., the most likely cause is running as a user who is not a member of the Cloud Identity organization. See the first assumption above.

  • If the initial Cloud Build run fails with the message Step #2 - "create-oauth-client": ERROR: (gcloud.alpha.iap.oauth-clients.create) FAILED_PRECONDITION: Precondition check failed., the most likely cause is reusing a project whose OAuth consent screen was already set to "External", which cannot be changed via gcloud. Click the "MAKE INTERNAL" button here in your project.

  • If a wget step fails, retry the same command. Some third-party artifact URLs are unreliable because their hosts apply global rate limits.

  • If your region only has 500 GB of Persistent Disk SSD quota, run the following to set smaller node pool disk sizes. Keep in mind that this will affect the number of apps you can run and image pull performance.

    cat - > selkies-min-ssd.auto.tfvars <<EOF
    default_pool_disk_size_gb = 100
    turn_pool_disk_size_gb = 100
    gpu_cos_pool_disk_size_gb = 100
    tier1_pool_disk_size_gb = 100
    EOF
    
    gcloud secrets create broker-tfvars-selkies-min-ssd \
        --replication-policy=automatic \
        --data-file selkies-min-ssd.auto.tfvars
    
  • If the load balancer never comes online and you still receive 500 errors 30 minutes or more after the deployment has completed, the autoneg controller annotation may need to be reset:

    gcloud container clusters get-credentials broker-${REGION?}
    
    ./setup/scripts/fix_autoneg.sh
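
For the transient wget failures mentioned above, retrying by hand can be replaced with a small wrapper. This is a minimal sketch, not part of the Selkies repository: retry is a hypothetical helper, and the attempt count and delay are arbitrary illustrative values.

```shell
# Hypothetical helper; not part of the Selkies repository.
# Usage: retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached,
# sleeping DELAY_SECONDS between attempts.
retry() {
  local max_attempts="$1" delay="$2" attempt=1
  shift 2
  while :; do
    # Return immediately once the command succeeds.
    "$@" && return 0
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "command failed after ${attempt} attempts: $*" >&2
      return 1
    fi
    echo "attempt ${attempt} failed; retrying in ${delay}s" >&2
    sleep "$delay"
    attempt=$((attempt + 1))
  done
}
```

For example, the deployment from step 5 could be wrapped as follows. Each retry reruns the whole build, so a small attempt count is a sensible starting point:

    retry 3 30 gcloud builds submit \
        --project=${PROJECT_ID?} \
        --substitutions=_USER=${ACCOUNT?},_REGION=${REGION?}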