
[Epic] `SIZE` Parameter in `CREATE SOURCE`

Open · nmeagan11 opened this issue 2 years ago • 12 comments

Feature request

Internal Background

We may also want to allow an ALTER SOURCE statement so that users can size down their throughput parameter after, say, ingesting an initial Postgres snapshot.

The work for this epic requires discussion and agreement on the syntax/UX (from product or the surfaces team), followed by the actual implementation by the storage team.
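
To make the discussion concrete, here is a minimal sketch of the syntax under consideration; the option spelling, connection details, and size labels are illustrative, not final (the '4-1'/'4-4' labels are the ones QA later exercised):

```sql
-- Create a source with an explicit size (illustrative syntax).
CREATE SOURCE pg_source
  FROM POSTGRES CONNECTION pg_conn (PUBLICATION 'mz_publication')
  WITH (SIZE '4-4');

-- Once the initial snapshot has been ingested, size the source back down.
ALTER SOURCE pg_source SET (SIZE '4-1');
```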

Time Horizon

Small

Work items

  • [x] Ratify design document
  • [x] #13881
  • [x] Update parser to support SIZE parameter~~, with the default being the smallest size available~~
  • [x] Update CreateSourcePlan to include sizes
  • [x] Update CollectionDescription to include sizes
  • [x] Update provision/start_storage_host in the Storage controller to receive the sizes
  • [x] Add a new --storage-host-sizes parameter to let Cloud define sizes for storage hosts independently of Compute replica sizes
  • [x] Add a new --default-storage-host-size parameter to let Cloud define the default size, instead of defaulting to the smallest one available
  • [x] Add storage host size map to Cloud settings
  • [x] Wire up storage host size map to Cloud deployments via CLI once it's fully released
  • [x] Allow SIZE outside of unsafe mode
  • [ ] Scope remaining work items for ALTER SOURCE support.
    • [ ] Add new statement in the SQL layer
    • [ ] Add new method in the storage controller to update/upsert storage host settings via ensure_service/reprovision
  • [ ] QA sign-off

Blockers

None

nmeagan11 avatar Jun 15 '22 18:06 nmeagan11

Folks generally seem to think that SIZE is a better name than THROUGHPUT, so I've updated the title of this epic accordingly. The first step here will be to ratify the design document with product/devex.

benesch avatar Jul 09 '22 20:07 benesch

Can we get some hard-coded default sizes until we figure this out? It's breaking production to not have limits set at all.

alex-hunt-materialize avatar Jul 25 '22 17:07 alex-hunt-materialize

> It's breaking production to not have limits set at all.

Do we run the Kube nodes with any kube/system resource allocations? We might be able to reserve some memory for system processes so that pods get killed before host-level processes; see https://kubernetes.io/docs/tasks/administer-cluster/reserve-compute-resources/

pH14 avatar Jul 25 '22 18:07 pH14

We do set kubeReserved, but only to the default values, which are likely too low. We don't set systemReserved at all.

Here's the /etc/kubernetes/kubelet/kubelet-config.json from a staging node (should be the same for prod):

```json
{
  "kind": "KubeletConfiguration",
  "apiVersion": "kubelet.config.k8s.io/v1beta1",
  "address": "0.0.0.0",
  "authentication": {
    "anonymous": {
      "enabled": false
    },
    "webhook": {
      "cacheTTL": "2m0s",
      "enabled": true
    },
    "x509": {
      "clientCAFile": "/etc/kubernetes/pki/ca.crt"
    }
  },
  "authorization": {
    "mode": "Webhook",
    "webhook": {
      "cacheAuthorizedTTL": "5m0s",
      "cacheUnauthorizedTTL": "30s"
    }
  },
  "clusterDomain": "cluster.local",
  "hairpinMode": "hairpin-veth",
  "readOnlyPort": 0,
  "cgroupDriver": "cgroupfs",
  "cgroupRoot": "/",
  "featureGates": {
    "RotateKubeletServerCertificate": true,
    "KubeletCredentialProviders": true
  },
  "protectKernelDefaults": true,
  "serializeImagePulls": false,
  "serverTLSBootstrap": true,
  "tlsCipherSuites": [
    "TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256",
    "TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256",
    "TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305",
    "TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384",
    "TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305",
    "TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384",
    "TLS_RSA_WITH_AES_256_GCM_SHA384",
    "TLS_RSA_WITH_AES_128_GCM_SHA256"
  ],
  "clusterDNS": [
    "10.200.32.10"
  ],
  "evictionHard": {
    "memory.available": "100Mi",
    "nodefs.available": "10%",
    "nodefs.inodesFree": "5%"
  },
  "kubeReserved": {
    "cpu": "150m",
    "ephemeral-storage": "1Gi",
    "memory": "2829Mi"
  },
  "maxPods": 234
}
```

Maybe I'm reading those docs wrong, but it looks like those resources are not actually reserved, just not considered in scheduling decisions. If a pod doesn't have a limit set, it can still wreak havoc on a node, well into the kubeReserved memory. We still need every pod to have a limit set so that it is the thing that dies when it starts using too much RAM.

The evictionHard parameters look like they might be useful for ensuring that a user pod gets killed first, but that 100Mi value is likely far too low.

I think we have several action items:

  1. Ensure every pod has limits set.
     a. ~~Storaged is confirmed missing this.~~ Thanks Aljosha!
     b. Some of our system pods are likely missing this too.
  2. Increase the kubeReserved cpu and memory values, as the defaults are quite low. Note that these are only for the non-pod K8S daemons on the node, such as the kubelet.
  3. Specify the systemReserved cpu and memory values, since these are not set at all currently.
  4. Increase the evictionHard memory.available value, so that things get evicted earlier in low-memory situations.
  5. Create and specify a cgroup for kubeReserved daemons.
  6. Create and specify a cgroup for systemReserved daemons.
  7. Add an admission controller to prevent pods from ever being launched without limits set.

alex-hunt-materialize avatar Jul 25 '22 22:07 alex-hunt-materialize

Removing the release blocker label because I don't think this needs to block the release of v0.27.0-alpha.4. I agree it's a P1, though!

benesch avatar Jul 26 '22 05:07 benesch

@hlburak If possible, I'd like to get 1-4 from the list of action items above into M2.

alex-hunt-materialize avatar Jul 26 '22 15:07 alex-hunt-materialize

So quick status update:

  • This whole epic is slated for M2.
  • Alessandro landed a fixed limit for storaged processes (https://github.com/MaterializeInc/materialize/pull/13881). That'll be in today's release.
  • Alessandro is actively working on the other items.

@andrioni, do you have a rough estimate for how long it'll take to finish everything but ALTER SOURCE?

benesch avatar Jul 26 '22 16:07 benesch

I'm going to make a separate epic for the follow-up actions I listed, as they aren't directly related to this SIZE parameter, and I don't want to confuse this discussion.

https://github.com/MaterializeInc/cloud/issues/3628

alex-hunt-materialize avatar Jul 26 '22 16:07 alex-hunt-materialize

Oh sorry, I understand now. Yes, that'd be great Alex, thanks!

benesch avatar Jul 26 '22 16:07 benesch

PR should be open sometime this week! I already have everything fully wired up and "working"; now it's just a matter of making it nice (i.e. not passing the size as a string all the way up to the storage controller, implementing a ComputeInstanceReplicaAllocation analogue, and addressing all the TODOs in the commit). If you want something quick and dirty, I could probably prepare a PR today for this release and do the refactors later: the SQL parsing part should be final already, and nothing else is persisted (since the source parameters are persisted via SQL).

andrioni avatar Jul 26 '22 16:07 andrioni

Eh, it was simpler than I expected: PR #13891 is open for everything before ALTER SOURCE.

andrioni avatar Jul 26 '22 19:07 andrioni

We now have defaults in place. I'm moving the priority down to P2.

uce avatar Aug 08 '22 11:08 uce

Just merged the ALTER SOURCE sizing support. This supports SET and RESET for both SIZE and REMOTE, and should automatically scale up and down in k8s as necessary.

The PR includes some basic testing, but the proper run through QA is still pending. I've got a minor follow-up to do as well: shifting some of the source rewriting into the catalog.
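
For reference, the statements this enables look roughly like the following; option spellings are illustrative, and the REMOTE address is a hypothetical self-managed storaged endpoint:

```sql
-- Resize an existing source; the storage host is reprovisioned to match.
ALTER SOURCE pg_source SET (SIZE '4-4');

-- Return to the configured default size.
ALTER SOURCE pg_source RESET (SIZE);

-- Point the source at a self-managed storaged process instead of a sized host.
ALTER SOURCE pg_source SET (REMOTE 'storaged.internal:2100');

-- Go back to managed provisioning.
ALTER SOURCE pg_source RESET (REMOTE);
```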

bkirwi avatar Aug 24 '22 19:08 bkirwi

@bkirwi, do you mind filing an issue for the follow-up?

Linking the issue for visibility: https://github.com/MaterializeInc/materialize/issues/14494

nmeagan11 avatar Aug 29 '22 15:08 nmeagan11

Should we close this issue as done, @uce?

andrioni avatar Oct 05 '22 16:10 andrioni

We are still waiting for @philip-stoev's sign-off, I believe.

nmeagan11 avatar Oct 07 '22 00:10 nmeagan11

QA sign-off blocked by https://github.com/MaterializeInc/materialize/issues/15300

philip-stoev avatar Oct 10 '22 07:10 philip-stoev

@philip-stoev are you still blocked on QA sign-off now that #15300 is closed?

nmeagan11 avatar Oct 18 '22 16:10 nmeagan11

SIZE '4-1' and SIZE '4-4' now appear to work reasonably well, so I am signing off on this epic, even though we cannot switch the entire CI to SIZE '4-4' at this time. I will track any residual bugs as separate tickets.

philip-stoev avatar Nov 01 '22 10:11 philip-stoev