[Epic] `SIZE` Parameter in `CREATE SOURCE`
Feature request
We may also want to allow an `ALTER SOURCE` statement so that users can size their source back down after, say, ingesting an initial Postgres snapshot.
The work for this epic requires discussion and agreement on the syntax/UX (by product or the surfaces team), with the storage team doing the actual implementation.
Time Horizon
Small
Work items
- [x] Ratify design document
- [x] #13881
- [x] Update parser to support `SIZE` parameter~, with the default being the smallest size available~
- [x] Update `CreateSourcePlan` to include sizes
- [x] Update `CollectionDescription` to include sizes
- [x] Update `provision`/`start_storage_host` in the Storage controller to receive the sizes
- [x] Add a new `--storage-host-sizes` parameter to let Cloud define sizes for storage hosts independently of Compute replica sizes
- [x] Add a new `--default-storage-host-size` parameter to let Cloud define the default size, instead of defaulting to the smallest one available
- [x] Add storage host size map to Cloud settings
- [x] Wire up storage host size map to Cloud deployments via CLI once it's fully released
- [x] Allow `SIZE` outside of unsafe mode
- [ ] Scope remaining work items for `ALTER SOURCE` support:
  - [ ] Add new statement in the SQL layer
  - [ ] Add new method in the storage controller to update/upsert storage host settings via `ensure_service`/`reprovision`
- [ ] QA sign-off
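As a rough sketch of the UX under discussion (the concrete syntax and size labels here are illustrative assumptions, pending the ratified design document), the parameter might look like:

```sql
-- Illustrative only: option placement, connection name, and size label
-- are assumptions, pending the ratified design document.
CREATE SOURCE pg_src
  FROM POSTGRES CONNECTION pg_conn
  WITH (SIZE = '4-1');
```

The idea is that `SIZE` selects the storage-host allocation for the source's ingestion, analogous to how Compute replica sizes work.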
Blockers
None
Folks generally seem to think that `SIZE` is a better name than `THROUGHPUT`, so I've updated the title of this epic accordingly. The first step here will be to ratify the design document with product/devex.
Can we get some hard-coded default sizes until we figure this out? It's breaking production to not have limits set at all.
> It's breaking production to not have limits set at all.
Do we run the Kube nodes with any kube/system resource reservations? We might be able to set aside some memory for system processes so that pods get killed before host-level processes, via https://kubernetes.io/docs/tasks/administer-cluster/reserve-compute-resources/
We do set `kubeReserved`, but the values are the defaults and are likely too low. We don't set `systemReserved`.
Here's the `/etc/kubernetes/kubelet/kubelet-config.json` from a staging node (should be the same for prod):

```json
{
  "kind": "KubeletConfiguration",
  "apiVersion": "kubelet.config.k8s.io/v1beta1",
  "address": "0.0.0.0",
  "authentication": {
    "anonymous": {
      "enabled": false
    },
    "webhook": {
      "cacheTTL": "2m0s",
      "enabled": true
    },
    "x509": {
      "clientCAFile": "/etc/kubernetes/pki/ca.crt"
    }
  },
  "authorization": {
    "mode": "Webhook",
    "webhook": {
      "cacheAuthorizedTTL": "5m0s",
      "cacheUnauthorizedTTL": "30s"
    }
  },
  "clusterDomain": "cluster.local",
  "hairpinMode": "hairpin-veth",
  "readOnlyPort": 0,
  "cgroupDriver": "cgroupfs",
  "cgroupRoot": "/",
  "featureGates": {
    "RotateKubeletServerCertificate": true,
    "KubeletCredentialProviders": true
  },
  "protectKernelDefaults": true,
  "serializeImagePulls": false,
  "serverTLSBootstrap": true,
  "tlsCipherSuites": [
    "TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256",
    "TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256",
    "TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305",
    "TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384",
    "TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305",
    "TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384",
    "TLS_RSA_WITH_AES_256_GCM_SHA384",
    "TLS_RSA_WITH_AES_128_GCM_SHA256"
  ],
  "clusterDNS": [
    "10.200.32.10"
  ],
  "evictionHard": {
    "memory.available": "100Mi",
    "nodefs.available": "10%",
    "nodefs.inodesFree": "5%"
  },
  "kubeReserved": {
    "cpu": "150m",
    "ephemeral-storage": "1Gi",
    "memory": "2829Mi"
  },
  "maxPods": 234
}
```
Maybe I'm reading those docs wrong, but it looks like those resources are not actually reserved, just excluded from scheduling decisions. If a pod doesn't have any limit set, it can still wreak havoc on a node, well into the `kubeReserved` memory. We still need every pod to have a limit set so that it is the thing that dies when it starts using too much RAM.
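For reference, making a pod the OOM-kill target means giving each of its containers an explicit memory limit, along these lines (the pod name, image, and values are illustrative, not taken from our actual manifests):

```yaml
# Illustrative pod spec fragment; names and values are made up.
apiVersion: v1
kind: Pod
metadata:
  name: storaged-example
spec:
  containers:
    - name: storaged
      image: example/storaged:latest
      resources:
        requests:
          memory: "4Gi"
          cpu: "1"
        limits:
          # Exceeding this gets the container OOM-killed, instead of
          # pushing the node itself into memory pressure.
          memory: "4Gi"
```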
The `evictionHard` parameters look like they might be useful for ensuring that a user pod gets killed first, but that 100Mi value is likely way too low.
I think we have several action items:
1. Ensure every pod has limits set.
   a. ~Storaged is confirmed missing this.~ Thanks Aljosha!
   b. Some of our system pods are likely missing this too.
2. Increase the `kubeReserved` `cpu` and `memory` values, as the defaults are quite low. Note that these are only for the non-pod K8s daemons on the node, such as the kubelet.
3. Specify the `systemReserved` `cpu` and `memory` values, since these are not set at all currently.
4. Increase the `evictionHard` `memory.available` value, so that things get evicted earlier in low-memory situations.
5. Create and specify a cgroup for `kubeReserved` daemons.
6. Create and specify a cgroup for `systemReserved` daemons.
7. Add an admission controller to prevent pods from ever being launched without limits set.
Removing the release blocker label because I don't think this needs to block the release of v0.27.0-alpha.4. I agree it's a P1, though!
@hlburak If possible, I'd like to get 1-4 from the list of action items above into M2.
So, quick status update:

- This whole epic is slated for M2.
- Alessandro landed a fixed limit for `storaged` processes in https://github.com/MaterializeInc/materialize/pull/13881. That'll be in today's release.
- Alessandro is actively working on the other items.
@andrioni, do you have a rough estimate for how long it'll take to do everything but `ALTER SOURCE`?
I'm going to make a separate epic for the follow-up actions I listed, as they aren't directly related to this `SIZE` parameter, and I don't want to confuse this discussion.
https://github.com/MaterializeInc/cloud/issues/3628
Oh sorry, I understand now. Yes, that'd be great Alex, thanks!
PR should be open sometime this week! I already have everything fully wired up here and "working"; now it's just a matter of making it nice (i.e. not passing the size as a string all the way up to the storage controller, implementing a `ComputeInstanceReplicaAllocation` analogue, addressing all the `TODO`s in the commit). If you want something quick and dirty, I could probably prepare a PR today for this release and then do the refactors later, as the SQL parsing part should be final already, and nothing else is persisted (since the `Source` parameters are persisted via SQL).
Eh, it was simpler than I expected: PR #13891 is open for everything before `ALTER SOURCE`.
We now have defaults in place. I'm moving the priority down to P2.
Just merged the `ALTER SOURCE` `SIZE` support. This supports `SET` and `RESET` for both `SIZE` and `REMOTE`, and should automatically scale up and down in k8s as necessary.
The PR includes some basic testing, but the proper run through QA is still pending. I've got a minor follow-up to do as well: shifting some of the source rewriting into the catalog.
@bkirwi, do you mind filing an issue for the follow-up?
Linking the issue for visibility: https://github.com/MaterializeInc/materialize/issues/14494
Should we close this issue as done, @uce?
We are still waiting for @philip-stoev's sign-off, I believe.
QA sign-off blocked by https://github.com/MaterializeInc/materialize/issues/15300
@philip-stoev are you still blocked on QA sign-off now that #15300 is closed?
`SIZE '4-1'` and `SIZE '4-4'` now appear to work reasonably well, so I am signing off on this epic, even though we cannot switch the entire CI to `SIZE '4-4'` at this time. I will track any residual bugs as separate tickets.