Improve Instance Startup Time
### Description
When using FF Cloud, the startup time for instances is slow, too slow. We've hidden this well with the onboarding, but it would be useful to improve this during the general instance spin up too.
Made this an Epic as I'm sure there are multiple angles we could take here, but it needs to be recorded and investigated.
### Which customers would this be available to
Everyone - CE/Starter/Team/Enterprise
### Related
- [ ] https://github.com/FlowFuse/CloudProject/issues/369
- [ ] https://github.com/FlowFuse/flowfuse/issues/3058
The same goes for installing packages. I know this is a CPU limitation, e.g. on "Small" instances, but I want to at least ask the question.
We did think about using a sidecar pod without the CPU limitations to do the initial install of packages, but the k8s scheduler takes the highest value of all containers in a pod to determine the resources needed and to decide how many pods can be scheduled on a given node.
This would vastly reduce the number we can pack on to a node and impact margins.
> We did think about using a sidecar pod without the CPU limitations to do the initial install of packages, but the k8s scheduler takes the highest value of all containers in a pod to determine the resources needed and to decide how many pods can be scheduled on a given node.
> This would vastly reduce the number we can pack on to a node and impact margins.
That only holds if you apply limits to the init container too. If we don't set any CPU limits on the init container, it will not affect the effective CPU limits for the whole pod. To verify this, I created the following pod:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cpu-limits-stress-v6
spec:
  nodeSelector:
    kubernetes.io/hostname: ip-192-168-23-15.eu-west-1.compute.internal
  initContainers:
    - name: pod-cpu-init-stressor
      image: narmidm/k8s-pod-cpu-stressor:1.0.0
      args:
        - "-cpu=1"
        - "-duration=600s"
  containers:
    - name: pod-cpu-stressor
      image: narmidm/k8s-pod-cpu-stressor:1.0.0
      args:
        - "-cpu=1"
        - "-duration=600s"
      resources:
        requests:
          cpu: 100m
        limits:
          cpu: 200m
```
In short, the init container is created without CPU limits and uses 1 vCPU for 10 minutes. The regular container uses the same CPU stress parameters, but has a 0.1 vCPU soft limit (request) and a 0.2 vCPU hard limit set. The CPU usage over one lifecycle of the pod is as follows:
As we can see, the init container used as much CPU as it could, while the regular container was limited by its resource constraints. Also, since the init container does not have any soft limits (requests), it is not taken into account during pod scheduling.
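To make the scheduling point concrete, here is a minimal Python sketch of how the effective CPU request of a pod is derived: the larger of the biggest init container request and the sum of the regular container requests. The function names and the 4000m node size are hypothetical, chosen only for illustration, and the sketch ignores restartable (sidecar-style) init containers. Note that a container with a limit but no explicit request gets a request equal to its limit, which is why a limited init container would affect packing.

```python
def effective_cpu_request(init_requests_m, app_requests_m):
    """Effective pod CPU request in millicores.

    The scheduler uses the larger of:
      - the biggest single init container request, and
      - the sum of all regular (app) container requests.
    A container without a request contributes 0.
    """
    largest_init = max(init_requests_m, default=0)
    return max(largest_init, sum(app_requests_m))


def pods_per_node(node_allocatable_m, pod_request_m):
    """How many pods fit on a node when only CPU requests are considered."""
    return node_allocatable_m // pod_request_m


# The test pod above: init container with no request, app container at 100m.
unbounded_init = effective_cpu_request([0], [100])

# Hypothetical variant: the init container is given a 1000m request.
requested_init = effective_cpu_request([1000], [100])

NODE_ALLOCATABLE_M = 4000  # hypothetical node size, for illustration only

print(unbounded_init, pods_per_node(NODE_ALLOCATABLE_M, unbounded_init))
print(requested_init, pods_per_node(NODE_ALLOCATABLE_M, requested_init))
```

With no request on the init container, the pod is scheduled on the 100m of its regular container, so packing density is unchanged. Giving the init container a request (or a limit without a request) is what would reduce the number of pods per node.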
Instance start time is still an issue: only 50% of users open the Editor at all, and most then leave and never come back. Reviewing the session recordings, in the vast majority of cases it's because the instance is still "installing".
A couple of examples:
- https://eu.posthog.com/project/2209/replay/0190bcd2-4b6d-7779-84c7-6f50b793617c
- https://eu.posthog.com/project/2209/replay/0190a54c-5a80-7134-a3e2-3dcca72b60c2
Proposal by @ZJvandeWeg: what can we remove so that it starts up quicker? Debugger? Linter?
We can do some testing to see what impact removing the debugger/linter makes. Ultimately, the resources available to Small instance types are going to keep them slow to start. We have had various discussions on technical approaches to speeding it up. Reading back on the last comments from @hardillb and @ppawlowski, I'm not sure whether the init container approach is viable, so I would appreciate their input here.
In Staging I have created a Template without linter/debugger so we can see the difference that makes.
- Small NRv4 instance with the default template (including linter/debugger):
  - 61s, of which the npm install took 37s
- Small NRv4 instance without linter/debugger:
  - 38s, of which the npm install took 12s
This was a one-off test, so there could have been other factors influencing the times, but I think it is reasonably indicative of the saving to be had.
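As a quick sanity check on those numbers, the sketch below splits each run into npm time and everything else. It uses only the four timings reported above; the dictionary keys and variable names are just labels for this example, and a single run per configuration is not enough to treat the percentages as precise.

```python
# Timings in seconds from the one-off Staging test above.
runs = {
    "default template": {"total_s": 61, "npm_s": 37},
    "without linter/debugger": {"total_s": 38, "npm_s": 12},
}

base = runs["default template"]
slim = runs["without linter/debugger"]

total_saved_s = base["total_s"] - slim["total_s"]
npm_saved_s = base["npm_s"] - slim["npm_s"]

# Startup time that is not spent in npm install.
other_base_s = base["total_s"] - base["npm_s"]
other_slim_s = slim["total_s"] - slim["npm_s"]

total_saved_pct = round(100 * total_saved_s / base["total_s"], 1)

print(f"total saved: {total_saved_s}s ({total_saved_pct}%)")
print(f"npm saved:   {npm_saved_s}s")
print(f"non-npm time: {other_base_s}s vs {other_slim_s}s")
```

The non-npm portion is roughly the same in both runs (24s vs 26s), which suggests the whole saving comes from the shorter npm install, and that the remaining 12s of npm time is the next thing worth explaining.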
Evidence enough for me - let's ship it 🛥
Wanted to make an issue for this, but not sure which repo it'd live in? CloudProject? Or is it an nr-launcher change?
Removing them from the template would be a CloudProject issue
Thanks Ben - opened & assigned
We did already have an open item to do that, but hadn't committed to doing it - have closed the old item off to avoid confusion.
I'm curious to know, for a 'clean' start of a new instance, what npm is actually doing for those 12 seconds - ideally the stack should contain everything needed for the base image. I suspect that will be things like project nodes and assistant modules being added in. If we can confirm that, then we should look at updating the base images to preinstall them.
> I'm curious to know, for a 'clean' start of a new instance, what npm is actually doing for those 12 seconds - ideally the stack should contain everything needed for the base image
I actually just asked the same question here
> If we can confirm that, then we should look at updating the base images to preinstall them.
Does that not just move the time to the base image being installed instead, or is that generally expected to be faster?
> Does that not just move the time to the base image being installed instead, or is that generally expected to be faster?
The base image is the pre-built docker container - we run the npm install when we build and release the container, not when the user is starting a new instance.
npm shouldn't be doing anything as it's run in the /data directory and the dependencies should be empty.
In fact, looking at the code in nr-launcher, it shouldn't even run npm if there are no dependencies:
https://github.com/FlowFuse/nr-launcher/blob/08b4ad0ce251b0edd06eca160e994d3bdd0512af/lib/launcher.js#L220
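One way to investigate what npm is doing on a clean start is to inspect the `package.json` in the instance's data directory and see whether any dependencies are declared. The Python sketch below illustrates that check only. It is not nr-launcher's code (which is JavaScript), the function names are hypothetical, and the module name in the demo is made up.

```python
import json
import tempfile
from pathlib import Path


def declared_dependencies(data_dir):
    """Return the dependencies declared in <data_dir>/package.json, or {} if none."""
    package_json = Path(data_dir) / "package.json"
    if not package_json.is_file():
        return {}
    with package_json.open(encoding="utf-8") as fh:
        manifest = json.load(fh)
    return manifest.get("dependencies") or {}


def needs_npm_install(data_dir):
    """An install is only worth running when at least one dependency is declared."""
    return bool(declared_dependencies(data_dir))


# Demo against a temporary directory standing in for /data.
with tempfile.TemporaryDirectory() as tmp:
    empty_needs_install = needs_npm_install(tmp)

    # Hypothetical module name, for illustration only.
    manifest = {"dependencies": {"hypothetical-node": "1.0.0"}}
    (Path(tmp) / "package.json").write_text(json.dumps(manifest), encoding="utf-8")
    populated_needs_install = needs_npm_install(tmp)

print(empty_needs_install, populated_needs_install)
```

If a freshly created instance reports a non-empty dependency list here, those entries are candidates for preinstalling in the base image.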