Improve Instance Startup Time

Open joepavitt opened this issue 1 year ago • 15 comments

Description

When using FF Cloud, the startup time for instances is slow, too slow. We've hidden this well with the onboarding, but it would be useful to improve this during the general instance spin up too.

Made this an Epic as I'm sure there are multiple angles we could go here, but it needs to be recorded and investigated.

Which customers would this be available to

Everyone - CE/Starter/Team/Enterprise

### Related
- [ ] https://github.com/FlowFuse/CloudProject/issues/369
- [ ] https://github.com/FlowFuse/flowfuse/issues/3058

joepavitt avatar Feb 29 '24 15:02 joepavitt

Same goes for installation of packages too, I know this is a CPU limitation e.g. on "Small" instances, but want to at least ask the question

joepavitt avatar Mar 05 '24 11:03 joepavitt

We did think about using a sidecar pod without the CPU limitations to do the initial install of packages, but the k8s scheduler takes the highest value of all containers in a pod to determine the resources needed and to decide how many pods can be scheduled on a given node.

This would vastly reduce the number we can pack on to a node and impact margins.

hardillb avatar Mar 07 '24 09:03 hardillb

> We did think about using a sidecar pod without the CPU limitations to do the initial install of packages, but the k8s scheduler takes the highest value of all containers in a pod to determine the resources needed and to decide how many pods can be scheduled on a given node.
>
> This would vastly reduce the number we can pack on to a node and impact margins.

That is only valid if you apply limits to the init container too. If we don't set any CPU limits on the init container, it will not affect the effective CPU limits for the whole pod. Just to be sure about the above, I created the following pod:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cpu-limits-stress-v6
spec:
  nodeSelector:
    kubernetes.io/hostname: ip-192-168-23-15.eu-west-1.compute.internal
  initContainers:
    - name: pod-cpu-init-stressor
      image: narmidm/k8s-pod-cpu-stressor:1.0.0
      args:
        - "-cpu=1"
        - "-duration=600s"
  containers:
    - name: pod-cpu-stressor
      image: narmidm/k8s-pod-cpu-stressor:1.0.0
      args:
        - "-cpu=1"
        - "-duration=600s"
      resources:
        requests:
          cpu: 100m
        limits:
          cpu: 200m
```

In short, the init container is created without CPU limits and uses 1 vCPU for 10 minutes. The regular container uses the same CPU stress parameters, but it has a 0.1 CPU request (soft limit) and a 0.2 CPU limit (hard limit) set. The CPU usage over one lifecycle of the pod is as follows:

[Image: Screenshot 2024-03-14 at 19 25 12]

As we can see, the init container used as much CPU as it could, while the regular one was limited by its resource constraints. Also, since the init container does not have any requests (soft limits) set, it is not taken into account during pod scheduling.
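To put numbers on the scheduling point, here is a minimal Python sketch of the rule described in the Kubernetes init container documentation: the effective request of a pod is the larger of the highest init container request and the sum of the app container requests. The function name and the 1000m comparison case are assumptions made for illustration; pod overhead is ignored.

```python
def effective_cpu_request_m(init_requests_m, app_requests_m):
    """Effective pod CPU request in millicores.

    The scheduler uses the larger of:
      (a) the highest request of any single init container, and
      (b) the sum of the requests of all app containers.
    A container contributes 0 only if it sets neither a request nor a limit,
    because a limit without a request is used as the request.
    """
    highest_init = max(init_requests_m, default=0)
    app_total = sum(app_requests_m)
    return max(highest_init, app_total)


# Test pod from this comment: init container sets nothing, app container requests 100m.
test_pod = effective_cpu_request_m(init_requests_m=[0], app_requests_m=[100])

# Hypothetical comparison: the same pod if the init container requested a full vCPU.
with_init_request = effective_cpu_request_m(init_requests_m=[1000], app_requests_m=[100])

print(test_pod)           # 100
print(with_init_request)  # 1000
```

Under these assumptions the unconstrained init container leaves the scheduled footprint at 100m, whereas giving it a 1 vCPU request would make each pod count as 1000m for packing purposes.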

ppawlowski avatar Mar 14 '24 18:03 ppawlowski

Instance start time is still an issue. Only 50% of users are opening the Editor at all; most then leave and never come back. Upon reviewing the session recordings, in the vast majority of cases it's because the instance is still "installing".

Couple of examples:

  • https://eu.posthog.com/project/2209/replay/0190bcd2-4b6d-7779-84c7-6f50b793617c
  • https://eu.posthog.com/project/2209/replay/0190a54c-5a80-7134-a3e2-3dcca72b60c2

Proposal by @ZJvandeWeg: what can we remove so that it starts up quicker? Debugger? Linter?

joepavitt avatar Jul 17 '24 13:07 joepavitt

We can do some testing to see what impact removing the debugger/linter makes. Ultimately, the limited resources available to Small instance types are going to keep them slow to start. We have had various discussions on technical approaches to speeding it up. Reading back on the last comments from @hardillb and @ppawlowski, I'm not sure if the init container approach is viable or not, so would appreciate their input here.

knolleary avatar Jul 17 '24 15:07 knolleary

In Staging I have created a Template without linter/debugger so we can see the difference that makes.

  • Small NRv4 instance with the default template (including linter/debugger):
    • 61s - of which npm install took 37s
  • Small NRv4 instance without linter/debugger:
    • 38s - of which the npm install took 12s

This was a one-off test, so there could have been other factors influencing the times, but I think it is reasonably indicative of the saving to be had.
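For reference, the arithmetic behind those two measurements can be laid out as a short Python sketch. It only uses the four timings quoted above; the variable names are illustrative.

```python
# Timings reported above for a Small NRv4 instance (one-off test, in seconds).
default_total, default_npm = 61, 37   # default template, with linter/debugger
trimmed_total, trimmed_npm = 38, 12   # template without linter/debugger

total_saved = default_total - trimmed_total                     # overall start-up saving
npm_saved = default_npm - trimmed_npm                           # saving inside npm install
total_saved_pct = round(100 * total_saved / default_total, 1)   # saving as % of original start

# Time not attributed to npm install in each run.
default_other = default_total - default_npm
trimmed_other = trimmed_total - trimmed_npm

print(total_saved, npm_saved, total_saved_pct)  # 23 25 37.7
print(default_other, trimmed_other)             # 24 26
```

The non-npm portion is roughly the same in both runs (24s vs 26s), which suggests the saving comes almost entirely from the shorter npm install.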

knolleary avatar Jul 17 '24 16:07 knolleary

Evidence enough for me - let's ship it 🛥

joepavitt avatar Jul 17 '24 16:07 joepavitt

Wanted to make an issue for this, but not sure which repo it'd live in? CloudProject? Or is it an nr-launcher change?

joepavitt avatar Jul 17 '24 18:07 joepavitt

Removing them from the template would be a CloudProject issue

hardillb avatar Jul 17 '24 18:07 hardillb

Thanks Ben - opened & assigned

joepavitt avatar Jul 18 '24 07:07 joepavitt

We did already have an open item to do that, but hadn't committed to doing it - have closed the old item off to avoid confusion.

I'm curious to know, for a 'clean' start of a new instance, what npm is actually doing for those 12 seconds - ideally the stack should contain everything needed for the base image. I suspect that will be things like project nodes and assistant modules being added in. If we can confirm that, then we should look at updating the base images to preinstall them.

knolleary avatar Jul 18 '24 08:07 knolleary

> I'm curious to know, for a 'clean' start of a new instance, what npm is actually doing for those 12 seconds - ideally the stack should contain everything needed for the base image

I actually just asked the same question here

joepavitt avatar Jul 18 '24 08:07 joepavitt

> If we can confirm that, then we should look at updating the base images to preinstall them.

Does that not just move the time to the base image being installed instead, or is that generally expected to be faster?

joepavitt avatar Jul 18 '24 08:07 joepavitt

> Does that not just move the time to the base image being installed instead, or is that generally expected to be faster?

The base image is the pre-built docker container - we run the npm install when we build and release the container, not when the user is starting a new instance.

knolleary avatar Jul 18 '24 08:07 knolleary

npm shouldn't be doing anything as it's run in the /data directory and the dependencies should be empty.

In fact, looking at the code in nr-launcher, it shouldn't even run npm if there are no dependencies:

https://github.com/FlowFuse/nr-launcher/blob/08b4ad0ce251b0edd06eca160e994d3bdd0512af/lib/launcher.js#L220
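The kind of guard being described can be sketched as follows. This is a hypothetical Python illustration only: nr-launcher itself is JavaScript, this is not a port of the linked code, and the function name `needs_npm_install` and the treatment of a missing `package.json` are assumptions.

```python
import json
import tempfile
from pathlib import Path


def needs_npm_install(data_dir):
    """Return True only if <data_dir>/package.json lists at least one dependency."""
    package_file = Path(data_dir) / "package.json"
    if not package_file.exists():
        # Assumption: no package.json means there is nothing to install.
        return False
    package = json.loads(package_file.read_text(encoding="utf-8"))
    # A missing or empty "dependencies" object is falsy, so npm would be skipped.
    return bool(package.get("dependencies"))


with tempfile.TemporaryDirectory() as data_dir:
    package_file = Path(data_dir) / "package.json"

    # Clean instance: empty dependencies, so no npm run is expected.
    package_file.write_text(json.dumps({"name": "demo", "dependencies": {}}))
    print(needs_npm_install(data_dir))  # False

    # Instance with one (made-up) dependency: an npm run is expected.
    package_file.write_text(
        json.dumps({"name": "demo", "dependencies": {"example-node": "1.0.0"}})
    )
    print(needs_npm_install(data_dir))  # True
```

If a guard like this is in place and a clean instance still spends 12 seconds in npm, the dependency list in `/data` is presumably not empty at first start.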

hardillb avatar Jul 18 '24 09:07 hardillb