
Stuck network when using multiple connections with k3s

JanPokorny opened this issue 6 months ago • 8 comments (status: Open)

Summary

When running a non-trivial web app in a Lima VM using the k3s template, the network stops responding. This seems to happen when multiple persistent HTTP connections are established. With an app that loads many resources, this can be triggered simply by opening it in two tabs at once.

Reproduction

  1. Prepare these files (somewhere under your ~):

Dockerfile

FROM python:3.11-slim

ARG CSS_COUNT=1000
ENV CSS_COUNT=${CSS_COUNT}

WORKDIR /app

RUN cat <<EOF >main.py
from fastapi import FastAPI
from fastapi.staticfiles import StaticFiles

class NoCacheStaticFiles(StaticFiles):
    async def get_response(self, path: str, scope):
        response = await super().get_response(path, scope)
        response.headers["Cache-Control"] = "no-cache, no-store, must-revalidate"
        response.headers["Pragma"] = "no-cache"
        response.headers["Expires"] = "0"
        return response

app = FastAPI()
app.mount("/", NoCacheStaticFiles(directory="static", html=True), name="static")
EOF

RUN mkdir static

RUN echo '<!DOCTYPE html>' > static/index.html && \
    echo '<html><head><title>Test Lima Port Bug</title>' >> static/index.html && \
    for i in $(seq 0 $((${CSS_COUNT} - 1))); do \
        echo "  <link rel=\"stylesheet\" href=\"$i.css\">" >> static/index.html; \
    done && \
    echo '</head><body><h1>Hello from FastAPI static site!</h1></body></html>' >> static/index.html

RUN for i in $(seq 0 $((${CSS_COUNT} - 1))); do \
        echo -n "/* " > static/$i.css && \
        head -c 500000 /dev/urandom | tr -dc 'a-zA-Z0-9' | fold -w 80 | head -n 100 >> static/$i.css && \
        echo " */" >> static/$i.css; \
    done

RUN pip install fastapi uvicorn

CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]

k8s.yaml

apiVersion: v1
kind: Service
metadata:
  name: repro-service
spec:
  type: NodePort
  selector:
    app: repro
  ports:
    - port: 8000
      nodePort: 31833
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-repro
spec:
  replicas: 1
  selector:
    matchLabels:
      app: repro
  template:
    metadata:
      labels:
        app: repro
    spec:
      containers:
        - name: repro
          image: repro:local
          ports:
            - containerPort: 8000

  2. Build the image and start the server (this needs a local docker CLI to build the image; I use Colima, but it probably does not matter):
docker build -t repro:local .
docker save repro:local >repro.tar
limactl --tty=false start template://k3s --name=repro --mount=~
limactl --tty=false shell repro -- sudo ctr images import repro.tar
limactl --tty=false shell repro -- kubectl apply -f k8s.yaml
  3. Open http://localhost:31833/ in a browser. It will load just fine the first time. Leave the tab open and open http://localhost:31833/ in another tab. Repeat and open a few tabs like this until you notice that the page stops loading, stuck in a "pending" state. The stuck pages will sometimes load eventually after 20 s or so; some will time out. (A browser-free way to hold many keep-alive connections is sketched below.)
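
If you want to trigger this without a browser, a minimal sketch along these lines should work (not part of the original report; the host, port, and connection count are assumptions). It holds a number of keep-alive connections open and then probes whether a fresh connection still gets through the forwarded NodePort.

#!/usr/bin/env python3
# Hypothetical browser-free probe (not from the original report).
# Hold several keep-alive connections open, then check whether a brand-new
# connection still gets through the forwarded NodePort (31833 from k8s.yaml).
import socket
import sys

HOST, PORT = "127.0.0.1", 31833
HELD = 30  # assumed count; roughly what a few browser tabs would hold open

def open_keepalive(i: int) -> socket.socket:
    s = socket.create_connection((HOST, PORT), timeout=5)
    req = f"GET /{i}.css HTTP/1.1\r\nHost: localhost\r\nConnection: keep-alive\r\n\r\n"
    s.sendall(req.encode())
    s.recv(65536)  # read (part of) the response, then keep the socket open
    return s

held = [open_keepalive(i) for i in range(HELD)]
print(f"Holding {len(held)} keep-alive connections; trying one more...")

try:
    probe = socket.create_connection((HOST, PORT), timeout=10)
    probe.sendall(b"GET / HTTP/1.1\r\nHost: localhost\r\nConnection: close\r\n\r\n")
    status_line = probe.recv(65536).split(b"\r\n", 1)[0].decode()
    print("New connection OK:", status_line)
except OSError as exc:
    print("New connection failed or hung:", exc)
    sys.exit(1)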

What I discovered

It appears that the new tabs are unable to establish TCP connections to the application. This seems to be related to the number of existing persistent HTTP connections (Connection: keep-alive). When I force Connection: close in the server, the problem disappears. That is also why the bug only manifests when multiple tabs are open: browsers hold separate connections per tab. The huge number of dummy CSS files is there just to force the browser to open its maximum number of connections.
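
For reference, a minimal sketch of what forcing Connection: close looks like on the server side (a hypothetical variant of main.py from the Dockerfile above, with the no-cache wrapper omitted for brevity):

# Hypothetical variant of main.py: ask clients not to reuse connections.
from fastapi import FastAPI, Request
from fastapi.staticfiles import StaticFiles

app = FastAPI()

@app.middleware("http")
async def force_connection_close(request: Request, call_next):
    response = await call_next(request)
    # Signal that the TCP connection should be closed after this response.
    response.headers["Connection"] = "close"
    return response

app.mount("/", StaticFiles(directory="static", html=True), name="static")

With something like this, every request rides its own short-lived connection, which matches the observation above that the hang then goes away.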

Notably, this bug does not happen when using kubectl port-forward instead of the NodePort, so the problem lies somewhere in the Lima networking stack, or perhaps in the way it interacts with the k3s networking stack.

Versions

  • macOS 15.5 (24F74)
  • limactl version 1.1.1 (installed from Homebrew)
  • happens on both VZ and QEMU VMs

JanPokorny · Jun 03 '25 13:06

I tried different versions of Lima, and it appears that this is a regression between 1.0.7 and 1.1.0.

JanPokorny · Jun 03 '25 16:06

I initially had some trouble reproducing it; I was opening lots of tabs with http://localhost:31833 quickly, and they all worked. But when I waited a bit between opening tabs, I would eventually see the failure.

The corresponding error in the hostagent log is:

{"error":"close tcp 127.0.0.1:6443-\u003e127.0.0.1:55606: shutdown: socket is not connected","level":"debug","msg":"failed to call CloseRead","time":"2025-06-03T10:18:56-07:00"}

It seems to be a problem with the gRPC port forwarder. When I disabled it, I could no longer reproduce the issue:

export LIMA_SSH_PORT_FORWARDER=true

It needs to be set before you start the instance. Could you please try and confirm that this "fixes" the issue for you as well?
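
For the repro above that would be, for example (reusing the start command from the reproduction steps; stop the instance first if it is already running):

export LIMA_SSH_PORT_FORWARDER=true
limactl --tty=false start template://k3s --name=repro --mount=~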

jandubois · Jun 03 '25 19:06

Thanks for the bug report. I think the problem is that the connection is not reused when keep-alive comes into play; we still dial a new connection.

I will check on this and try to fix it.

balajiv113 · Jun 04 '25 03:06

@jandubois Yes, LIMA_SSH_PORT_FORWARDER=true works. Thank you for providing the workaround!

JanPokorny · Jun 04 '25 09:06

It seems like I hit the same problem, with heavy use of S3 connections from Chrome to MinIO.

mabels · Jun 11 '25 15:06

@mabels And does switching to the SSH port forwarder fix things for you as well?

jandubois · Jun 11 '25 16:06

It does


mabels · Jun 12 '25 13:06

I've added the "priority/high" label to this issue. I think we need to either find a fix for it for 1.1.2, or revert the default back to SSH.

@balajiv113 Do you expect to have time to look into this, or would you rather revert the default first, to have more time?

jandubois · Jun 12 '25 18:06

> Connection: keep-alive

Is it possible to reproduce this issue without using k3s?

AkihiroSuda · Jun 30 '25 07:06

I have the problem without k3s as well: just a service on the VM, in my case a Docker container running MinIO.

mabels · Jun 30 '25 07:06

I think I found a minimal repro of this issue:

lima python3 -m http.server
telnet localhost 8000

"Connection closed by foreign host." appears after pressing the RET key:

  • SSH: once
  • gRPC: 3 times

The cause seems to be that the gRPC portfwd does not implement TCP half-close.
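
For illustration only (Lima's forwarder is written in Go, so this is not its actual code), the behaviour a TCP proxy needs for half-close is: when one side stops sending, shut down only the write direction towards the other side instead of tearing down the whole connection. A Python sketch:

# Illustrative half-close handling in a bidirectional TCP relay.
import socket
import threading

def pump(src: socket.socket, dst: socket.socket) -> None:
    """Copy bytes src -> dst; on EOF propagate a half-close instead of closing."""
    while True:
        data = src.recv(65536)
        if not data:
            # src finished sending: shut down only our write side towards dst,
            # so dst can keep sending data back through the other pump.
            try:
                dst.shutdown(socket.SHUT_WR)
            except OSError:
                pass
            return
        dst.sendall(data)

def relay(client: socket.socket, upstream: socket.socket) -> None:
    a = threading.Thread(target=pump, args=(client, upstream))
    b = threading.Thread(target=pump, args=(upstream, client))
    a.start(); b.start()
    a.join(); b.join()   # only after both directions have hit EOF...
    client.close()       # ...fully close the sockets
    upstream.close()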

AkihiroSuda · Jul 03 '25 08:07

Implementing TCP half-close may require non-trivial changes to the TunnelMessage messages

> I think we need to either find a fix for it for 1.1.2, or revert the default back to SSH.

If we are going to revert the default back to SSH again, we should probably never promote gRPC to the default again. The default mode table already looks quite clumsy: https://github.com/lima-vm/lima/blob/53d718628f519dc6702f99473c5de343ac46ce62/website/content/en/docs/config/port.md?plain=1#L14-L22

AkihiroSuda · Jul 03 '25 09:07

Hi,

I don't think it is about half-open connections: my error happens over HTTP, and the HTTP spec does not allow one-sided shutdowns. Besides this, I would assume that the problem is something with TCP_DELAY.

If SSH can do it, don't give up.

Meno


mabels · Jul 03 '25 16:07

We still have:

  • https://github.com/lima-vm/lima/issues/3685

AkihiroSuda · Jul 04 '25 06:07