Proxy exhausts all ports and crashes JupyterHub

cboettig opened this issue 3 months ago • 8 comments

We are deploying jupyter-ai on a JupyterHub server at UC Berkeley that supports 120 students. Our datahub team is consistently seeing unexpected behavior that appears linked to jupyter-ai's handling of proxy connections: available ports are slowly exhausted until the entire hub crashes.

I'll try to provide as much detail as I can, but the issue is not straightforward to reproduce. I'd be happy to follow up offline or connect you with the datahub team at Berkeley more directly, since the detailed logs come from student activity that we can't post here.

The datahub team describes the issue as follows:

From the logs, we observed that after a user server is culled (i.e., the route /user/ is removed by both the proxy and the hub), the proxy continues to attempt connections to endpoints like /user//api/ai and /user//api/collaboration/room. This behavior can be clearly seen in the logs I provided.

Because these routes are stale, the proxy repeatedly tries and fails to reach them, opening new ephemeral ports with each attempt. Over time, this leads to exhaustion of available ephemeral ports. This issue appears to be isolated to this particular hub; we have not seen similar behavior in other hubs.

When the proxy runs out of ephemeral ports, it cannot establish any new connections — which directly caused the outages we've observed over the past week. The crashes experienced by students during your class are a side effect of this issue. With thousands of ephemeral ports in use, the proxy consumes a significant amount of memory and eventually reaches a point where it can no longer handle the connection load.
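A quick way to confirm this kind of exhaustion on the proxy host is to count TCP connections by state and watch the totals climb; here's a minimal diagnostic sketch (assuming psutil is available in the proxy container, and that you have permission to inspect other processes' sockets):

```python
# Diagnostic sketch: count TCP connections grouped by state (ESTABLISHED,
# TIME_WAIT, ...) to see whether the proxy is burning through ephemeral ports.
from collections import Counter

import psutil


def connection_states():
    """Return a Counter of TCP connection states on this host."""
    return Counter(conn.status for conn in psutil.net_connections(kind="tcp"))


if __name__ == "__main__":
    for state, count in connection_states().most_common():
        print(f"{state:15s} {count}")
```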

Clearly this is a significant challenge for deploying jupyter-ai; it's fine for us to disable it for the time being, but I wanted to share the report. There are probably further specifics you'll need to track this down, so please let me know if there's anything our team can do to help you (and us) figure out what is going wrong here. Appreciate all you do.

cc @balajialg (UC Berkeley DataHub)

cboettig avatar Sep 18 '25 18:09 cboettig

Thanks @cboettig! Cc'ing @yijunge-ucb and @felder since they are closely tracking this issue across different threads.

balajialg avatar Sep 18 '25 18:09 balajialg

Subscribed (I don't have this deployed, but I've been asked about it previously).

shaneknapp avatar Sep 18 '25 18:09 shaneknapp

Is the proxy JupyterHub's configurable-http-proxy, or is this some other proxy?

Are you familiar with Playwright? I wonder if we could use it to automate multiple independent JupyterHub singleuser-server sessions and reproduce the CHP port exhaustion independently of your deployment. If the problem is exacerbated by jupyter-ai, that might make it easier to reproduce!
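Something like this rough, untested Playwright (Python) sketch is what I have in mind; the hub URL, test users, API token, and login selectors below are all placeholders for a throwaway hub using DummyAuthenticator:

```python
# Rough reproduction sketch (untested). Assumes a disposable JupyterHub with
# DummyAuthenticator at HUB_URL and an admin API token in HUB_TOKEN.
import os
import time

import requests
from playwright.sync_api import sync_playwright

HUB_URL = os.environ.get("HUB_URL", "http://127.0.0.1:8000")
HUB_TOKEN = os.environ["HUB_TOKEN"]
USERS = [f"testuser{i}" for i in range(5)]


def login_and_open_lab(context, user):
    """Log in via DummyAuthenticator, spawn the server, and leave Lab open."""
    page = context.new_page()
    page.goto(f"{HUB_URL}/hub/login")
    page.fill('input[name="username"]', user)
    page.fill('input[name="password"]', "password")  # placeholder credentials
    page.click('input[type="submit"]')
    page.goto(f"{HUB_URL}/user/{user}/lab")
    return page


def stop_server(user):
    """Cull the user's server via the Hub REST API while the Lab tab stays open."""
    requests.delete(
        f"{HUB_URL}/hub/api/users/{user}/server",
        headers={"Authorization": f"token {HUB_TOKEN}"},
        timeout=30,
    )


with sync_playwright() as p:
    browser = p.chromium.launch()
    contexts = [browser.new_context() for _ in USERS]
    # Keep references so the tabs stay open.
    pages = [login_and_open_lab(ctx, user) for ctx, user in zip(contexts, USERS)]

    # Let the frontends settle, then cull every server without closing the tabs,
    # so any extension that keeps polling now hits stale proxy routes.
    time.sleep(60)
    for user in USERS:
        stop_server(user)

    # Watch CHP's connection counts / logs while the abandoned tabs keep retrying.
    time.sleep(30 * 60)
    browser.close()
```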

manics avatar Sep 18 '25 18:09 manics

Hi @manics,

We are using CHP at Berkeley. I am not familiar with Playwright, but I think the problem can be reproduced. Last night I logged into Carl's server (to run the notebook he used in class and see whether anything in it was causing the crashes) and typed several random commands into the AI chat interface. Then I left his server idle (assuming Carl did not log into it again last night). Within an hour or two, I started seeing tons of entries like the following in the proxy logs:

[ConfigProxy] error: 503 GET /user//api/ai/chats Error: connect EADDRNOTAVAIL

I could confirm that his server had already been culled by that time.
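For anyone who wants to quantify this from their own proxy logs, a rough counting script along these lines should work (the log format matched below is an assumption based on the entry above; pipe the proxy logs into it on stdin):

```python
# Count EADDRNOTAVAIL proxy errors per route, reading CHP log lines from stdin.
import re
import sys
from collections import Counter

# Matches lines like:
#   [ConfigProxy] error: 503 GET /user/<name>/api/ai/chats Error: connect EADDRNOTAVAIL
pattern = re.compile(r"503 GET (\S+) Error: connect EADDRNOTAVAIL")

counts = Counter()
for line in sys.stdin:
    match = pattern.search(line)
    if match:
        counts[match.group(1)] += 1

for route, n in counts.most_common(20):
    print(f"{n:8d}  {route}")
```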

yijunge-ucb avatar Sep 18 '25 19:09 yijunge-ucb

Thanks @cboettig for opening this issue and providing all these details! I'm right behind you on deploying j-ai on "my" (i.e. stat159) hub this semester, but I'm keenly watching this space since I would likely run into the same problem immediately. Also tagging @ryanlovett for visibility, as any issues with a stat hub will hit him first as well.

I wonder if the j-ai team has any insight into whether this might be fixed in the 3.0.0 beta branch? We could test a deployment of that release on a staging hub (I'm happy to do it on the 159 one, since I'm otherwise not making many changes and don't really need the staging hub for anything else).

fperez avatar Sep 18 '25 19:09 fperez

@cboettig @fperez Hey, thank you all for reporting this issue. I'm sharing some context & a suggested fix to explore.

Jupyter Collaboration continuously retries the connection to /api/collaboration/room because it cannot distinguish a server shutdown from a transient disconnect caused by network issues. To avoid users losing data after a disconnect, the extension therefore polls that API continuously.

So it's not clear to me that halting re-connection attempts in Jupyter AI would fix the root cause for you, since other extensions re-connect continuously as well. Halting re-connects could also cause a regression for users running Jupyter AI on unstable networks.

Does your proxy server have a heartbeat / timeout mechanism to avoid forwarding requests while the upstream server is shut down? This may be a better fix for your deployment.
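If it helps, here is roughly what that could look like with CHP, configured from jupyterhub_config.py. This is untested and the flag names are from memory, so please verify them against configurable-http-proxy --help for your installed version:

```python
# jupyterhub_config.py -- untested sketch; verify the CHP flags for your version.
c = get_config()  # noqa: F821  (injected by JupyterHub when loading this file)

c.ConfigurableHTTPProxy.command = [
    "configurable-http-proxy",
    # Give up quickly when the target never responds, e.g. because the
    # singleuser server has already been culled.
    "--proxy-timeout", "10000",   # ms waiting for a response from the target
    # Drop idle client connections instead of holding sockets open.
    "--timeout", "30000",         # ms before the proxy drops a request
]
```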

dlqqq avatar Sep 19 '25 18:09 dlqqq

@Zsailer may also have experience with these port issues

ellisonbg avatar Sep 19 '25 19:09 ellisonbg

In pretty much all situations, reconnects should follow an exponential back-off and then eventually give up (or at least settle on a final interval on the order of minutes, ideally with a manual 'reconnect now...' button, as seen in Gmail, etc.), to avoid wasting resources. That said, we should still harden the proxy against ill-behaved extensions and frontends.

JupyterLab's core utilities should perhaps provide a shared helper for reconnecting gracefully. I believe this is already done for API requests, and kernel websockets (formerly the only websockets, which is perhaps why this problem has only grown recently) have this behavior, as, it would appear, does the events/subscribe socket.
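To make the intended schedule concrete, here is a sketch of that back-off (in Python purely for illustration; the real implementation would live in a shared TypeScript utility on the frontend):

```python
# Illustrative back-off schedule: exponential growth with jitter, capped at a
# few minutes, and giving up entirely after a fixed number of attempts.
import random


def reconnect_delays(base=1.0, factor=2.0, cap=300.0, max_attempts=20):
    """Yield delays (seconds) to wait before each reconnect attempt."""
    delay = base
    for _ in range(max_attempts):
        # Full jitter keeps many clients from retrying in lock-step.
        yield random.uniform(0, delay)
        delay = min(delay * factor, cap)


# Roughly 1s, 2s, 4s, ... capped at ~5 minutes, then stop and require an
# explicit "reconnect now" action from the user.
for attempt, delay in enumerate(reconnect_delays(), start=1):
    print(f"attempt {attempt}: wait up to {delay:.1f}s")
```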

minrk avatar Sep 23 '25 19:09 minrk