enterprise_gateway
Number of notebooks with EG
I am using Jupyter Enterprise Gateway (EG) with the Docker Swarm resource manager, and have connected a notebook server (using elyra/nb2kg) to EG.
I opened 150 python_on_docker kernel notebooks and everything worked fine. However, when I tried to open 200 notebooks, it started throwing lots of errors:
[E 22:35:37.263 NotebookApp] Exception writing message to websocket:
[E 22:37:18.007 NotebookApp] Exception writing message to websocket:
[E 22:37:18.009 NotebookApp] Exception writing message to websocket:
Has anyone run a load test to find out how many notebooks (how much load) EG can support?
Environment
- Enterprise Gateway Version 2.0.0.dev3
- Platform: Docker Swarm
@mihirkapadiap - thanks for opening this issue. I believe sites have supported more notebooks, but I'm not sure at what point additional EG servers get configured.
Since the messages you present are logged on the client side, can you provide the log from EG?
And are there any details that follow each of the "Exception writing message to websocket:" messages?
Hi Kevin,
I'm getting socket timeouts. It worked fine up to 168 notebooks or so; after that it started throwing errors. It also restarted Jupyter Enterprise Gateway.
I start EG as:
docker stack deploy -c docker-compose.yml enterprise-gateway
Notebook server:
docker run -t --rm \
-e KG_URL='http://<EG_HOST>:8888' \
-e KG_HTTP_USER=guest \
-e KG_HTTP_PASS=guest-password \
-p 9999:8888 \
-e VALIDATE_KG_CERT='no' \
-e LOG_LEVEL=DEBUG \
-e KG_REQUEST_TIMEOUT=60 \
-e KG_CONNECT_TIMEOUT=60 \
-v ${HOME}/notebooks/:/tmp/notebooks \
-w /tmp/notebooks \
elyra/nb2kg
Removed prefix (enterprise-gateway_enterprise-gateway.1.8r05e9z6m4zm@clv235sl-c342e3) from log lines!
|[D2019-09-0317:49:26.703EnterpriseGatewayApp]activityon9715ded9-f6da-4627-ab12-ec8647a4e70d:execute_input
|[D2019-09-0317:49:26.718EnterpriseGatewayApp]activityon8d53e04e-a0ed-4b90-85a0-22a43acb7dfc:status
|[D2019-09-0317:49:26.821EnterpriseGatewayApp]activityon0da31c85-7f68-487a-aa6a-5ab8cfa81879:execute_result
|[E19090317:50:31ioloop:909]Exceptionincallback<boundmethodKernelRestarter.pollof<jupyter_client.ioloop.restarter.IOLoopKernelRestarterobjectat0x7f449c22d2e8>>
|Traceback(mostrecentcalllast):
|File"/opt/conda/lib/python3.7/site-packages/urllib3/connectionpool.py",line384,in_make_request
|six.raise_from(e,None)
|File"
@mihirkapadiap - thanks for the update. I took the liberty to strip out the prefix such that it doesn't also remove spaces...
[D 2019-09-03 17:49:26.718 EnterpriseGatewayApp] activity on 8d53e04e-a0ed-4b90-85a0-22a43acb7dfc: status
[D 2019-09-03 17:49:26.821 EnterpriseGatewayApp] activity on 0da31c85-7f68-487a-aa6a-5ab8cfa81879: execute_result
[E 190903 17:50:31 ioloop:909] Exception in callback <bound method KernelRestarter.poll of <jupyter_client.ioloop.restarter.IOLoopKernelRestarter object at 0x7f449c22d2e8>>
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/site-packages/urllib3/connectionpool.py", line 384, in _make_request
six.raise_from(e, None)
File "<string>", line 2, in raise_from
File "/opt/conda/lib/python3.7/site-packages/urllib3/connectionpool.py", line 380, in _make_request
httplib_response = conn.getresponse()
File "/opt/conda/lib/python3.7/http/client.py", line 1321, in getresponse
response.begin()
File "/opt/conda/lib/python3.7/http/client.py", line 296, in begin
version, status, reason = self._read_status()
File "/opt/conda/lib/python3.7/http/client.py", line 257, in _read_status
line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
File "/opt/conda/lib/python3.7/socket.py", line 589, in readinto
return self._sock.recv_into(b)
socket.timeout: timed out
enterprise-gateway_enterprise-gateway.1.8r05e9z6m4zm@clv235sl-c342e3 |
During handling of the above exception, another exception occurred:
enterprise-gateway_enterprise-gateway.1.8r05e9z6m4zm@clv235sl-c342e3 |
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/site-packages/requests/adapters.py", line 449, in send
timeout=timeout
File "/opt/conda/lib/python3.7/site-packages/urllib3/connectionpool.py", line 638, in urlopen
_stacktrace=sys.exc_info()[2])
File "/opt/conda/lib/python3.7/site-packages/urllib3/util/retry.py", line 368, in increment
raise six.reraise(type(error), error, _stacktrace)
File "/opt/conda/lib/python3.7/site-packages/urllib3/packages/six.py", line 686, in reraise
raise value
File "/opt/conda/lib/python3.7/site-packages/urllib3/connectionpool.py", line 600, in urlopen
chunked=chunked)
File "/opt/conda/lib/python3.7/site-packages/urllib3/connectionpool.py", line 386, in _make_request
self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
File "/opt/conda/lib/python3.7/site-packages/urllib3/connectionpool.py", line 306, in _raise_timeout
raise ReadTimeoutError(self, url, "Read timed out. (read timeout=%s)" % timeout_value)
urllib3.exceptions.ReadTimeoutError: UnixHTTPConnectionPool(host='localhost', port=None): Read timed out. (read timeout=60)
enterprise-gateway_enterprise-gateway.1.8r05e9z6m4zm@clv235sl-c342e3 |
During handling of the above exception, another exception occurred:
enterprise-gateway_enterprise-gateway.1.8r05e9z6m4zm@clv235sl-c342e3 |
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/site-packages/tornado/ioloop.py", line 907, in _run
return self.callback()
File "/opt/conda/lib/python3.7/site-packages/jupyter_client/restarter.py", line 93, in poll
if not self.kernel_manager.is_alive():
File "/opt/conda/lib/python3.7/site-packages/jupyter_client/manager.py", line 453, in is_alive
if self.kernel.poll() is None:
File "/opt/conda/lib/python3.7/site-packages/enterprise_gateway/services/processproxies/container.py", line 112, in poll
container_status = self.get_container_status(None)
File "/opt/conda/lib/python3.7/site-packages/enterprise_gateway/services/processproxies/docker_swarm.py", line 74, in get_container_status
task = self._get_task()
File "/opt/conda/lib/python3.7/site-packages/enterprise_gateway/services/processproxies/docker_swarm.py", line 58, in _get_task
tasks = service.tasks(filters={'desired-state': 'running'})
File "/opt/conda/lib/python3.7/site-packages/docker/models/services.py", line 54, in tasks
return self.client.api.tasks(filters=filters)
File "/opt/conda/lib/python3.7/site-packages/docker/utils/decorators.py", line 34, in wrapper
return f(self, *args, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/docker/api/service.py", line 358, in tasks
return self._result(self._get(url, params=params), True)
File "/opt/conda/lib/python3.7/site-packages/docker/utils/decorators.py", line 46, in inner
return f(self, *args, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/docker/api/client.py", line 230, in _get
return self.get(url, **self._set_request_timeout(kwargs))
File "/opt/conda/lib/python3.7/site-packages/requests/sessions.py", line 546, in get
return self.request('GET', url, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/requests/sessions.py", line 533, in request
resp = self.send(prep, **send_kwargs)
File "/opt/conda/lib/python3.7/site-packages/requests/sessions.py", line 646, in send
r = adapter.send(request, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/requests/adapters.py", line 529, in send
raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: UnixHTTPConnectionPool(host='localhost', port=None): Read timed out. (read timeout=60)
[E 190903 17:52:46 ioloop:909] Exception in callback <bound method KernelRestarter.poll of <jupyter_client.ioloop.restarter.IOLoopKernelRestarter object at 0x7f449c744710>>
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/site-packages/urllib3/connectionpool.py", line 384, in _make_request
six.raise_from(e, None)
File "<string>", line 2, in raise_from
File "/opt/conda/lib/python3.7/site-packages/urllib3/connectionpool.py", line 380, in _make_request
httplib_response = conn.getresponse()
File "/opt/conda/lib/python3.7/http/client.py", line 1321, in getresponse
response.begin()
File "/opt/conda/lib/python3.7/http/client.py", line 296, in begin
version, status, reason = self._read_status()
File "/opt/conda/lib/python3.7/http/client.py", line 257, in _read_status
line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
File "/opt/conda/lib/python3.7/socket.py", line 589, in readinto
return self._sock.recv_into(b)
socket.timeout: timed out
enterprise-gateway_enterprise-gateway.1.8r05e9z6m4zm@clv235sl-c342e3 |
During handling of the above exception, another exception occurred:
enterprise-gateway_enterprise-gateway.1.8r05e9z6m4zm@clv235sl-c342e3 |
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/site-packages/requests/adapters.py", line 449, in send
timeout=timeout
File "/opt/conda/lib/python3.7/site-packages/urllib3/connectionpool.py", line 638, in urlopen
_stacktrace=sys.exc_info()[2])
File "/opt/conda/lib/python3.7/site-packages/urllib3/util/retry.py", line 368, in increment
raise six.reraise(type(error), error, _stacktrace)
File "/opt/conda/lib/python3.7/site-packages/urllib3/packages/six.py", line 686, in reraise
raise value
File "/opt/conda/lib/python3.7/site-packages/urllib3/connectionpool.py", line 600, in urlopen
chunked=chunked)
File "/opt/conda/lib/python3.7/site-packages/urllib3/connectionpool.py", line 386, in _make_request
self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
File "/opt/conda/lib/python3.7/site-packages/urllib3/connectionpool.py", line 306, in _raise_timeout
raise ReadTimeoutError(self, url, "Read timed out. (read timeout=%s)" % timeout_value)
urllib3.exceptions.ReadTimeoutError: UnixHTTPConnectionPool(host='localhost', port=None): Read timed out. (read timeout=60)
enterprise-gateway_enterprise-gateway.1.8r05e9z6m4zm@clv235sl-c342e3 |
During handling of the above exception, another exception occurred:
enterprise-gateway_enterprise-gateway.1.8r05e9z6m4zm@clv235sl-c342e3 |
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/site-packages/tornado/ioloop.py", line 907, in _run
return self.callback()
File "/opt/conda/lib/python3.7/site-packages/jupyter_client/restarter.py", line 93, in poll
if not self.kernel_manager.is_alive():
File "/opt/conda/lib/python3.7/site-packages/jupyter_client/manager.py", line 453, in is_alive
if self.kernel.poll() is None:
File "/opt/conda/lib/python3.7/site-packages/enterprise_gateway/services/processproxies/container.py", line 112, in poll
container_status = self.get_container_status(None)
File "/opt/conda/lib/python3.7/site-packages/enterprise_gateway/services/processproxies/docker_swarm.py", line 74, in get_container_status
task = self._get_task()
File "/opt/conda/lib/python3.7/site-packages/enterprise_gateway/services/processproxies/docker_swarm.py", line 56, in _get_task
service = self._get_service()
File "/opt/conda/lib/python3.7/site-packages/enterprise_gateway/services/processproxies/docker_swarm.py", line 40, in _get_service
services = client.services.list(filters={'label': 'kernel_id=' + self.kernel_id})
File "/opt/conda/lib/python3.7/site-packages/docker/models/services.py", line 269, in list
for s in self.client.api.services(**kwargs)
File "/opt/conda/lib/python3.7/site-packages/docker/utils/decorators.py", line 34, in wrapper
return f(self, *args, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/docker/api/service.py", line 284, in services
return self._result(self._get(url, params=params), True)
File "/opt/conda/lib/python3.7/site-packages/docker/utils/decorators.py", line 46, in inner
return f(self, *args, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/docker/api/client.py", line 230, in _get
return self.get(url, **self._set_request_timeout(kwargs))
File "/opt/conda/lib/python3.7/site-packages/requests/sessions.py", line 546, in get
return self.request('GET', url, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/requests/sessions.py", line 533, in request
resp = self.send(prep, **send_kwargs)
File "/opt/conda/lib/python3.7/site-packages/requests/sessions.py", line 646, in send
r = adapter.send(request, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/requests/adapters.py", line 529, in send
raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: UnixHTTPConnectionPool(host='localhost', port=None): Read timed out. (read timeout=60)
[E 190903 17:53:50 ioloop:909] Exception in callback <bound method KernelRestarter.poll of <jupyter_client.ioloop.restarter.IOLoopKernelRestarter object at 0x7f449c034d30>>
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/site-packages/urllib3/connectionpool.py", line 384, in _make_request
six.raise_from(e, None)
File "<string>", line 2, in raise_from
File "/opt/conda/lib/python3.7/site-packages/urllib3/connectionpool.py", line 380, in _make_request
httplib_response = conn.getresponse()
File "/opt/conda/lib/python3.7/http/client.py", line 1321, in getresponse
response.begin()
File "/opt/conda/lib/python3.7/http/client.py", line 296, in begin
version, status, reason = self._read_status()
File "/opt/conda/lib/python3.7/http/client.py", line 257, in _read_status
line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
File "/opt/conda/lib/python3.7/socket.py", line 589, in readinto
return self._sock.recv_into(b)
socket.timeout: timed out
enterprise-gateway_enterprise-gateway.1.8r05e9z6m4zm@clv235sl-c342e3 |
During handling of the above exception, another exception occurred:
enterprise-gateway_enterprise-gateway.1.8r05e9z6m4zm@clv235sl-c342e3 |
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/site-packages/requests/adapters.py", line 449, in send
timeout=timeout
File "/opt/conda/lib/python3.7/site-packages/urllib3/connectionpool.py", line 638, in urlopen
_stacktrace=sys.exc_info()[2])
File "/opt/conda/lib/python3.7/site-packages/urllib3/util/retry.py", line 368, in increment
raise six.reraise(type(error), error, _stacktrace)
File "/opt/conda/lib/python3.7/site-packages/urllib3/packages/six.py", line 686, in reraise
raise value
File "/opt/conda/lib/python3.7/site-packages/urllib3/connectionpool.py", line 600, in urlopen
chunked=chunked)
File "/opt/conda/lib/python3.7/site-packages/urllib3/connectionpool.py", line 386, in _make_request
self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
File "/opt/conda/lib/python3.7/site-packages/urllib3/connectionpool.py", line 306, in _raise_timeout
raise ReadTimeoutError(self, url, "Read timed out. (read timeout=%s)" % timeout_value)
urllib3.exceptions.ReadTimeoutError: UnixHTTPConnectionPool(host='localhost', port=None): Read timed out. (read timeout=60)
enterprise-gateway_enterprise-gateway.1.8r05e9z6m4zm@clv235sl-c342e3 |
During handling of the above exception, another exception occurred:
enterprise-gateway_enterprise-gateway.1.8r05e9z6m4zm@clv235sl-c342e3 |
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/site-packages/tornado/ioloop.py", line 907, in _run
return self.callback()
File "/opt/conda/lib/python3.7/site-packages/jupyter_client/restarter.py", line 93, in poll
if not self.kernel_manager.is_alive():
File "/opt/conda/lib/python3.7/site-packages/jupyter_client/manager.py", line 453, in is_alive
if self.kernel.poll() is None:
File "/opt/conda/lib/python3.7/site-packages/enterprise_gateway/services/processproxies/container.py", line 112, in poll
container_status = self.get_container_status(None)
File "/opt/conda/lib/python3.7/site-packages/enterprise_gateway/services/processproxies/docker_swarm.py", line 74, in get_container_status
task = self._get_task()
File "/opt/conda/lib/python3.7/site-packages/enterprise_gateway/services/processproxies/docker_swarm.py", line 56, in _get_task
service = self._get_service()
File "/opt/conda/lib/python3.7/site-packages/enterprise_gateway/services/processproxies/docker_swarm.py", line 40, in _get_service
services = client.services.list(filters={'label': 'kernel_id=' + self.kernel_id})
File "/opt/conda/lib/python3.7/site-packages/docker/models/services.py", line 269, in list
for s in self.client.api.services(**kwargs)
File "/opt/conda/lib/python3.7/site-packages/docker/utils/decorators.py", line 34, in wrapper
return f(self, *args, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/docker/api/service.py", line 284, in services
return self._result(self._get(url, params=params), True)
File "/opt/conda/lib/python3.7/site-packages/docker/utils/decorators.py", line 46, in inner
return f(self, *args, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/docker/api/client.py", line 230, in _get
return self.get(url, **self._set_request_timeout(kwargs))
File "/opt/conda/lib/python3.7/site-packages/requests/sessions.py", line 546, in get
return self.request('GET', url, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/requests/sessions.py", line 533, in request
resp = self.send(prep, **send_kwargs)
File "/opt/conda/lib/python3.7/site-packages/requests/sessions.py", line 646, in send
r = adapter.send(request, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/requests/adapters.py", line 529, in send
raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: UnixHTTPConnectionPool(host='localhost', port=None): Read timed out. (read timeout=60)
What's curious to me is that Swarm appears to be having trouble performing this lookup...
File "/opt/conda/lib/python3.7/site-packages/enterprise_gateway/services/processproxies/docker_swarm.py", line 40, in _get_service
services = client.services.list(filters={'label': 'kernel_id=' + self.kernel_id})
and I'm wondering if Swarm itself is overloaded? The poll() method is used to determine if a given kernel is still alive and, if it thinks the kernel is not running, EG/Jupyter attempts to restart the kernel.
Are the Swarm or System logs showing any anomalies? (Not sure what additional logging Swarm might provide.)
It might be interesting to issue docker service ls when this condition occurs.
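To make the poll/restart relationship concrete, here is a minimal sketch of the chain visible in the traceback (restarter poll, then is_alive, then the process proxy's poll). The classes are simplified stand-ins, not the real jupyter_client or Enterprise Gateway classes, and the non-None return value for a dead kernel is an assumption for illustration. In the logs above the Swarm lookup times out inside the status check, so the check itself raises instead of returning a result.

```python
# Simplified stand-ins; NOT the real jupyter_client / Enterprise Gateway classes.
class FakeProcessProxy:
    """Mimics a process proxy whose poll() returns None while the kernel runs."""

    def __init__(self, running=True):
        self.running = running

    def poll(self):
        # Popen-style convention, as used by is_alive() in the traceback:
        # None means "still running". The value 0 for "dead" is an assumption.
        return None if self.running else 0


class FakeKernelManager:
    def __init__(self, proxy):
        self.kernel = proxy
        self.restart_count = 0

    def is_alive(self):
        return self.kernel.poll() is None

    def restart_kernel(self):
        self.restart_count += 1
        self.kernel.running = True


def restarter_poll(manager):
    """One pass of the periodic check: restart only if the kernel looks dead."""
    if not manager.is_alive():
        manager.restart_kernel()
        return "restarted"
    return "alive"
```

With around 200 kernels, one such check runs periodically per kernel, and each one issues at least one Swarm API call, which is why an overloaded Swarm manager would show up here first.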
Hi Kevin,
Thanks for such a quick response! You are amazing!
I checked /var/log/messages and I see some timeouts happening!
I will recreate the scenario, try docker service ls, and report what happens.
Sep 3 10:53:14
I also have some questions about your comment "not sure at what point additional EG servers get configured." I am trying to understand the role of each notebook server, notebook kernel, and EG. Is every request routed via EG? I am going through the logs and code a bit; let me get back to you with specific questions once I have a high-level understanding.
@mihirkapadiap - thanks for checking the lower-level logs and for planning to run docker service ls once back in this condition.
With the NB2KG server extension[*], the Notebook server that serves users' Notebook sessions has been configured to relay all kernel management and kernelspec operations to KG_URL. There, Enterprise Gateway (built on Jupyter Kernel Gateway) listens for those requests to retrieve kernelspecs and manage kernels. In addition, all traffic to/from the kernel itself also gets sent via a websocket to the gateway server, where it's split into/from the appropriate one of the 5 ZeroMQ ports.
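The websocket-to-ZeroMQ split can be pictured with a small sketch. It assumes each websocket message is a JSON object carrying a "channel" field, and it uses plain lists as stand-ins for the per-kernel ZeroMQ sockets. The function name and the handling of the heartbeat port are assumptions for illustration, not the gateway's actual code.

```python
import json

# Channel names assumed to appear in the "channel" field of each message.
# The heartbeat socket is the fifth ZeroMQ port; this sketch assumes it is
# handled by the gateway itself rather than relayed over the websocket.
WS_CHANNELS = ("shell", "iopub", "stdin", "control")


def route_ws_message(raw, sockets):
    """Decode one websocket message and append it to its channel's queue."""
    msg = json.loads(raw)
    channel = msg.get("channel")
    if channel not in sockets:
        raise ValueError(f"unknown channel: {channel!r}")
    sockets[channel].append(msg)
    return channel


# Plain lists stand in for the per-kernel ZeroMQ sockets.
sockets = {name: [] for name in WS_CHANNELS}
raw = json.dumps({"channel": "shell", "header": {"msg_type": "execute_request"}})
routed_to = route_ws_message(raw, sockets)
```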
Enterprise Gateway is essentially Kernel Gateway, but with additional support for ProcessProxies - which facilitate lifecycle management of a given kernel relative to a given resource manager (in your case, Docker Swarm).
If you were to configure multiple EG servers and placed a load balancer application in front, then each kernel request could be distributed across the various EG servers. Keep in mind, however, that you'd need to configure the load-balancer to use sticky sessions, because we want the entire lifecycle of a given kernel to remain hosted by the same EG instance (at least until we support active/active HA).
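As a rough illustration of the sticky-session requirement, the sketch below pins each kernel to the gateway instance that started it. The class, its method names, and the gateway URLs are hypothetical; a real deployment would rely on the load balancer's own session-affinity feature rather than code like this.

```python
import itertools


class StickyKernelRouter:
    """Pins each kernel to the gateway instance that started it (hypothetical)."""

    def __init__(self, gateways):
        self._next = itertools.cycle(gateways)  # round-robin for new kernels
        self._owner = {}  # kernel_id -> gateway URL

    def assign(self, kernel_id):
        """Pick a gateway for a new kernel and remember the choice."""
        self._owner[kernel_id] = next(self._next)
        return self._owner[kernel_id]

    def lookup(self, kernel_id):
        """Return the gateway that owns an existing kernel."""
        return self._owner[kernel_id]

    def release(self, kernel_id):
        """Forget a kernel once it has been shut down."""
        self._owner.pop(kernel_id, None)


# Hypothetical gateway URLs, for illustration only.
router = StickyKernelRouter(["http://eg-1:8888", "http://eg-2:8888"])
first = router.assign("kernel-a")
second = router.assign("kernel-b")
```

New kernels are spread across instances, but every later request for kernel-a (interrupt, restart, websocket traffic) goes back to the instance returned by lookup.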
I hope that helps.
[*]: Please note that in Notebook 6.0 the entire NB2KG extension and its configuration can be replaced with the command line option --gateway-url=<KG_URL>. Some of the other env names have been changed as well. Just thought I'd mention this.
@kevin-bates
Ran again for 200 kernels!
I did see some errors in /var/log/messages again.
However, I was able to run docker service ls! I ran it maybe a couple of minutes after the errors appeared. It seems that REPLICAS became 0/1 for all the notebook kernel services (all notebooks were lost; the kernels died).
docker service ls
ID            NAME                                        MODE        REPLICAS  IMAGE                         PORTS
cz9y8ekg2w7i  enterprise-gateway_enterprise-gateway       replicated  1/1       elyra/enterprise-gateway:dev
2867t16cjx3k  guest-0cba85a3-1df3-4515-a74b-71268a78c2ca  replicated  0/1       elyra/kernel-py:dev
se89667ot341  guest-0d31d807-cb08-42c5-8d71-b9c81c6b5876  replicated  0/1       elyra/kernel-py:dev
k2ybqapgpmg0  guest-0dffc888-c70b-4022-9cd5-52e7dcf4c031  replicated  0/1       elyra/kernel-py:dev
@kevin-bates
How do I run Notebook 6? Do I just download Notebook version 6 and run it? And with version 6, will it still use EG to route requests and do management (as before version 6)?
I see an example at https://jupyter-enterprise-gateway.readthedocs.io/en/latest/getting-started.html
This is what I am using right now:
docker run -t --rm
-e KG_URL='http://
-e KG_HTTP_USER=guest
-e KG_HTTP_PASS=guest-password
-p 8888:8888
-e VALIDATE_KG_CERT='no'
-e LOG_LEVEL=DEBUG
-e KG_REQUEST_TIMEOUT=40
-e KG_CONNECT_TIMEOUT=40
-v ${HOME}/notebooks/:/tmp/notebooks
-w /tmp/notebooks
elyra/nb2kg
Another issue I see is that when starting EG in Swarm, it needs to start on a manager node. Sometimes docker ps shows it running on a non-leader node, and then nothing works (EG cannot run commands like docker.service.list on a non-manager node)!
docker stack deploy -c docker-compose.yml enterprise-gateway
This is my understanding at a high level:
- On the Notebook server, the user requests a new Python on Docker kernel.
- The Notebook server sends a request for a new kernel (HTTP).
- JEG (Enterprise Gateway) launches the new kernel using the launch_docker.py script, passing the kernel ID and a response socket (IP and port).
- JEG receives the connection info for the kernel (ID, host, various ports, etc.).
- JEG passes the kernel_id back to the Notebook server.
- The Notebook server requests further kernel info and connects using a websocket.
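The two client-facing steps in the list above (the HTTP request for a new kernel, then the websocket connection) can be sketched as plain request descriptions. The /api/kernels and /api/kernels/<id>/channels paths appear in the logs in this thread; the helper names, the host eg-host, and the {"name": ...} request body are assumptions for illustration, and nothing here touches the network.

```python
import json
from urllib.parse import urlsplit, urlunsplit


def start_kernel_request(gateway_url, kernel_name):
    """Describe the HTTP request that asks the gateway for a new kernel."""
    url = gateway_url.rstrip("/") + "/api/kernels"
    body = json.dumps({"name": kernel_name})
    return "POST", url, body


def channels_ws_url(gateway_url, kernel_id):
    """Build the websocket URL used for all traffic to and from one kernel."""
    parts = urlsplit(gateway_url.rstrip("/"))
    scheme = "wss" if parts.scheme == "https" else "ws"
    path = f"{parts.path}/api/kernels/{kernel_id}/channels"
    return urlunsplit((scheme, parts.netloc, path, "", ""))


# "eg-host" is a placeholder host name.
method, url, body = start_kernel_request("http://eg-host:8888", "python_on_docker")
ws_url = channels_ws_url("http://eg-host:8888", "93b5f2bc-6999-4886-835d-ac00dd33f91b")
```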
You mentioned "There, Enterprise Gateway (built on Jupyter Kernel Gateway), listens for those requests to retrieve kernelspecs and manage kernels. In addition, all traffic to/from the kernel itself, also gets sent via a websocket to the gateway server, where its split into/from the appropriate of the 5 ZeroMQ ports."
Can you shed some light on how notebook server is getting info from EG?
Thanks in advance.
2867t16cjx3k  guest-0cba85a3-1df3-4515-a74b-71268a78c2ca  replicated  0/1  elyra/kernel-py:dev
se89667ot341  guest-0d31d807-cb08-42c5-8d71-b9c81c6b5876  replicated  0/1  elyra/kernel-py:dev
k2ybqapgpmg0  guest-0dffc888-c70b-4022-9cd5-52e7dcf4c031  replicated  0/1  elyra/kernel-py:dev
Could you try getting at the logs from one of these containers?
How do I run Notebook 6? Do I just download Notebook version 6 and run it? And with version 6, will it still use EG to route requests and do management (as before version 6)?
Since you're already using elyra/nb2kg, it might be easier to stick with that. However, if you had a pure notebook image (w/o the NB2KG extension) you could just set -e JUPYTER_GATEWAY_URL='http://:8888' -e JUPYTER_GATEWAY_HTTP_USER=guest -e JUPYTER_GATEWAY_HTTP_PWD=guest-password. If you just ran notebook from the command line, jupyter notebook --gateway-url=http://:8888 should do the trick also, but let's not worry about that unless you really need 6.0 functionality not provided in < 6.0.
Another issue I see is that when starting EG in Swarm, it needs to start on a manager node. Sometimes docker ps shows it running on a non-leader node, and then nothing works (EG cannot run commands like docker.service.list on a non-manager node)!
Yeah, we honestly haven't spent a lot of time with the Swarm offering. It mostly "came for free" after getting the Kubernetes work done. Having the ability to run service list is a key function in order to discover where the kernels "landed" and to perform lifecycle management. I believe you can configure multiple manager nodes if multiple EGs couldn't "fit" on a single manager. If you see places that should be changed, please feel free to open issues for discussion and contribute pull requests. We'd be happy to help out.
Can you shed some light on how notebook server is getting info from EG?
For HTTP requests (requests to start, stop, interrupt, restart kernels and GET kernelspecs), the responses are returned via HTTP responses. For websocket operations, they're returned via the websocket. All responses (and requests) relay through the NB2KG layer since that's what performs the redirection.
@kevin-bates Hi,
I tried with a bigger Swarm cluster.
I get a "Too many open files" error. I changed ulimit -n from 1024 to 4096 on all hosts, but I still get the same error after I open about 210 kernels! I added the following to /etc/security/limits.conf:
* hard nofile 4096
* soft nofile 4096
For the Enterprise Gateway service itself, ulimit -n is already a very high value (about 1 million).
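Since the limit that matters is the one seen by the process that raises the error, it can help to check it from inside that process or container rather than from a host shell. The following is a minimal sketch using only the Python standard library; the helper name fd_usage is made up for this example, and the /proc/self/fd count is Linux-specific.

```python
import os
import resource


def fd_usage():
    """Return (open_fds, soft_limit, hard_limit) for the current process."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    try:
        # Linux-only: each entry in /proc/self/fd is one open descriptor.
        open_fds = len(os.listdir("/proc/self/fd"))
    except OSError:
        open_fds = None  # not available on this platform
    return open_fds, soft, hard


open_fds, soft, hard = fd_usage()
print(f"open={open_fds} soft={soft} hard={hard}")
```

Running this in the EG container, a kernel container, and the notebook server while kernels are being opened should show which process is actually approaching its soft limit.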
[D 2019-09-04 21:46:52.247 EnterpriseGatewayApp] Connecting to: tcp://10.0.0.214:60683
[E 190904 21:46:52 web:1788] Uncaught exception POST /api/kernels (172.18.0.1)
HTTPServerRequest(protocol='http', host='
Thanks for the update (sigh). Yeah, given the reports of various folks running into this outside of EG and the fact that EG is built on that same code base, I wouldn't be surprised if there's a leak somewhere.
I also suspect this isn't limited to Swarm.
EDIT: Hmm - but you're not terminating kernels - correct? You're just seeing how many can be running concurrently. If so, then there's really nothing to leak - unless there's some kind of connection thing happening for each kernel.
@kevin-bates
Any ideas?
I tried creating 25 notebook servers, each opening 10 kernels. I see lots of kernels stuck or throwing kernel errors. In the notebook server logs I see timeouts.
I did increase the timeouts to 180:
-e KG_REQUEST_TIMEOUT=180
-e KG_CONNECT_TIMEOUT=180
EG logs (partial):
[D 2019-09-09 19:59:00.796 EnterpriseGatewayApp] Received connection info for KernelID '93b5f2bc-6999-4886-835d-ac00dd33f91b' from host 'guest-93b5f2bc-6999-4886-835d-ac00dd33f91b': {'shell_port': 46489, 'iopub_port': 37879, 'stdin_port': 54059, 'control_port': 49080, 'hb_port': 40465, 'ip': '10.0.0.52', 'key': '0542c70a-77ff-4352-9aeb-cc4294cfce72', 'transport': 'tcp', 'signature_scheme': 'hmac-sha256', 'kernel_name': '', 'comm_port': 51033}...
[D 2019-09-09 19:59:00.799 EnterpriseGatewayApp] Connecting to: tcp://10.0.0.52:49080
[D 2019-09-09 19:59:00.800 EnterpriseGatewayApp] Connecting to: tcp://10.0.0.52:37879
[I 2019-09-09 19:59:00.802 EnterpriseGatewayApp] Kernel started: 93b5f2bc-6999-4886-835d-ac00dd33f91b
[D 2019-09-09 20:02:01.469 EnterpriseGatewayApp] Initializing websocket connection /api/kernels/93b5f2bc-6999-4886-835d-ac00dd33f91b/channels
[W 2019-09-09 20:04:04.630 EnterpriseGatewayApp] Timeout waiting for kernel_info reply from 93b5f2bc-6999-4886-835d-ac00dd33f91b
Notebook server logs:
[I 20:00:40.756 NotebookApp] Connecting to ws://<eg_host>:8888/api/kernels/93b5f2bc-6999-4886-835d-ac00dd33f91b/channels
Exception in callback KernelGatewayWSClient._connection_done(<Future finis...uring request>)
handle: <Handle KernelGatewayWSClient._connection_done(<Future finis...uring request>)>
Traceback (most recent call last):
File "/opt/conda/lib/python3.6/asyncio/events.py", line 145, in _run
self._callback(*self._args)
File "/opt/conda/lib/python3.6/site-packages/nb2kg/handlers.py", line 169, in _connection_done
self.ws = fut.result()
File "/opt/conda/lib/python3.6/site-packages/tornado/stack_context.py", line 315, in wrapped
ret = fn(*args, **kwargs)
File "/opt/conda/lib/python3.6/site-packages/tornado/simple_httpclient.py", line 271, in _on_timeout
raise HTTPError(599, error_message)
tornado.httpclient.HTTPError: HTTP 599: Timeout during request
@mihirkapadiap #580 is WIP, but it is effective at resolving the delayed kernel start issue.
Now I'm looking into how to resolve this "concurrency" issue and "scalability" issue. https://discourse.jupyter.org/t/scalable-enterprise-gateway/2014/7
EDIT: I've just posted the discourse issue to #732. Let's discuss and find a better architecture!