sparkmonitor
AttributeError: 'ScalaMonitor' object has no attribute 'comm'
Summary of Issue
Note - Related container log error: ERROR JupyterSparkMonitorListener: Exception sending socket message:
Hello! We are running a JupyterHub service that spawns Kubernetes pods for single-user servers. We also run our JupyterHub proxy as a separate service. We've been experiencing a very persistent, but flaky, issue with the Sparkmonitor extension. When creating a Spark session, I see the following error: AttributeError: 'ScalaMonitor' object has no attribute 'comm'. However, this does not happen consistently. I am often able to create a session where the monitor works, but then if I restart the kernel, the monitor is broken with the above error (or vice versa).
For example, on my first attempt at creating a spark session, the extension works fine. Container log:
INFO:SparkMonitorKernel:Starting Kernel Extension
INFO:SparkMonitorKernel:Socket Listening on port 59845
INFO:SparkMonitorKernel:Starting socket thread, going to accept
INFO:SparkMonitorKernel:59845
INFO:SparkMonitorKernel:Adding jar from /work/venv_core_ml_kernel/lib/python3.7/site-packages/sparkmonitor/listener_2.12.jar
[I 2021-12-21 23:45:52.892 SingleUserNotebookApp log:189] 101 GET /user/username/api/kernels/e5800417-4831-4bb4-9eec-85d7fb16b388/channels?session_id=aca66ed3-e383-4dfd-bd21-eae11e96ccb2 ([email protected]) 1877.03ms
[I 2021-12-21 23:45:53.002 SingleUserNotebookApp log:189] 101 GET /user/username/api/kernels/e5800417-4831-4bb4-9eec-85d7fb16b388/channels?session_id=26868351-57cd-43c6-b14b-c85bd1842316 ([email protected]) 2.18ms
[I 2021-12-21 23:45:53.147 SingleUserNotebookApp log:189] 101 GET /user/username/api/kernels/e5800417-4831-4bb4-9eec-85d7fb16b388/channels?session_id=0ccb7d08-e61b-44e6-a08c-57a36d2797ed ([email protected]) 2.12ms
[I 2021-12-21 23:45:55.711 SingleUserNotebookApp log:189] 200 GET /user/username/api/terminals?1640130355597 ([email protected]) 1.13ms
[I 2021-12-21 23:45:56.253 SingleUserNotebookApp log:189] 200 GET /user/username/api/sessions?1640130356141 ([email protected]) 1.09ms
[I 2021-12-21 23:45:56.254 SingleUserNotebookApp log:189] 200 GET /user/username/api/kernels?1640130356142 ([email protected]) 1.01ms
INFO:SparkMonitorKernel:SparkMonitor comm opened from frontend.
INFO:SparkMonitorKernel:SparkMonitor comm opened from frontend.
[I 2021-12-21 23:46:03.074 SingleUserNotebookApp log:189] 200 GET /user/username/api/contents?content=1&1640130360080 ([email protected]) 2874.67ms
[I 2021-12-21 23:46:05.820 SingleUserNotebookApp log:189] 200 GET /user/username/api/terminals?1640130365706 ([email protected]) 1.36ms
[I 2021-12-21 23:46:05.981 SingleUserNotebookApp log:189] 200 GET /user/username/api/kernelspecs?1640130365868 ([email protected]) 2.78ms
[I 2021-12-21 23:46:07.068 SingleUserNotebookApp log:189] 200 GET /user/username/api/sessions?1640130366954 ([email protected]) 1.18ms
[I 2021-12-21 23:46:07.070 SingleUserNotebookApp log:189] 200 GET /user/username/api/kernels?1640130366956 ([email protected]) 1.06ms
[I 2021-12-21 23:46:16.011 SingleUserNotebookApp log:189] 200 GET /user/username/api/contents?content=1&1640130373104 ([email protected]) 2795.89ms
[I 2021-12-21 23:46:16.014 SingleUserNotebookApp log:189] 200 GET /user/username/api/terminals?1640130375820 ([email protected]) 0.98ms
[I 2021-12-21 23:46:17.181 SingleUserNotebookApp log:189] 200 GET /user/username/api/sessions?1640130377069 ([email protected]) 1.22ms
[I 2021-12-21 23:46:17.182 SingleUserNotebookApp log:189] 200 GET /user/username/api/kernels?1640130377071 ([email protected]) 0.90ms
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
I1221 23:46:20.384598 242 sched.cpp:232] Version: 1.7.2
I1221 23:46:20.385978 225 sched.cpp:336] New master detected at [email protected]:5050
I1221 23:46:20.386255 225 sched.cpp:401] Authenticating with master [email protected]:5050
I1221 23:46:20.386270 225 sched.cpp:408] Using default CRAM-MD5 authenticatee
I1221 23:46:20.386425 234 authenticatee.cpp:97] Initializing client SASL
I1221 23:46:20.387233 234 authenticatee.cpp:121] Creating new client SASL connection
W1221 23:46:20.397209 241 process.cpp:838] Failed to recv on socket 437 to peer '36.72.23.53:34312': Decoder error
I1221 23:46:20.398273 229 authenticatee.cpp:213] Received SASL authentication mechanisms: CRAM-MD5
I1221 23:46:20.398306 229 authenticatee.cpp:239] Attempting to authenticate with mechanism 'CRAM-MD5'
I1221 23:46:20.398959 238 authenticatee.cpp:259] Received SASL authentication step
I1221 23:46:20.399524 239 authenticatee.cpp:299] Authentication success
I1221 23:46:20.399653 236 sched.cpp:513] Successfully authenticated with master [email protected]:5050
I1221 23:46:20.401082 230 sched.cpp:744] Framework registered with 195f7369-64e5-4bc5-a8ed-8b576994450a-11035
21/12/21 23:46:20 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
INFO:SparkMonitorKernel:Client Connected ('127.0.0.1', 35478)
[I 2021-12-21 23:46:26.141 SingleUserNotebookApp log:189] 200 GET /user/username/api/terminals?1640130386026 ([email protected]) 1.00ms
[I 2021-12-21 23:46:28.410 SingleUserNotebookApp log:189] 200 GET /user/username/api/contents?content=1&1640130386068 ([email protected]) 2230.40ms
[I 2021-12-21 23:46:28.416 SingleUserNotebookApp log:189] 200 GET /user/username/api/sessions?1640130387180 ([email protected]) 3.12ms
[I 2021-12-21 23:46:28.417 SingleUserNotebookApp log:189] 200 GET /user/username/api/kernels?1640130387182 ([email protected]) 3.31ms
[I 2021-12-21 23:46:39.097 SingleUserNotebookApp log:189] 200 GET /user/username/api/contents?content=1&1640130396071 ([email protected]) 2914.93ms
[I 2021-12-21 23:46:39.100 SingleUserNotebookApp log:189] 200 GET /user/username/api/terminals?1640130396136 ([email protected]) 1.93ms
[I 2021-12-21 23:46:39.101 SingleUserNotebookApp log:189] 200 GET /user/username/api/sessions?1640130398473 ([email protected]) 2.26ms
[I 2021-12-21 23:46:39.101 SingleUserNotebookApp log:189] 200 GET /user/username/api/kernels?1640130398474 ([email protected]) 2.46ms
/source/virtualenv_run_jupyter/lib/python3.7/site-packages/jupyter_client/manager.py:358: FutureWarning: Method cleanup(connection_file=True) is deprecated, use cleanup_resources(restart=False).
FutureWarning)
[I 2021-12-21 23:46:39.217 SingleUserNotebookApp multikernelmanager:534] Kernel restarted: e5800417-4831-4bb4-9eec-85d7fb16b388
But if I restart the kernel and try again, the extension fails with AttributeError: 'ScalaMonitor' object has no attribute 'comm'
(Note: it does not consistently follow this pattern of working the first time and failing the second. Often the first session will fail and then it works after a kernel restart, or it will fail three times in a row and then work.)
INFO:SparkMonitorKernel:Client Connected ('127.0.0.1', 35472)
Exception in thread Thread-8:
Traceback (most recent call last):
File "/usr/lib/python3.7/threading.py", line 926, in _bootstrap_inner
self.run()
File "/work/venv_core_ml_kernel/lib/python3.7/site-packages/sparkmonitor/kernelextension.py", line 126, in run
self.onrecv(msg)
File "/work/venv_core_ml_kernel/lib/python3.7/site-packages/sparkmonitor/kernelextension.py", line 145, in onrecv
'msg': msg
File "/work/venv_core_ml_kernel/lib/python3.7/site-packages/sparkmonitor/kernelextension.py", line 223, in sendToFrontEnd
monitor.send(msg)
File "/work/venv_core_ml_kernel/lib/python3.7/site-packages/sparkmonitor/kernelextension.py", line 57, in send
self.comm.send(msg)
AttributeError: 'ScalaMonitor' object has no attribute 'comm'
Container log on failure:
[I 2021-12-21 23:46:39.219 SingleUserNotebookApp log:189] 200 POST /user/username/api/kernels/e5800417-4831-4bb4-9eec-85d7fb16b388/restart?1640130391290 ([email protected]) 7817.75ms
[I 2021-12-21 23:46:39.388 SingleUserNotebookApp log:189] 101 GET /user/username/api/kernels/e5800417-4831-4bb4-9eec-85d7fb16b388/channels?session_id=26868351-57cd-43c6-b14b-c85bd1842316 ([email protected]) 2.70ms
INFO:SparkMonitorKernel:Starting Kernel Extension
INFO:SparkMonitorKernel:Socket Listening on port 53939
INFO:SparkMonitorKernel:Starting socket thread, going to accept
INFO:SparkMonitorKernel:53939
INFO:SparkMonitorKernel:Adding jar from /work/venv_core_ml_kernel/lib/python3.7/site-packages/sparkmonitor/listener_2.12.jar
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
I1221 23:46:45.923640 389 sched.cpp:232] Version: 1.7.2
I1221 23:46:45.925024 378 sched.cpp:336] New master detected at [email protected]:5050
I1221 23:46:45.925308 378 sched.cpp:401] Authenticating with master [email protected]:5050
I1221 23:46:45.925323 378 sched.cpp:408] Using default CRAM-MD5 authenticatee
I1221 23:46:45.925534 382 authenticatee.cpp:97] Initializing client SASL
I1221 23:46:45.926290 382 authenticatee.cpp:121] Creating new client SASL connection
W1221 23:46:45.928249 388 process.cpp:838] Failed to recv on socket 437 to peer '36.72.23.53:37450': Decoder error
I1221 23:46:45.929303 375 authenticatee.cpp:213] Received SASL authentication mechanisms: CRAM-MD5
I1221 23:46:45.929327 375 authenticatee.cpp:239] Attempting to authenticate with mechanism 'CRAM-MD5'
I1221 23:46:45.929982 383 authenticatee.cpp:259] Received SASL authentication step
I1221 23:46:45.930640 386 authenticatee.cpp:299] Authentication success
I1221 23:46:45.930742 385 sched.cpp:513] Successfully authenticated with master [email protected]:5050
I1221 23:46:45.933359 377 sched.cpp:744] Framework registered with 195f7369-64e5-4bc5-a8ed-8b576994450a-11038
21/12/21 23:46:46 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
INFO:SparkMonitorKernel:Client Connected ('127.0.0.1', 35472)
INFO:SparkMonitorKernel:SparkMonitor comm opened from frontend.
[I 2021-12-21 23:46:49.219 SingleUserNotebookApp log:189] 200 GET /user/username/api/terminals?1640130409107 ([email protected]) 1.17ms
[I 2021-12-21 23:46:49.220 SingleUserNotebookApp log:189] 200 GET /user/username/api/sessions?1640130409109 ([email protected]) 1.04ms
[I 2021-12-21 23:46:49.222 SingleUserNotebookApp log:189] 200 GET /user/username/api/kernels?1640130409111 ([email protected]) 0.99ms
21/12/21 23:46:51 ERROR JupyterSparkMonitorListener: Exception sending socket message:
java.net.SocketException: Broken pipe (Write failed)
at java.net.SocketOutputStream.socketWrite0(Native Method)
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111)
at java.net.SocketOutputStream.write(SocketOutputStream.java:155)
at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221)
at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:291)
at sun.nio.cs.StreamEncoder.implFlush(StreamEncoder.java:295)
at sun.nio.cs.StreamEncoder.flush(StreamEncoder.java:141)
at java.io.OutputStreamWriter.flush(OutputStreamWriter.java:229)
at sparkmonitor.listener.JupyterSparkMonitorListener.send(CustomListener.scala:54)
at sparkmonitor.listener.JupyterSparkMonitorListener.onExecutorAdded(CustomListener.scala:652)
at org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:63)
at org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28)
at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:115)
at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:99)
at org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:105)
at org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:105)
at scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:100)
at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:96)
at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1319)
at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:96)
I'm experiencing this issue even with a minimal configuration such as this:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .config('spark.extraListeners', 'sparkmonitor.listener.JupyterSparkMonitorListener') \
    .config('spark.driver.extraClassPath', '/source/virtualenv_run_jupyter/lib/python3.7/site-packages/sparkmonitor/listener_2.12.jar:/opt/spark/extra_jars/guava-27.0.1-jre.jar') \
    .getOrCreate()
Additional Details
We are using the following Jupyter related packages/dependencies in our containers:
ipykernel==5.3.4
ipython==7.20.0
jupyter-client==6.1.5
jupyter-core==4.7.1
jupyter-packaging==0.10.2
jupyter-server==1.7.0
jupyter-server-proxy==1.6.0
jupyter-telemetry==0.1.0
jupyterhub==1.4.2
jupyterlab==3.0.16
jupyterlab-pygments==0.1.2
jupyterlab-server==2.5.2
jupyterlab-widgets==1.0.0
nbconvert==5.5.0
nbformat==5.1.3
traitlets==4.3.3
Any assistance with troubleshooting this issue would be greatly appreciated. A lot of our users really like Sparkmonitor! Please let me know if I can provide any other details about our setup that would be helpful. Thank you!
Is there any update on this issue? We have observed a similar problem when using Jupyter Enterprise Gateway to spawn pods on Kubernetes for Jupyter Notebook kernels: no "comm opened" warning, and a broken-pipe error on restart. We worked around it by introducing a sleep in JupyterSparkMonitorListener (CustomListener.scala) before the socket connection is established, to allow the frontend to register before the listener connects to the kernel.
hi @krishnan-r I think the issue is somewhere around the "server socket" handling that happens in the "run()" method: https://github.com/swan-cern/sparkmonitor/blob/master/sparkmonitor/kernelextension.py#L103
A broken-pipe error could occur because the server / TCP listener socket closed the client connection and there is no logic on the client side to catch this and re-establish the connection. (There is also room for improvement in the Scala client code.)
I'm still unclear as to why the "comm" issue is happening. I am planning to look into this further, but if you have any pointers, please let me know. Thanks!
Apologies for the late response. I'm able to reproduce this locally when doing Restart Kernel and Run All Cells..., and I think it might be a race condition affecting other scenarios as well.
It's happening because the execute requests from the frontend reach the kernel before the frontend extension can connect to the kernel comm. This causes Spark to start the listener in the JVM, which sends messages on the socket to the kernel. The kernel side of the socket tries to forward the message to the frontend on a non-existent comm object, raises an exception, and no longer reads the socket, causing the broken pipe. (The error message in this code path could also be improved.)
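For illustration only, a kernel-side guard might look something like the sketch below. The class name, send method, and comm attribute come from the traceback above; the _pending buffer and comm_opened hook are hypothetical and not part of the actual extension.

class ScalaMonitor:
    """Sketch only: mirrors the class named in the traceback above."""

    def __init__(self):
        # Hypothetical buffer for messages produced before the frontend opens the comm.
        self._pending = []

    def comm_opened(self, comm):
        # Hypothetical hook, called once the frontend opens the comm; flush anything queued.
        self.comm = comm
        while self._pending:
            self.comm.send(self._pending.pop(0))

    def send(self, msg):
        comm = getattr(self, 'comm', None)
        if comm is None:
            # Frontend not connected yet: queue instead of raising AttributeError,
            # so the socket-reading thread keeps running and the JVM side does not
            # hit a broken pipe.
            self._pending.append(msg)
        else:
            comm.send(msg)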
One possible solution I think would be to make the kernel connect to the frontend via comm instead of the frontend to the kernel. This way we can block and ensure a connection before an execute request with spark application code is executed.
Alternatively, we need an extension point on the frontend, earlier than notebookPanel.sessionContext.session.kernel.statusChanged, to open a comm before other code can run.
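A minimal sketch of the kernel-opens-the-comm direction, using ipykernel's Comm API and assuming the frontend has registered a matching comm target (the target name 'SparkMonitor' here is an assumption, not necessarily what the extension uses):

from ipykernel.comm import Comm

def open_comm_to_frontend():
    # Creating a Comm on the kernel side sends a comm_open message to the frontend;
    # data only flows once the frontend has registered a matching comm target.
    comm = Comm(target_name='SparkMonitor')

    def on_msg(msg):
        # Messages from the frontend extension arrive here.
        print('frontend:', msg['content']['data'])

    comm.on_msg(on_msg)
    return comm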
This possibly never occurred in our deployment because we have an additional step on a UI where the user clicks a button to start the spark context, which only happens after a few seconds.
I will put together a draft PR with a fix soon
I've successfully resolved the same problem, and it came down to the browser cache!
The JupyterLab log didn't show "comm opened from frontend" until I cleared the browser cache.
INFO:SparkMonitorKernel:Starting Kernel Extension
INFO:SparkMonitorKernel:Socket Listening on port 33415
INFO:SparkMonitorKernel:Starting socket thread, going to accept
INFO:SparkMonitorKernel:33415
INFO:SparkMonitorKernel:Starting Kernel Extension
INFO:SparkMonitorKernel:Socket Listening on port 52331
INFO:SparkMonitorKernel:Starting socket thread, going to accept
INFO:SparkMonitorKernel:52331
INFO:SparkMonitorKernel:Starting Kernel Extension
INFO:SparkMonitorKernel:Socket Listening on port 45483
INFO:SparkMonitorKernel:Starting socket thread, going to accept
INFO:SparkMonitorKernel:45483
INFO:SparkMonitorKernel:Adding jar from /root/miniconda3/envs/py37/lib/python3.7/site-packages/sparkmonitor/listener_2.12.jar
After I cleared the browser cache and refreshed the page, the JupyterLab log says the comm opened, and the Sparkmonitor panel shows up! 🎉
INFO:SparkMonitorKernel:SparkMonitor comm opened from frontend.
Hi, we also experience the same issue. Do you have any ETA on when it is going to be fixed?
Seeing this as well, although I did not have to run the Spark session creation quickly to trigger it, which leads me to believe it might not be a race condition (at least not in the form I saw).
In the console of the browser I see the following:
JupyterLab SparkMonitor is activated!
Plugin `jupyterlab_sparkmonitor` failed to activate
TypeError: Cannot use 'in' operator to search for 'rank' in false
...
SparkMonitor: Starting Comm with kernel.
SparkMonitor: Starting Comm with kernel.
Uncaught (in promise) TypeError: Cannot read properties of undefined (reading 'connect')
The second error is coming from here. I can't tell where the first one is coming from.
@diocas @etejedor Any idea what is going on here?
Versions:
- JupyterLab: 3.4.8
- SparkMonitor: 3.0.2
Downgrading to sparkmonitor==2.1.1 fixes the issue on my end. After seeing the following in the browser console, it is safe to start the Spark session:
SparkMonitor: Starting Comm with kernel
SparkMonitor: Connection with comms established
Note: the race condition still exists, but the JS / TS plugin crashing is no longer an issue.
Oh wow, okay, so this is more finicky than I thought.
- The browser console logs do not seem to be a reliable signal that the Python kernel extension is actually connected and has the `comm` attribute.
- It seems, fairly consistently, that if I shut down the kernel after a connection has been established, the frontend will fail to re-connect.
There are definitely still some other oddities floating around; I can't pin it down 100%.
Currently testing on jupyterlab==3.6.7 and sparkmonitor==2.1.1.
And sparkmonitor==3.0.2 is still failing with the issues mentioned above.
Here is some code to check whether the comm has been established; at least this gives some user feedback and prevents the extension from crashing and becoming unusable afterwards.
import time
from sparkmonitor import kernelextension

TIMEOUT = 60
INC = 10
count = 0
while not hasattr(kernelextension.monitor, 'comm'):
    if count == 0:
        print("Waiting for sparkmonitor connections to be established...")
    elif count == 30:
        print("This is taking longer than usual... maybe try:")
        print("\t1. Kernel > Shut Down Kernel")
        print("\t2. Refresh your browser then re-run the cell.")
    if count >= TIMEOUT:
        raise RuntimeError("Sparkmonitor connections failed to establish!")
    time.sleep(INC)
    count += INC
print("Sparkmonitor connections established!")