
AttributeError: 'ScalaMonitor' object has no attribute 'comm'

nlspoon opened this issue 3 years ago • 8 comments

Summary of Issue

Note - Related container log error: ERROR JupyterSparkMonitorListener: Exception sending socket message:

Hello! We are running a JupyterHub service that spawns Kubernetes pods for single-user servers. We also run our JupyterHub proxy as a separate service. We've been experiencing a very persistent, but flaky, issue with the Sparkmonitor extension. When creating a spark session, I am seeing the following error: AttributeError: 'ScalaMonitor' object has no attribute 'comm'. However, this does not happen consistently. Oftentimes I am able to create a session where the monitor works, but then if I restart the kernel, the monitor is broken with the above error (or vice versa).

For example, on my first attempt at creating a spark session, the extension works fine. Container log:

INFO:SparkMonitorKernel:Starting Kernel Extension
INFO:SparkMonitorKernel:Socket Listening on port 59845
INFO:SparkMonitorKernel:Starting socket thread, going to accept
INFO:SparkMonitorKernel:59845
INFO:SparkMonitorKernel:Adding jar from /work/venv_core_ml_kernel/lib/python3.7/site-packages/sparkmonitor/listener_2.12.jar 
[I 2021-12-21 23:45:52.892 SingleUserNotebookApp log:189] 101 GET /user/username/api/kernels/e5800417-4831-4bb4-9eec-85d7fb16b388/channels?session_id=aca66ed3-e383-4dfd-bd21-eae11e96ccb2 ([email protected]) 1877.03ms
[I 2021-12-21 23:45:53.002 SingleUserNotebookApp log:189] 101 GET /user/username/api/kernels/e5800417-4831-4bb4-9eec-85d7fb16b388/channels?session_id=26868351-57cd-43c6-b14b-c85bd1842316 ([email protected]) 2.18ms
[I 2021-12-21 23:45:53.147 SingleUserNotebookApp log:189] 101 GET /user/username/api/kernels/e5800417-4831-4bb4-9eec-85d7fb16b388/channels?session_id=0ccb7d08-e61b-44e6-a08c-57a36d2797ed ([email protected]) 2.12ms
[I 2021-12-21 23:45:55.711 SingleUserNotebookApp log:189] 200 GET /user/username/api/terminals?1640130355597 ([email protected]) 1.13ms
[I 2021-12-21 23:45:56.253 SingleUserNotebookApp log:189] 200 GET /user/username/api/sessions?1640130356141 ([email protected]) 1.09ms
[I 2021-12-21 23:45:56.254 SingleUserNotebookApp log:189] 200 GET /user/username/api/kernels?1640130356142 ([email protected]) 1.01ms
INFO:SparkMonitorKernel:SparkMonitor comm opened from frontend.
INFO:SparkMonitorKernel:SparkMonitor comm opened from frontend.
[I 2021-12-21 23:46:03.074 SingleUserNotebookApp log:189] 200 GET /user/username/api/contents?content=1&1640130360080 ([email protected]) 2874.67ms
[I 2021-12-21 23:46:05.820 SingleUserNotebookApp log:189] 200 GET /user/username/api/terminals?1640130365706 ([email protected]) 1.36ms
[I 2021-12-21 23:46:05.981 SingleUserNotebookApp log:189] 200 GET /user/username/api/kernelspecs?1640130365868 ([email protected]) 2.78ms
[I 2021-12-21 23:46:07.068 SingleUserNotebookApp log:189] 200 GET /user/username/api/sessions?1640130366954 ([email protected]) 1.18ms
[I 2021-12-21 23:46:07.070 SingleUserNotebookApp log:189] 200 GET /user/username/api/kernels?1640130366956 ([email protected]) 1.06ms
[I 2021-12-21 23:46:16.011 SingleUserNotebookApp log:189] 200 GET /user/username/api/contents?content=1&1640130373104 ([email protected]) 2795.89ms
[I 2021-12-21 23:46:16.014 SingleUserNotebookApp log:189] 200 GET /user/username/api/terminals?1640130375820 ([email protected]) 0.98ms
[I 2021-12-21 23:46:17.181 SingleUserNotebookApp log:189] 200 GET /user/username/api/sessions?1640130377069 ([email protected]) 1.22ms
[I 2021-12-21 23:46:17.182 SingleUserNotebookApp log:189] 200 GET /user/username/api/kernels?1640130377071 ([email protected]) 0.90ms
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
I1221 23:46:20.384598   242 sched.cpp:232] Version: 1.7.2
I1221 23:46:20.385978   225 sched.cpp:336] New master detected at [email protected]:5050
I1221 23:46:20.386255   225 sched.cpp:401] Authenticating with master [email protected]:5050
I1221 23:46:20.386270   225 sched.cpp:408] Using default CRAM-MD5 authenticatee
I1221 23:46:20.386425   234 authenticatee.cpp:97] Initializing client SASL
I1221 23:46:20.387233   234 authenticatee.cpp:121] Creating new client SASL connection
W1221 23:46:20.397209   241 process.cpp:838] Failed to recv on socket 437 to peer '36.72.23.53:34312': Decoder error
I1221 23:46:20.398273   229 authenticatee.cpp:213] Received SASL authentication mechanisms: CRAM-MD5
I1221 23:46:20.398306   229 authenticatee.cpp:239] Attempting to authenticate with mechanism 'CRAM-MD5'
I1221 23:46:20.398959   238 authenticatee.cpp:259] Received SASL authentication step
I1221 23:46:20.399524   239 authenticatee.cpp:299] Authentication success
I1221 23:46:20.399653   236 sched.cpp:513] Successfully authenticated with master [email protected]:5050
I1221 23:46:20.401082   230 sched.cpp:744] Framework registered with 195f7369-64e5-4bc5-a8ed-8b576994450a-11035
21/12/21 23:46:20 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
INFO:SparkMonitorKernel:Client Connected ('127.0.0.1', 35478)
[I 2021-12-21 23:46:26.141 SingleUserNotebookApp log:189] 200 GET /user/username/api/terminals?1640130386026 ([email protected]) 1.00ms
[I 2021-12-21 23:46:28.410 SingleUserNotebookApp log:189] 200 GET /user/username/api/contents?content=1&1640130386068 ([email protected]) 2230.40ms
[I 2021-12-21 23:46:28.416 SingleUserNotebookApp log:189] 200 GET /user/username/api/sessions?1640130387180 ([email protected]) 3.12ms
[I 2021-12-21 23:46:28.417 SingleUserNotebookApp log:189] 200 GET /user/username/api/kernels?1640130387182 ([email protected]) 3.31ms
[I 2021-12-21 23:46:39.097 SingleUserNotebookApp log:189] 200 GET /user/username/api/contents?content=1&1640130396071 ([email protected]) 2914.93ms
[I 2021-12-21 23:46:39.100 SingleUserNotebookApp log:189] 200 GET /user/username/api/terminals?1640130396136 ([email protected]) 1.93ms
[I 2021-12-21 23:46:39.101 SingleUserNotebookApp log:189] 200 GET /user/username/api/sessions?1640130398473 ([email protected]) 2.26ms
[I 2021-12-21 23:46:39.101 SingleUserNotebookApp log:189] 200 GET /user/username/api/kernels?1640130398474 ([email protected]) 2.46ms
/source/virtualenv_run_jupyter/lib/python3.7/site-packages/jupyter_client/manager.py:358: FutureWarning: Method cleanup(connection_file=True) is deprecated, use cleanup_resources(restart=False).
  FutureWarning)
[I 2021-12-21 23:46:39.217 SingleUserNotebookApp multikernelmanager:534] Kernel restarted: e5800417-4831-4bb4-9eec-85d7fb16b388

But if I restart the kernel and try again, the extension fails with AttributeError: 'ScalaMonitor' object has no attribute 'comm'. (Note: it does not consistently follow this pattern of working the first time and failing the second. Often the first session fails and then it works after a kernel restart, or it fails three times in a row and then works.)

INFO:SparkMonitorKernel:Client Connected ('127.0.0.1', 35472)
Exception in thread Thread-8:
Traceback (most recent call last):
  File "/usr/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/work/venv_core_ml_kernel/lib/python3.7/site-packages/sparkmonitor/kernelextension.py", line 126, in run
    self.onrecv(msg)
  File "/work/venv_core_ml_kernel/lib/python3.7/site-packages/sparkmonitor/kernelextension.py", line 145, in onrecv
    'msg': msg
  File "/work/venv_core_ml_kernel/lib/python3.7/site-packages/sparkmonitor/kernelextension.py", line 223, in sendToFrontEnd
    monitor.send(msg)
  File "/work/venv_core_ml_kernel/lib/python3.7/site-packages/sparkmonitor/kernelextension.py", line 57, in send
    self.comm.send(msg)
AttributeError: 'ScalaMonitor' object has no attribute 'comm'

Container log on failure:

[I 2021-12-21 23:46:39.219 SingleUserNotebookApp log:189] 200 POST /user/username/api/kernels/e5800417-4831-4bb4-9eec-85d7fb16b388/restart?1640130391290 ([email protected]) 7817.75ms
[I 2021-12-21 23:46:39.388 SingleUserNotebookApp log:189] 101 GET /user/username/api/kernels/e5800417-4831-4bb4-9eec-85d7fb16b388/channels?session_id=26868351-57cd-43c6-b14b-c85bd1842316 ([email protected]) 2.70ms
INFO:SparkMonitorKernel:Starting Kernel Extension
INFO:SparkMonitorKernel:Socket Listening on port 53939
INFO:SparkMonitorKernel:Starting socket thread, going to accept
INFO:SparkMonitorKernel:53939
INFO:SparkMonitorKernel:Adding jar from /work/venv_core_ml_kernel/lib/python3.7/site-packages/sparkmonitor/listener_2.12.jar 
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
I1221 23:46:45.923640   389 sched.cpp:232] Version: 1.7.2
I1221 23:46:45.925024   378 sched.cpp:336] New master detected at [email protected]:5050
I1221 23:46:45.925308   378 sched.cpp:401] Authenticating with master [email protected]:5050
I1221 23:46:45.925323   378 sched.cpp:408] Using default CRAM-MD5 authenticatee
I1221 23:46:45.925534   382 authenticatee.cpp:97] Initializing client SASL
I1221 23:46:45.926290   382 authenticatee.cpp:121] Creating new client SASL connection
W1221 23:46:45.928249   388 process.cpp:838] Failed to recv on socket 437 to peer '36.72.23.53:37450': Decoder error
I1221 23:46:45.929303   375 authenticatee.cpp:213] Received SASL authentication mechanisms: CRAM-MD5
I1221 23:46:45.929327   375 authenticatee.cpp:239] Attempting to authenticate with mechanism 'CRAM-MD5'
I1221 23:46:45.929982   383 authenticatee.cpp:259] Received SASL authentication step
I1221 23:46:45.930640   386 authenticatee.cpp:299] Authentication success
I1221 23:46:45.930742   385 sched.cpp:513] Successfully authenticated with master [email protected]:5050
I1221 23:46:45.933359   377 sched.cpp:744] Framework registered with 195f7369-64e5-4bc5-a8ed-8b576994450a-11038
21/12/21 23:46:46 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
INFO:SparkMonitorKernel:Client Connected ('127.0.0.1', 35472)
INFO:SparkMonitorKernel:SparkMonitor comm opened from frontend.
[I 2021-12-21 23:46:49.219 SingleUserNotebookApp log:189] 200 GET /user/username/api/terminals?1640130409107 ([email protected]) 1.17ms
[I 2021-12-21 23:46:49.220 SingleUserNotebookApp log:189] 200 GET /user/username/api/sessions?1640130409109 ([email protected]) 1.04ms
[I 2021-12-21 23:46:49.222 SingleUserNotebookApp log:189] 200 GET /user/username/api/kernels?1640130409111 ([email protected]) 0.99ms
21/12/21 23:46:51 ERROR JupyterSparkMonitorListener: Exception sending socket message: 
java.net.SocketException: Broken pipe (Write failed)
        at java.net.SocketOutputStream.socketWrite0(Native Method)
        at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111)
        at java.net.SocketOutputStream.write(SocketOutputStream.java:155)
        at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221)
        at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:291)
        at sun.nio.cs.StreamEncoder.implFlush(StreamEncoder.java:295)
        at sun.nio.cs.StreamEncoder.flush(StreamEncoder.java:141)
        at java.io.OutputStreamWriter.flush(OutputStreamWriter.java:229)
        at sparkmonitor.listener.JupyterSparkMonitorListener.send(CustomListener.scala:54)
        at sparkmonitor.listener.JupyterSparkMonitorListener.onExecutorAdded(CustomListener.scala:652)
        at org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:63)
        at org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28)
        at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
        at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
        at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:115)
        at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:99)
        at org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:105)
        at org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:105)
        at scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23)
        at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
        at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:100)
        at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:96)
        at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1319)
        at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:96)

I'm experiencing this issue even with a minimal configuration such as this:

from pyspark.sql import SparkSession
spark = SparkSession.builder\
        .config('spark.extraListeners', 'sparkmonitor.listener.JupyterSparkMonitorListener')\
        .config('spark.driver.extraClassPath', '/source/virtualenv_run_jupyter/lib/python3.7/site-packages/sparkmonitor/listener_2.12.jar:/opt/spark/extra_jars/guava-27.0.1-jre.jar')\
        .getOrCreate()

Additional Details

We are using the following Jupyter related packages/dependencies in our containers:

ipykernel==5.3.4
ipython==7.20.0
jupyter-client==6.1.5
jupyter-core==4.7.1
jupyter-packaging==0.10.2
jupyter-server==1.7.0
jupyter-server-proxy==1.6.0
jupyter-telemetry==0.1.0
jupyterhub==1.4.2
jupyterlab==3.0.16
jupyterlab-pygments==0.1.2
jupyterlab-server==2.5.2
jupyterlab-widgets==1.0.0
nbconvert==5.5.0
nbformat==5.1.3
traitlets==4.3.3

Any assistance with troubleshooting this issue would be greatly appreciated. A lot of our users really like the Sparkmonitor! Please let me know if I can provide any other details about our setup that would be helpful. Thank you!

nlspoon avatar Jan 04 '22 23:01 nlspoon

Is there any update on this issue? We have observed a similar issue when using Jupyter Enterprise Gateway to spawn pods on Kubernetes for Jupyter Notebook kernels: no "comm opened" message, and a broken pipe error on restart. We worked around it by introducing a sleep in JupyterSparkMonitorListener (CustomListener.scala) before the socket connection is established, to allow the frontend to register before the listener connects to the kernel.

akhileshram avatar Feb 24 '22 06:02 akhileshram

Hi @krishnan-r, I think the issue is somewhere around the "server socket" handling that happens in the "run()" method: https://github.com/swan-cern/sparkmonitor/blob/master/sparkmonitor/kernelextension.py#L103

  1. A broken pipe error could occur because the server / TCP listener socket closed the client connection and there is no logic on the client side to catch this and re-establish the connection. (There is also scope for improvement in the Scala client code.)

  2. I'm still unclear as to why the "comm" issue is happening. I am planning to look into this further, but if you have any pointers please let me know. Thanks!
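The reconnect gap in point 1 could be closed on the kernel side with an accept loop that treats a dropped client as a normal event rather than the end of the thread. A minimal, self-contained sketch of that idea — `serve_forever`, `on_message`, and the newline-delimited framing here are illustrative assumptions, not sparkmonitor's actual protocol:

```python
# Hedged sketch: a kernel-side accept loop that survives client disconnects
# and simply re-accepts, so a dropped listener connection is not fatal.
import socket
import threading


def serve_forever(server, on_message, stop):
    """Accept clients in a loop; a dropped connection just triggers re-accept."""
    server.settimeout(0.5)  # wake periodically to check the stop event
    while not stop.is_set():
        try:
            client, _addr = server.accept()
        except socket.timeout:
            continue
        client.settimeout(None)  # block on recv regardless of listener timeout
        with client:
            buffer = b""
            try:
                while True:
                    chunk = client.recv(4096)
                    if not chunk:
                        break  # client closed cleanly; go back to accept()
                    buffer += chunk
                    # Messages are newline-delimited in this sketch.
                    while b"\n" in buffer:
                        line, buffer = buffer.split(b"\n", 1)
                        on_message(line.decode())
            except OSError:
                pass  # abrupt disconnect; also go back to accept()
```

With a loop like this, a restarted Scala listener can simply reconnect instead of hitting a dead socket.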

rahul26goyal avatar Mar 05 '22 12:03 rahul26goyal

Apologies for the late response. I'm able to reproduce this locally when doing Restart Kernel and Run All Cells..., and I think it might be a race condition affecting other scenarios as well.

It's happening because the execute requests from the frontend reach the kernel before the frontend extension can connect to the kernel comm. This causes spark to start the listener in the JVM, which sends messages on the socket to the kernel. The kernel side of the socket tries to forward the message to the frontend on a non-existent comm object, raises an exception, and stops reading the socket, causing a broken pipe. (The code here could use some improvement for the error message.)
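One way to keep the socket thread alive through this race would be to queue frontend-bound messages until the comm exists and flush them once it opens. A minimal sketch under that assumption — `BufferedMonitor` and `comm_opened` are hypothetical names, not sparkmonitor's actual classes:

```python
# Hedged sketch of the buffering idea: queue messages for the frontend until
# the comm is open, then flush, so an early message no longer raises
# AttributeError and kills the socket-reading thread.
class BufferedMonitor:
    def __init__(self):
        self.comm = None    # set once the frontend opens its comm
        self._pending = []  # messages that arrived before the comm existed

    def comm_opened(self, comm):
        """Comm-open handler: record the comm and flush anything queued."""
        self.comm = comm
        for msg in self._pending:
            self.comm.send(msg)
        self._pending.clear()

    def send(self, msg):
        if self.comm is None:
            self._pending.append(msg)  # queue instead of raising AttributeError
        else:
            self.comm.send(msg)
```

Early listener messages would then show up in the frontend as soon as the comm connects, instead of crashing the extension.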

One possible solution I think would be to make the kernel connect to the frontend via comm instead of the frontend to the kernel. This way we can block and ensure a connection before an execute request with spark application code is executed.
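The blocking idea could be sketched with a threading.Event that the comm-open handler sets and that any Spark-related execution waits on first. All names here are illustrative, not part of sparkmonitor:

```python
import threading

# Hedged sketch of the blocking handshake: Spark-related code waits on an
# Event that the comm-open handler sets once the frontend is connected.
comm_ready = threading.Event()


def on_comm_opened(comm):
    """Would be registered as the kernel's comm target handler."""
    comm_ready.set()


def wait_for_frontend(timeout=30.0):
    """Block until the frontend comm is open; return False on timeout."""
    return comm_ready.wait(timeout)
```

The kernel extension could call something like `wait_for_frontend()` before forwarding the first listener message, failing loudly with a clear error if the frontend never connects.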

Or alternatively we need an extension point on the frontend which is earlier than notebookPanel.sessionContext.session.kernel.statusChanged to open a comm before other code can run.

This possibly never occurred in our deployment because we have an additional step on a UI where the user clicks a button to start the spark context, which only happens after a few seconds.

I will put together a draft PR with a fix soon

krishnan-r avatar Mar 25 '22 08:03 krishnan-r

I've successfully resolved the same problem, and it was the browser cache!

The JupyterLab log didn't show "comm opened from frontend" until I cleared the browser cache!

INFO:SparkMonitorKernel:Starting Kernel Extension
INFO:SparkMonitorKernel:Socket Listening on port 33415
INFO:SparkMonitorKernel:Starting socket thread, going to accept
INFO:SparkMonitorKernel:33415
INFO:SparkMonitorKernel:Starting Kernel Extension
INFO:SparkMonitorKernel:Socket Listening on port 52331
INFO:SparkMonitorKernel:Starting socket thread, going to accept
INFO:SparkMonitorKernel:52331
INFO:SparkMonitorKernel:Starting Kernel Extension
INFO:SparkMonitorKernel:Socket Listening on port 45483
INFO:SparkMonitorKernel:Starting socket thread, going to accept
INFO:SparkMonitorKernel:45483
INFO:SparkMonitorKernel:Adding jar from /root/miniconda3/envs/py37/lib/python3.7/site-packages/sparkmonitor/listener_2.12.jar 

After I cleared the browser cache and refreshed the page, the JupyterLab log says the comm opened, and the sparkmonitor panel shows up! 🎉

INFO:SparkMonitorKernel:SparkMonitor comm opened from frontend.

liangchen-harold avatar May 26 '22 06:05 liangchen-harold

Hi, we also experience the same issue. Do you have any ETA on when it is going to be fixed?

inteloid avatar Jan 15 '23 05:01 inteloid

Seeing this as well. However, I did not have to create the spark session quickly to trigger it, which led me to believe that it might not be a race condition (at least in the way I saw it).

In the console of the browser I see the following:

JupyterLab SparkMonitor is activated!
Plugin `jupyterlab_sparkmonitor` failed to activate
TypeError: Cannot use 'in' operator to search for 'rank' in false
...
SparkMonitor: Starting Comm with kernel.
SparkMonitor: Starting Comm with kernel.
Uncaught (in promise) TypeError: Cannot read properties of undefined (reading 'connect')

The second error is coming from here. I can't tell where the first one is coming from.

@diocas @etejedor Any idea what is going on here?


Versions:

  • JupyterLab: 3.4.8
  • SparkMonitor: 3.0.2

CarterFendley avatar Apr 06 '24 03:04 CarterFendley

Downgrading to sparkmonitor==2.1.1 fixes the issue on my end. After seeing the following in the browser console, it is safe to start the spark session:

SparkMonitor: Starting Comm with kernel
SparkMonitor: Connection with comms established

Note: The race condition still exists, but the JS / TS plugin no longer crashes.

CarterFendley avatar Apr 06 '24 03:04 CarterFendley

Oh wow, okay, so this is more finicky than I thought.

  1. The browser console logs do not seem to be a reliable signal that the python kernel extension is actually connected and has the comm attribute.
  2. It seems, fairly consistently, that if I shut down the kernel after the connection has been established, the frontend will fail to re-connect.

There are definitely still some other oddities floating around that I can't pin down 100%.

Currently testing on jupyterlab==3.6.7 and sparkmonitor==2.1.1.

And sparkmonitor==3.0.2 is still failing with the issues mentioned above.


Here is some code to check whether the comms have been established. At least this gives some user feedback and prevents the extension from crashing and becoming unusable afterwards.

import time
from sparkmonitor import kernelextension

TIMEOUT = 60  # give up after this many seconds
INC = 10      # polling interval in seconds

count = 0
# kernelextension.monitor gains a `comm` attribute once the frontend connects.
while not hasattr(kernelextension.monitor, 'comm'):
    if count == 0:
        print("Waiting for sparkmonitor connections to be established...")
    elif count == 30:
        print("This is taking longer than usual... maybe try:")
        print("\t1. Kernel > Shut Down Kernel")
        print("\t2. Refresh your browser then re-run the cell.")

    if count >= TIMEOUT:
        raise RuntimeError("Sparkmonitor connections failed to establish!")

    time.sleep(INC)
    count += INC

print("Sparkmonitor connections established!")

CarterFendley avatar Apr 09 '24 04:04 CarterFendley