enterprise_gateway
Scala yarn cluster kernelspec should use 'eager' spark context initialization
I'm seeing an issue with Scala (Toree) kernels running in YARN cluster mode: when no action is taken for more than 10 minutes after the kernel has started, the YARN application is terminated, despite the fact that we're setting spark.yarn.am.waitTime=1d. This does not happen with Python or R kernels in cluster mode because we initialize the Spark context in the background (within the launcher). In Toree, the current 'lazy' initialization does not trigger Spark context creation until the first action in the kernel. Granted, 10+ minutes is somewhat impractical, since a kernel's startup is usually followed by some activity within the first 10 minutes. However, if we were to entertain a "prespawned kernels" feature (see #374), of which Toree would be a prime candidate, the kernels could sit for some time before any activity.
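To make the lazy vs. eager distinction concrete, here is a minimal Python sketch of the "initialize in the background" idea. It is not the actual launcher or Toree code: `EagerContext` and `fake_spark_context` are hypothetical names, and the factory is a stand-in so the snippet runs without Spark installed.

```python
import threading
import time


class EagerContext:
    """Starts building a context in a background thread as soon as it is constructed."""

    def __init__(self, factory):
        self._factory = factory
        self._context = None
        self._error = None
        self._ready = threading.Event()
        # Kick off creation immediately instead of waiting for the first use.
        self._thread = threading.Thread(target=self._create, daemon=True)
        self._thread.start()

    def _create(self):
        try:
            self._context = self._factory()
        except Exception as exc:
            # Remember the failure so get() can re-raise it to the caller.
            self._error = exc
        finally:
            self._ready.set()

    def get(self, timeout=None):
        """Block until background creation has finished, then return the context."""
        if not self._ready.wait(timeout):
            raise TimeoutError("context was not created in time")
        if self._error is not None:
            raise self._error
        return self._context


def fake_spark_context():
    # Hypothetical stand-in for the real (slow) context creation.
    time.sleep(0.1)
    return {"app": "demo"}


holder = EagerContext(fake_spark_context)  # creation begins here, with no user action
ctx = holder.get(timeout=5)                # first use only waits for it to finish
```

With a lazy scheme the factory would only run inside `get()`, so a kernel that nobody touches would never create its context. That is the situation described above.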
I've searched the system config files and the Spark sources, and googled, to see if there's a max value for spark.yarn.am.waitTime, although the varying times would contradict that theory anyway. I'm not sure what's causing YARN to decide to terminate despite the waitTime value of a day.
We can discuss this further since it's probably not urgent, but I just wanted to get this on the books. It's a simple change, should that be what we want.
I've attached a screenshot of a single Toree kernel restarted 3 additional times after 10, 13 and 13 minutes respectively.

Just confirmed that this will occur even if there's work being done in the notebook that doesn't include the spark context (so its creation is not triggered).
Here are the EG log messages. The automatic (3 second) polling detects the kernel has gone away, so it attempts restarts.
Kernel starts, work performed...
[D 2018-11-15 09:05:35.535 EnterpriseGatewayApp] activity on cb7fcf57-8a78-44df-85cd-4ca9071e4412: execute_input
[D 2018-11-15 09:05:38.252 EnterpriseGatewayApp] activity on cb7fcf57-8a78-44df-85cd-4ca9071e4412: display_data
[D 2018-11-15 09:05:38.305 EnterpriseGatewayApp] activity on cb7fcf57-8a78-44df-85cd-4ca9071e4412: execute_result
[D 2018-11-15 09:05:38.311 EnterpriseGatewayApp] activity on cb7fcf57-8a78-44df-85cd-4ca9071e4412: status (idle)
10 minutes later it detects the YARN application is missing and restarts...
[I 2018-11-15 09:15:40.842 EnterpriseGatewayApp] KernelRestarter: restarting kernel (1/5), keep random ports
[W 181115 09:15:40 handlers:472] kernel cb7fcf57-8a78-44df-85cd-4ca9071e4412 restarted
[D 2018-11-15 09:15:40.843 EnterpriseGatewayApp] RemoteKernelManager.signal_kernel(9)
[D 2018-11-15 09:15:40.844 EnterpriseGatewayApp] YarnClusterProcessProxy.send_signal 9
[W 2018-11-15 09:15:40.861 EnterpriseGatewayApp] Termination of application 'application_1538065321075_0068' failed with exception: 'Response finished with status: 500. Details: {"RemoteException":{"exception":"WebApplicationException","message":"com.sun.jersey.api.MessageException: A message body reader for Java class org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.AppState, and Java type class org.apache.hadoop.yarn.server.resourcemanager.webapp.dao.AppState, and MIME media type application/octet-stream was not found.\nThe registered message body readers compatible with the MIME media type are:\napplication/octet-stream ->\n com.sun.jersey.core.impl.provider.entity.ByteArrayProvider\n com.sun.jersey.core.impl.provider.entity.FileProvider\n com.sun.jersey.core.impl.provider.entity.InputStreamProvider\n com.sun.jersey.core.impl.provider.entity.DataSourceProvider\n com.sun.jersey.core.impl.provider.entity.RenderedImageProvider\n*/* ->\n com.sun.jersey.core.impl.provider.entity.FormProvider\n com.sun.jersey.core.impl.provider.entity.StringProvider\n com.sun.jersey.core.impl.provider.entity.ByteArrayProvider\n com.sun.jersey.core.impl.provider.entity.FileProvider\n com.sun.jersey.core.impl.provider.entity.InputStreamProvider\n com.sun.jersey.core.impl.provider.entity.DataSourceProvider\n com.sun.jersey.core.impl.provider.entity.XMLJAXBElementProvider$General\n com.sun.jersey.core.impl.provider.entity.ReaderProvider\n com.sun.jersey.core.impl.provider.entity.DocumentProvider\n com.sun.jersey.core.impl.provider.entity.SourceProvider$StreamSourceReader\n com.sun.jersey.core.impl.provider.entity.SourceProvider$SAXSourceReader\n com.sun.jersey.core.impl.provider.entity.SourceProvider$DOMSourceReader\n com.sun.jersey.json.impl.provider.entity.JSONJAXBElementProvider$General\n com.sun.jersey.json.impl.provider.entity.JSONArrayProvider$General\n com.sun.jersey.json.impl.provider.entity.JSONObjectProvider$General\n 
com.sun.jersey.core.impl.provider.entity.XMLRootElementProvider$General\n com.sun.jersey.core.impl.provider.entity.XMLListElementProvider$General\n com.sun.jersey.core.impl.provider.entity.XMLRootObjectProvider$General\n com.sun.jersey.core.impl.provider.entity.EntityHolderReader\n com.sun.jersey.json.impl.provider.entity.JSONRootElementProvider$General\n com.sun.jersey.json.impl.provider.entity.JSONListElementProvider$General\n com.sun.jersey.json.impl.provider.entity.JacksonProviderProxy\n","javaClassName":"javax.ws.rs.WebApplicationException"}}'. Continuing...
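For anyone wanting to measure the gap from their own logs, here is a rough Python sketch that computes the time between the last activity line and the first restart line quoted above. The helper name `log_time` and the regex are my own. It assumes the `[L YYYY-MM-DD HH:MM:SS.mmm ...]` prefix seen in the EnterpriseGatewayApp lines and will not match the shorter `[W 181115 09:15:40 handlers:472]` style.

```python
import re
from datetime import datetime

# Matches the "[D 2018-11-15 09:05:38.311 " prefix used by the EnterpriseGatewayApp lines.
LOG_TS = re.compile(r"^\[[A-Z] (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) ")


def log_time(line):
    """Return the timestamp at the start of an EG log line as a datetime."""
    match = LOG_TS.match(line)
    if match is None:
        raise ValueError("unrecognized log line prefix: %r" % line[:40])
    return datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f")


# Two lines copied from the log excerpt above.
last_activity = (
    "[D 2018-11-15 09:05:38.311 EnterpriseGatewayApp] activity on "
    "cb7fcf57-8a78-44df-85cd-4ca9071e4412: status (idle)"
)
first_restart = (
    "[I 2018-11-15 09:15:40.842 EnterpriseGatewayApp] KernelRestarter: "
    "restarting kernel (1/5), keep random ports"
)

gap = log_time(first_restart) - log_time(last_activity)
print("idle time before restart:", gap)
```

For the lines above this works out to a little over 10 minutes (roughly 602.5 seconds), which matches the first restart interval mentioned earlier.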