
After updating RayDP, spark = raydp.init_spark() cannot be started

YeahNew opened this issue on Aug 24 '21 • 12 comments

I updated ray[all]==0.1.1 to ray[all]==1.5.0 and raydp==0.1.0 to raydp==0.3.0. After re-running the previous program, I found that Spark could not be started; reverting to the original versions works fine.

--------init code--------

ray.init(address='auto', _redis_password='5241590000000000')

Edage_data_PATH = args.full_graph_path
Node_data_PATH = args.full_graph_feat_path


# After initialize ray cluster, you can use the raydp api to get a spark session
app_name = "raydp and raysgd"
num_executors = 3
cores_per_executor = 45
memory_per_executor = "40GB"
spark = raydp.init_spark(app_name,
                         num_executors,
                         cores_per_executor,
                         memory_per_executor)

# # schema for Node_features_data
long_cols = list(range(0, 1))
float_cols = list(range(1, 1 + 14))
label_cols = list(range(1 + 14, 16))

long_fields = ['node_id' for i in long_cols]
float_fields = ['feat_%d' % i for i in float_cols]
label_fields = ['label' for i in label_cols]

schema1 = (long_fields +
          float_fields +
          label_fields)

schema2 = ["src_node_id","dst_node_id"]

# # Here we just use a subset of the training data
Node_features_data = spark.read.format("csv").option("header", "False") \
    .option("inferSchema", "true") \
    .load(Node_data_PATH) \
    .toDF(*schema1)

--------init code--------

The error message is as follows:

------------------error-------------
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2021-08-24 16:51:50,192 INFO RayAppMaster$RayAppMasterEndpoint [dispatcher-event-loop-7]: Registering app RayDP with DGL for abfusion_v2(Hotel dataset)
2021-08-24 16:51:50,196 INFO RayAppMaster$RayAppMasterEndpoint [dispatcher-event-loop-7]: Registered app RayDP with DGL for abfusion_v2(Hotel dataset) with ID app-20210824165150-0000
Exception in thread "dispatcher-event-loop-7" java.lang.NoSuchMethodError: io.ray.api.call.ActorCreator.setJvmOptions(Ljava/lang/String;)Lio/ray/api/call/ActorCreator;
    at org.apache.spark.raydp.RayExecutorUtils.createExecutorActor(RayExecutorUtils.java:45)
    at org.apache.spark.deploy.raydp.RayAppMaster$RayAppMasterEndpoint.org$apache$spark$deploy$raydp$RayAppMaster$RayAppMasterEndpoint$$requestNewExecutor(RayAppMaster.scala:227)
    at org.apache.spark.deploy.raydp.RayAppMaster$RayAppMasterEndpoint.$anonfun$schedule$1(RayAppMaster.scala:214)
    at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158)
    at org.apache.spark.deploy.raydp.RayAppMaster$RayAppMasterEndpoint.org$apache$spark$deploy$raydp$RayAppMaster$RayAppMasterEndpoint$$schedule(RayAppMaster.scala:213)
    at org.apache.spark.deploy.raydp.RayAppMaster$RayAppMasterEndpoint$$anonfun$receive$1.applyOrElse(RayAppMaster.scala:121)
    at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
    at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:203)
    at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
    at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
    at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
21/08/24 16:52:45 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
21/08/24 16:53:00 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
21/08/24 16:53:15 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
-------error-----------

The current version of the JDK is 1.8. Does it need to be updated?

YeahNew commented on Aug 24 '21

@YeahNew , JDK 1.8 is fine. It looks like a version mismatch issue as I saw a NoSuchMethodError. Just to check, did you upgrade Ray and RayDP on every node of your cluster?

carsonwang commented on Aug 24 '21
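One quick way to check for a per-node mismatch is to ask every machine in the running Ray cluster which versions it sees. The sketch below is only an illustration and assumes setuptools (pkg_resources) is available in each worker's Python environment; it samples nodes by launching many lightweight tasks rather than guaranteeing one task per node.

```python
# Hedged sketch: report the ray/raydp versions visible on the cluster's nodes.
import ray
import pkg_resources

ray.init(address="auto", _redis_password="5241590000000000")

@ray.remote(num_cpus=0)
def report_versions():
    import socket
    import pkg_resources as pr
    import ray as worker_ray
    return (socket.gethostname(),
            worker_ray.__version__,
            pr.get_distribution("raydp").version)

# Launch more tasks than nodes so every machine is likely to be sampled.
results = set(ray.get([report_versions.remote() for _ in range(50)]))
for host, ray_version, raydp_version in sorted(results):
    print(host, "ray", ray_version, "raydp", raydp_version)
```

If the printed versions differ between hosts, the jars shipped inside the RayDP wheel will differ as well, which can produce exactly this kind of NoSuchMethodError.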

> @YeahNew , JDK 1.8 is fine. It looks like a version mismatch issue as I saw a NoSuchMethodError. Just to check, did you upgrade Ray and RayDP on every node of your cluster?

Yes, I updated every node in my cluster.

YeahNew commented on Aug 24 '21

> @YeahNew , JDK 1.8 is fine. It looks like a version mismatch issue as I saw a NoSuchMethodError. Just to check, did you upgrade Ray and RayDP on every node of your cluster?

I have experimented with many versions. Only with raydp==0.1.1 and ray[all]==1.1.0/1.2.0 does the program run normally; other version combinations produce various errors.

YeahNew commented on Aug 24 '21

Hi @YeahNew, the latest stable version of RayDP supports Ray 1.3.0. Ray 1.4.0 changed the signature, which is why you see this error. Have you tried raydp 0.3.0 with Ray 1.3.0? If you want to use newer versions of Ray, you can also try our nightly versions.

kira-lin commented on Aug 25 '21
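Since the failures in this thread come down to which Ray version a given RayDP release was built against, it can help to fail fast before calling raydp.init_spark. The mapping below only encodes the combinations mentioned in this thread; it is an illustration, not an official compatibility matrix.

```python
# Hedged sketch: guard against ray/raydp combinations not verified in this thread.
import ray
import pkg_resources

KNOWN_GOOD = {
    "0.1.1": {"1.1.0", "1.2.0"},  # reported working earlier in this thread
    "0.3.0": {"1.3.0"},           # latest stable raydp at the time
}

raydp_version = pkg_resources.get_distribution("raydp").version
allowed = KNOWN_GOOD.get(raydp_version, set())
if ray.__version__ not in allowed:
    raise RuntimeError(
        "raydp %s has not been verified against ray %s here; expected one of %s"
        % (raydp_version, ray.__version__, sorted(allowed) or "unknown")
    )
```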

> Hi @YeahNew, the latest stable version of RayDP supports Ray 1.3.0. Ray 1.4.0 changed the signature, which is why you see this error. Have you tried raydp 0.3.0 with Ray 1.3.0? If you want to use newer versions of Ray, you can also try our nightly versions.

Yes, I tried.

YeahNew commented on Aug 25 '21

What's the error message?

kira-lin commented on Aug 25 '21

> What's the error message?

When I try raydp 0.3.0 with Ray 1.3.0, the error message is as follows:

-------------message------------

Connected to pydev debugger (build 201.8538.36)
Using backend: pytorch
Current Path: /home/yexin/workSpace/PyCharmWorks/HikVision/rayhik/rayABV2
2021-08-25 17:09:30,123 INFO worker.py:641 -- Connecting to existing Ray cluster at address: 10.3.68.117:9999
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2021-08-25 17:09:32,063 ERROR DefaultRayRuntimeFactory [Thread-2]: Failed to initialize ray runtime, with config {"ray":{"address":"10.3.68.117:9999","head-args":[],"job":{"code-search-path":"","id":"","jvm-options":[],"num-java-workers-per-process":1,"worker-env":{}},"logging":{"dir":"/tmp/ray/session_2021-08-25_17-09-05_973250_43058/logs","level":"INFO","max-backup-files":10,"max-file-size":"500MB","pattern":"%d{yyyy-MM-dd HH:mm:ss,SSS} %p %c{1} [%t]: %m%n"},"node-ip":"10.3.68.117","object-store":{"socket-name":"/tmp/ray/session_2021-08-25_17-09-05_973250_43058/sockets/plasma_store"},"raylet":{"config":{"num_workers_per_process_java":"1"},"node-manager-port":40525,"socket-name":"/tmp/ray/session_2021-08-25_17-09-05_973250_43058/sockets/raylet"},"redis":{"password":"5241590000000000"},"resources":"CPU:4","run-mode":"CLUSTER","session-dir":"/tmp/ray/session_2021-08-25_17-09-05_973250_43058"}}
java.lang.RuntimeException: Failed to get address info. Output: {'object_store_address': '/tmp/ray/session_2021-08-25_17-09-05_973250_43058/sockets/plasma_store', 'raylet_socket_name': '/tmp/ray/session_2021-08-25_17-09-05_973250_43058/sockets/raylet', 'node_manager_port': 40525}
    at io.ray.runtime.runner.RunManager.getAddressInfoAndFillConfig(RunManager.java:88)
    at io.ray.runtime.RayNativeRuntime.start(RayNativeRuntime.java:79)
    at io.ray.runtime.DefaultRayRuntimeFactory.createRayRuntime(DefaultRayRuntimeFactory.java:39)
    at io.ray.api.Ray.init(Ray.java:39)
    at io.ray.api.Ray.init(Ray.java:26)
    at org.apache.spark.deploy.raydp.AppMasterJavaBridge.startUpAppMaster(AppMasterJavaBridge.scala:41)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)
Caused by: com.google.gson.JsonSyntaxException: com.google.gson.stream.MalformedJsonException: Use JsonReader.setLenient(true) to accept malformed JSON at line 1 column 3 path $
    at com.google.gson.JsonParser.parse(JsonParser.java:65)
    at com.google.gson.JsonParser.parse(JsonParser.java:45)
    at io.ray.runtime.runner.RunManager.getAddressInfoAndFillConfig(RunManager.java:83)
    ... 16 more
Caused by: com.google.gson.stream.MalformedJsonException: Use JsonReader.setLenient(true) to accept malformed JSON at line 1 column 3 path $
    at com.google.gson.stream.JsonReader.syntaxError(JsonReader.java:1568)
    at com.google.gson.stream.JsonReader.checkLenient(JsonReader.java:1409)
    at com.google.gson.stream.JsonReader.doPeek(JsonReader.java:542)
    at com.google.gson.stream.JsonReader.peek(JsonReader.java:425)
    at com.google.gson.JsonParser.parse(JsonParser.java:60)
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.7/site-packages/raydp/context.py", line 122, in init_spark
    return _global_spark_context.get_or_create_session()
  File "/root/miniconda3/lib/python3.7/site-packages/raydp/context.py", line 68, in get_or_create_session
    spark_cluster = self._get_or_create_spark_cluster()
  File "/root/miniconda3/lib/python3.7/site-packages/raydp/context.py", line 62, in _get_or_create_spark_cluster
    self._spark_cluster = SparkCluster(self._configs)
  File "/root/miniconda3/lib/python3.7/site-packages/raydp/spark/ray_cluster.py", line 33, in init
    self._set_up_master(None, None)
  File "/root/miniconda3/lib/python3.7/site-packages/raydp/spark/ray_cluster.py", line 39, in _set_up_master
    self._app_master_bridge.start_up()
  File "/root/miniconda3/lib/python3.7/site-packages/raydp/spark/ray_cluster_master.py", line 59, in start_up
    self._create_app_master(extra_classpath)
  File "/root/miniconda3/lib/python3.7/site-packages/raydp/spark/ray_cluster_master.py", line 169, in _create_app_master
    self._app_master_java_bridge.startUpAppMaster(extra_classpath)
  File "/root/miniconda3/lib/python3.7/site-packages/py4j/java_gateway.py", line 1305, in call
    answer, self.gateway_client, self.target_id, self.name)
  File "/root/miniconda3/lib/python3.7/site-packages/py4j/protocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o0.startUpAppMaster.
: java.lang.RuntimeException: Failed to initialize Ray runtime.
    at io.ray.api.Ray.init(Ray.java:28)
    at org.apache.spark.deploy.raydp.AppMasterJavaBridge.startUpAppMaster(AppMasterJavaBridge.scala:41)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: Failed to initialize ray runtime
    at io.ray.runtime.DefaultRayRuntimeFactory.createRayRuntime(DefaultRayRuntimeFactory.java:43)
    at io.ray.api.Ray.init(Ray.java:39)
    at io.ray.api.Ray.init(Ray.java:26)

---------message-----------

YeahNew commented on Aug 25 '21

This is weird. I just tried it and it works. Just to make sure, did you restart the Ray cluster after installing Ray 1.3.0? It complains about a MalformedJsonException; is there any illegal character in the path?

kira-lin commented on Aug 25 '21
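To rule out a bad path on the Python side, the address info that ray.init() returns can be printed and scanned for unexpected characters, since the Java bridge later parses the same paths. This is a diagnostic sketch and assumes Ray 1.3.x, where ray.init() returns a dict of address info rather than a context object.

```python
# Hedged diagnostic sketch: inspect the session/socket paths handed to the JVM side.
import ray

info = ray.init(address="auto", _redis_password="5241590000000000")
for key, value in sorted(info.items()):
    print(key, "->", value)
    if isinstance(value, str) and not value.isascii():
        print("  warning: non-ASCII character in", key)
```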

> This is weird. I just tried it and it works. Just to make sure, did you restart the Ray cluster after installing Ray 1.3.0? It complains about a MalformedJsonException; is there any illegal character in the path?

YeahNew commented on Aug 26 '21

> This is weird. I just tried it and it works. Just to make sure, did you restart the Ray cluster after installing Ray 1.3.0? It complains about a MalformedJsonException; is there any illegal character in the path?

@kira-lin After trying many times, the problem is still the same. There is no illegal character in the path, and there is no problem running a Ray or PySpark program on its own.

The following error seems to be related to raydp

File "/root/miniconda3/lib/python3.7/site-packages/raydp/spark/ray_cluster.py", line 33, in init self._set_up_master(None, None) File "/root/miniconda3/lib/python3.7/site-packages/raydp/spark/ray_cluster.py", line 39, in _set_up_master self._app_master_bridge.start_up() File "/root/miniconda3/lib/python3.7/site-packages/raydp/spark/ray_cluster_master.py", line 59, in start_up self._create_app_master(extra_classpath) File "/root/miniconda3/lib/python3.7/site-packages/raydp/spark/ray_cluster_master.py", line 169, in _create_app_master self._app_master_java_bridge.startUpAppMaster(extra_classpath) File "/root/miniconda3/lib/python3.7/site-packages/py4j/java_gateway.py", line 1305, in call answer, self.gateway_client, self.target_id, self.name) File "/root/miniconda3/lib/python3.7/site-packages/py4j/protocol.py", line 328, in get_return_value format(target_id, ".", name), value) py4j.protocol.Py4JJavaError: An error occurred while calling o0.startUpAppMaster. : java.lang.RuntimeException: Failed to initialize Ray runtime. at io.ray.api.Ray.init(Ray.java:28) at org.apache.spark.deploy.raydp.AppMasterJavaBridge.startUpAppMaster(AppMasterJavaBridge.scala:41)

YeahNew commented on Aug 30 '21

Confirmed with @YeahNew offline: the error only happens when running in an IDE. There is no error when running from the command line.

carsonwang commented on Aug 30 '21
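Since the failure reproduces only inside the IDE, the most likely difference is the environment the IDE run configuration injects (interpreter, JAVA_HOME, PATH, and similar). A small sketch like the one below, run once from the IDE and once from a terminal, makes the difference easy to diff; the variable list is illustrative.

```python
# Hedged sketch: dump environment details that commonly differ between an IDE and a shell.
import os
import sys

print("python:", sys.executable)
for name in ("JAVA_HOME", "PATH", "PYTHONPATH", "LD_LIBRARY_PATH", "CLASSPATH"):
    print(name, "=", os.environ.get(name, "<unset>"))
```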

> Confirmed with @YeahNew offline: the error only happens when running in an IDE. There is no error when running from the command line.

Yes. Thank you very much, Carson.

YeahNew commented on Aug 30 '21

Closing as stale.

kira-lin commented on Apr 14 '23