[Improvement][Spark] Support Local Spark Cluster
Search before asking
- [X] I had searched in the issues and found no similar feature requirement.
Description
When a Spark Task executes spark-submit, the 'cluster' and 'client' deploy modes map to --master yarn or --master k8s://.... I would like an option to use a local standalone Spark cluster. In other words, the equivalent spark-submit option is: --master spark://<hostname>:<port>
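For illustration, the desired final submission would look roughly like this (the hostname, port, and example application are placeholders, not from the issue):

# Submit to a standalone master instead of YARN/K8s; devel:7077 is a placeholder.
${SPARK_HOME}/bin/spark-submit \
  --master spark://devel:7077 \
  --deploy-mode cluster \
  --class org.apache.spark.examples.SparkPi \
  ${SPARK_HOME}/examples/jars/spark-examples_*.jar 10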
Are you willing to submit a PR?
- [ ] Yes I am willing to submit a PR!
Code of Conduct
- [X] I agree to follow this project's Code of Conduct
I would like to give this issue a try.
My current workaround is to pass --master ... --deploy-mode cluster in the extra options. Since spark-submit uses the last value when an option is repeated, this sends the task to the local cluster. For example, look at this log, where my own --master option overrides DolphinScheduler's --master local:
[INFO] 2024-02-02 14:27:38.934 +0000 - Final Shell file is :
#!/bin/bash
BASEDIR=$(cd `dirname $0`; pwd)
cd $BASEDIR
export SPARK_HOME=/opt/spark-3.5.0-bin-hadoop3
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
${SPARK_HOME}/bin/spark-submit --master local \
--class com.example.monitor.ScanMonitor --conf spark.driver.cores=1 --conf spark.driver.memory=512M \
--conf spark.executor.instances=2 --conf spark.executor.cores=2 \
--conf spark.executor.memory=2G \
--master spark://devel:7077 --deploy-mode cluster \
file:/opt/apache-dolphinscheduler-3.2.0-bin/standalone-server/files/default/resources/monitor-0.1-jdk11.jar producer
...
24/02/02 14:27:54 INFO ClientEndpoint: Driver successfully submitted as driver-20240202142754-0003
2024-02-02 14:28:00.038 +0000 - ->
24/02/02 14:27:59 INFO ClientEndpoint: State of driver-20240202142754-0003 is RUNNING
24/02/02 14:27:59 INFO ClientEndpoint: Driver running on 172.16.254.204:35595 (worker-20240202141308-172.16.254.204-35595)
24/02/02 14:27:59 INFO ClientEndpoint: spark-submit not configured to wait for completion, exiting spark-submit JVM.
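For anyone wanting to sanity-check the last-one-wins behavior outside DolphinScheduler, a quick hypothetical test (not from the issue) is to repeat --master and inspect the parsed arguments with --verbose:

# --master appears twice; spark-submit keeps the later value (local[2]).
${SPARK_HOME}/bin/spark-submit --verbose \
  --master local --master "local[2]" \
  --class org.apache.spark.examples.SparkPi \
  ${SPARK_HOME}/examples/jars/spark-examples_*.jar 10 2>&1 | grep -m1 master
# Expected: a "master  local[2]" line from the parsed-arguments dump.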
Thanks @git-blame for the quick workaround; it does work via the extra options, but as mentioned, --master is an important Spark parameter.
I will check with the community to see whether the current behavior is by design, based on previous discussions.
If not, I will add the parameter to the Spark task.
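A rough sketch of the final shell file DolphinScheduler could generate once a standalone master is supported natively (the exact flags and layout are my assumption, not a committed design):

# Hypothetical generated command: the standalone master URL would come from the
# new task parameter, so no duplicate --master override is needed.
${SPARK_HOME}/bin/spark-submit \
  --master spark://devel:7077 \
  --deploy-mode cluster \
  --class com.example.monitor.ScanMonitor \
  --conf spark.executor.instances=2 --conf spark.executor.cores=2 --conf spark.executor.memory=2G \
  file:/opt/apache-dolphinscheduler-3.2.0-bin/standalone-server/files/default/resources/monitor-0.1-jdk11.jar producer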
This issue has been automatically marked as stale because it has not had recent activity for 30 days. It will be closed in next 7 days if no further activity occurs.
still working
This issue has been automatically marked as stale because it has not had recent activity for 30 days. It will be closed in next 7 days if no further activity occurs.
still working