spark-knowledgebase
Spark Job submitted - Waiting (TaskSchedulerImpl : Initial job not accepted)
An API call was made to submit the job. The response states that it is Running.
On Cluster UI -
Worker (slave) - worker-20160712083825-172.31.17.189-59433 is Alive
Cores: 1 out of 2 used
Memory: 1 GB out of 6 GB used
Running Application
app-20160713130056-0020 - Waiting for 5 hrs
Cores - unlimited
Job Description of the Application
Active Stage
reduceByKey at /root/wordcount.py:23
Pending Stage
takeOrdered at /root/wordcount.py:26
Running Driver -
stderr log page for driver-20160713130051-0025
WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
According to "Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources", the usual cause is that the slaves haven't been started, hence there are no resources.
However, in my case Slave 1 is working.
According to "Unable to Execute More than a spark Job 'Initial job has not accepted any resources'", I am using deploy-mode = cluster (not client), since I have 1 master and 1 slave and the submit API is being called via Postman / anywhere.
Also, the cluster has available cores and memory, yet the job still throws the error, as conveyed by the UI.
According to "TaskSchedulerImpl: Initial job has not accepted any resources", I assigned the following Spark environment variables in ~/spark-1.5.0/conf/spark-env.sh:
SPARK_WORKER_INSTANCES=1
SPARK_WORKER_MEMORY=1000m
SPARK_WORKER_CORES=2
and replicated them across the slaves:
sudo /root/spark-ec2/copy-dir /root/spark/conf/spark-env.sh
All the cases in the answer to the above question were applicable, yet no solution was found. Hence, because I was working with APIs and Apache Spark, maybe some other assistance is required.
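One thing worth double-checking here (an editorial sketch, not part of the original post): SPARK_WORKER_MEMORY=1000m is slightly less than the default spark.executor.memory of 1g (1024 MB), and in cluster deploy mode the driver itself also takes a core and some memory on the same worker. Explicitly capping what the application requests, so that it fits inside what the single worker offers, often clears this warning. A minimal PySpark sketch; the property names are standard Spark settings, the values are only examples:

from pyspark import SparkContext, SparkConf

conf = (SparkConf()
        .setAppName("MyApp")
        .set("spark.cores.max", "1")            # leave one core free for the driver
        .set("spark.executor.memory", "512m")   # must fit inside the worker's memory offer
        .set("spark.driver.memory", "512m"))    # the driver also runs on the worker in cluster mode
sc = SparkContext(conf=conf)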
wordcount.py - my PySpark application code:
import time, re
from pyspark import SparkContext, SparkConf

def linesToWordsFunc(line):
    wordsList = line.split()
    wordsList = [re.sub(r'\W+', '', word) for word in wordsList]
    filtered = filter(lambda word: re.match(r'\w+', word), wordsList)
    return filtered

def wordsToPairsFunc(word):
    return (word, 1)

def reduceToCount(a, b):
    return (a + b)

def main():
    conf = SparkConf().setAppName("MyApp").setMaster("spark://ec2-54-209-108-127.compute-1.amazonaws.com:7077")
    sc = SparkContext(conf=conf)

    rdd = sc.textFile("/user/root/In/a.txt")

    words = rdd.flatMap(linesToWordsFunc)
    pairs = words.map(wordsToPairsFunc)
    counts = pairs.reduceByKey(reduceToCount)

    # Get the top 100 words by count
    output = counts.takeOrdered(100, lambda (k, v): -v)

    for (word, count) in output:
        print word + ': ' + str(count)

    sc.stop()

if __name__ == "__main__":
    main()
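A side note on this code (my addition, not from the original post): properties set directly on a SparkConf take precedence over whatever the submission supplies, so hard-coding setMaster here means the URL baked into the script overrides the one given at submit time. When submitting in cluster mode it is usually cleaner to leave the master out of the code and let the submission provide it. A minimal sketch:

from pyspark import SparkContext, SparkConf

# No setMaster here: the master URL and deploy mode come from the submission itself.
conf = SparkConf().setAppName("MyApp")
sc = SparkContext(conf=conf)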
I tried running another piece of code, a simple Python application, and the error still persists:
from pyspark import SparkContext, SparkConf
logFile = "/user/root/In/a.txt"
conf = (SparkConf().set("num-executors", "1"))
sc = SparkContext(master = "spark://ec2-54-209-108-127.compute-1.amazonaws.com:7077", appName = "MyApp", conf = conf)
textFile = sc.textFile(logFile)
wordCounts = textFile.flatMap(lambda line: line.split()).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a+b)
wordCounts.saveAsTextFile("/user/root/In/output.txt")
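A note on the configuration above (my addition): "num-executors" is not a SparkConf property name, so .set("num-executors", "1") has no effect; the property equivalent of the --num-executors flag is spark.executor.instances, and on a standalone cluster the usual knobs are spark.cores.max and spark.executor.memory. A hedged sketch of what was probably intended:

from pyspark import SparkContext, SparkConf

conf = (SparkConf()
        .set("spark.cores.max", "1")             # total cores the application may claim
        .set("spark.executor.memory", "512m"))   # memory per executor (example value)
sc = SparkContext(master="spark://ec2-54-209-108-127.compute-1.amazonaws.com:7077",
                  appName="MyApp", conf=conf)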
I had the same issue.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.serializer.KryoSerializer
import scala.math.random

object SparkPi {

  val sparkConf = new SparkConf()
    .setAppName("Spark Pi")
    .setMaster("spark://10.100.103.25:7077")
    //.setMaster("local[*]")
    .set("spark.serializer", classOf[KryoSerializer].getName)
    .set("spark.kryo.registrator", classOf[MyRegistrator].getName) // MyRegistrator: my own Kryo registrator, defined elsewhere
    .set("spark.executor.memory", "3g")

  lazy val sc = new SparkContext(sparkConf)

  def main(args: Array[String]) {
    val slices = 2
    val n = math.min(10000L * slices, Int.MaxValue).toInt // avoid overflow
    val count = sc.parallelize(1 until n, slices).map { i =>
      val x = random * 2 - 1
      val y = random * 2 - 1
      if (x * x + y * y <= 1) 1 else 0
    }.reduce(_ + _)
    val pi = 4.0 * count / (n - 1)
    logger.warn(s"Pi is roughly $pi") // logger: my own log4j/slf4j logger
  }
}
In my case it was due to firewall policies: stop the firewall and restart all services. If the master and slave were already started, just stopping the firewall still doesn't work; you need to restart them as well. This tripped me up.
./sbin/stop-all.sh
iptables -F
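A gentler alternative to flushing every iptables rule (my suggestion, not part of this answer) is to pin the randomly chosen Spark ports to fixed values in conf/spark-defaults.conf and open only those, together with the master port 7077 and the web UI ports, in the firewall. The property names are standard Spark settings; the port numbers are only examples:

spark.driver.port               10027
spark.blockManager.port         10025
spark.driver.blockManager.port  10026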
Hi, I am facing much the same issue. I am deploying prediction.io on a multi-node cluster where training should happen on the worker node. The worker node has been successfully registered with the master.
Following are the logs after starting slaves.sh:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
18/05/22 06:01:44 INFO Worker: Started daemon with process name: 2208@ip-172-31-6-235
18/05/22 06:01:44 INFO SignalUtils: Registered signal handler for TERM
18/05/22 06:01:44 INFO SignalUtils: Registered signal handler for HUP
18/05/22 06:01:44 INFO SignalUtils: Registered signal handler for INT
18/05/22 06:01:44 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/05/22 06:01:44 INFO SecurityManager: Changing view acls to: ubuntu
18/05/22 06:01:44 INFO SecurityManager: Changing modify acls to: ubuntu
18/05/22 06:01:44 INFO SecurityManager: Changing view acls groups to:
18/05/22 06:01:44 INFO SecurityManager: Changing modify acls groups to:
18/05/22 06:01:44 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(ubuntu); groups with view permissions: Set(); users with modify permissions: Set(ubuntu); groups with modify permissions: Set()
18/05/22 06:01:44 INFO Utils: Successfully started service 'sparkWorker' on port 45057.
18/05/22 06:01:44 INFO Worker: Starting Spark worker 172.31.6.235:45057 with 8 cores, 24.0 GB RAM
18/05/22 06:01:44 INFO Worker: Running Spark version 2.1.1
18/05/22 06:01:44 INFO Worker: Spark home: /home/ubuntu/PredictionIO-0.12.0-incubating/vendors/spark-2.1.1-bin-hadoop2.6
18/05/22 06:01:45 INFO Utils: Successfully started service 'WorkerUI' on port 8081.
18/05/22 06:01:45 INFO WorkerWebUI: Bound WorkerWebUI to 0.0.0.0, and started at http://172.31.6.235:8081
18/05/22 06:01:45 INFO Worker: Connecting to master ip-172-31-5-119.ap-southeast-1.compute.internal:7077...
18/05/22 06:01:45 INFO TransportClientFactory: Successfully created connection to ip-172-31-5-119.ap-southeast-1.compute.internal/172.31.5.119:7077 after 19 ms (0 ms spent in bootstraps)
18/05/22 06:01:45 INFO Worker: Successfully registered with master spark://ip-172-31-5-119.ap-southeast-1.compute.internal:7077
Now the issues:
1. If I launch one slave on the master node and one slave on my other node:
1.1 If the slave on the master node is given fewer resources, it gives an "unable to re-shuffle" error.
1.2 If I give more resources to the worker on the master node, all the execution happens on the master node; nothing is sent to the slave node.
2. If I do not start a slave on the master node:
2.1 I get the following error:
[WARN] [TaskSchedulerImpl] Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
I have assigned 24 GB RAM and 8 cores to the worker.
However, when I start the process, the following are the logs I get on the slave machine:
18/05/22 06:16:00 INFO Worker: Asked to launch executor app-20180522061600-0001/0 for PredictionIO Training: com.actionml.RecommendationEngine
18/05/22 06:16:00 INFO SecurityManager: Changing view acls to: ubuntu
18/05/22 06:16:00 INFO SecurityManager: Changing modify acls to: ubuntu
18/05/22 06:16:00 INFO SecurityManager: Changing view acls groups to:
18/05/22 06:16:00 INFO SecurityManager: Changing modify acls groups to:
18/05/22 06:16:00 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(ubuntu); groups with view permissions: Set(); users with modify permissions: Set(ubuntu); groups with modify permissions: Set()
18/05/22 06:16:00 INFO ExecutorRunner: Launch command: "/usr/lib/jvm/java-8-oracle/bin/java" "-cp" "./:/home/ubuntu/PredictionIO-0.12.0-incubating/vendors/spark-2.1.1-bin-hadoop2.6/conf/:/home/ubuntu/PredictionIO-0.12.0-incubating/vendors/spark-2.1.1-bin-hadoop2.6/jars/*" "-Xmx4096M" "-Dspark.driver.port=45049" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://[email protected]:45049" "--executor-id" "0" "--hostname" "172.31.6.235" "--cores" "8" "--app-id" "app-20180522061600-0001" "--worker-url" "spark://[email protected]:45057"
18/05/22 06:16:50 INFO Worker: Asked to kill executor app-20180522061600-0001/0
18/05/22 06:16:50 INFO ExecutorRunner: Runner thread for executor app-20180522061600-0001/0 interrupted
18/05/22 06:16:50 INFO ExecutorRunner: Killing process!
18/05/22 06:16:51 INFO Worker: Executor app-20180522061600-0001/0 finished with state KILLED exitStatus 143
18/05/22 06:16:51 INFO Worker: Cleaning up local directories for application app-20180522061600-0001
18/05/22 06:16:51 INFO ExternalShuffleBlockResolver: Application app-20180522061600-0001 removed, cleanupLocalDirs = true
Can somebody help me debug this issue? Thanks!
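A hedged thought on issue 1.2 above (an editorial addition, not part of this comment): in standalone mode an application gets at most one executor per worker unless spark.executor.cores is set, and that executor is sized by spark.executor.memory, so one oversized executor can end up hosting most of the work. Explicitly bounding the executor size gives the scheduler comparable executors on both workers; whether that is the actual cause here is hard to tell from the logs. Shown as a plain PySpark sketch with example values for an 8-core / 24 GB worker; the same spark.* properties can be passed to whatever submits the training job:

from pyspark import SparkContext, SparkConf

conf = (SparkConf()
        .setAppName("PredictionIO Training")
        .set("spark.executor.cores", "4")     # cap each executor at 4 cores...
        .set("spark.executor.memory", "8g")   # ...and 8 GB of heap
        .set("spark.cores.max", "8"))         # total cores across the whole cluster
sc = SparkContext(conf=conf)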
Looks like an "open source software" thing...
Real professionals do not ask this sort of question.
Since the last post, this "real professional" has figured out how to run a Spark cluster: open the following ports and configure them on the master node (they are intended to be configured on the master node, for anyone interested):
spark.blockManager.port 10025
spark.driver.blockManager.port 10026
spark.driver.port 10027
using the \conf\spark-defaults.conf configuration file. I ran into a blocked StackOverflow account here:
https://stackoverflow.com/questions/59209920/spark-standalone-cluster-configuration
with private feedback as follows:
I am now rapidly moving on to getting a handle on task/executor-to-worker resolution. I would imagine one needs to write a block of code extending either the logging or the task infrastructure, so that at the appropriate point in the life of a task/executor instance it reports an event indicating the state of the instance and the machine (or possibly the worker host process plus the machine). That is a trivial question from an engineering point of view, but as far as social interactions go, the unwillingness of the application maintainers to extend support on such details rather betrays what is so wrong with the Open Source community at its core.
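For what it is worth (an editorial addition, not from this comment): Spark already exposes executor-to-host placement through the monitoring REST API served by the application UI, so a custom listener is not strictly required just to see where executors ended up. A minimal sketch, assuming the driver UI is reachable on its default port 4040 and that the requests package is available; replace the host with your own:

import requests

BASE = "http://driver-host:4040/api/v1"   # application UI of the running driver (example host)

apps = requests.get(BASE + "/applications").json()
app_id = apps[0]["id"]

# Each executor entry reports, among other fields, its id and the hostPort it runs on.
for ex in requests.get(BASE + "/applications/%s/executors" % app_id).json():
    print("%s on %s, active tasks: %s" % (ex["id"], ex["hostPort"], ex["activeTasks"]))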
I hate you guys.
Barring personal feelings, the next step, depending on findings, might end up being a custom dispatcher for the standalone cluster (if I really do not want to go to Linux, and I do not want to go to Linux for the same reason I do not like open source software), or researching how to better set up a cluster using YARN or some other cluster resource management infrastructure on Windows, an operating system where stupid idiots such as myself still have a hope of getting a fair shake from someone who wrote the operating system because he might actually like it, unlike the real professionals who write Linux. And that step might end up being more involved than buying 1 terabyte of memory for my company's SQL Server and sending our workload into its "In-Memory" implementation, Hekaton. Which, unlike Spark, has indexes and other nifty features to make processing acceptably fast.
No hard feelings...