# Elastic Spark Cluster
## Prerequisite

A running Spark cluster.
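If you don't have one yet, a cluster can be launched with spark-ec2's `launch` command (the key and cluster names here are the same placeholders used below):

```
./spark-ec2 -k key-name -i file.pem launch cluster_name
```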
## add-slaves

```
./spark-ec2 -k key-name -i file.pem add-slaves cluster_name
```

This command adds a new slave to the cluster. To add more than one slave, use the `-s` or `--slaves` option. You can request that the slaves be started as spot instances with the `--spot-price` option, and specify the instance type with `--instance-type` or `-t`.
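These options can be combined; for example (the spot price matches the one used in the tests below, while the instance type is a placeholder):

```
./spark-ec2 -k key-name -i file.pem -s 3 --spot-price=0.05 --instance-type=m3.large add-slaves cluster_name
```

This would add three `m3.large` slaves as spot instances bid at $0.05/hour.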
## remove-slaves

```
./spark-ec2 -k key-name -i file.pem remove-slaves cluster_name
```

This command removes a slave, unless it is the last slave in the cluster. To remove several slaves, indicate the number with the `-s` or `--slaves` option.
If the given number is greater than the number of slaves in the cluster, or is equal to -1, all slaves but one are removed.

Note:

- Slaves are removed at random.
- One slave is always kept.
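For instance, to shrink the cluster down to a single slave regardless of its current size (using only the flags documented above):

```
./spark-ec2 -k key-name -i file.pem -s -1 remove-slaves cluster_name
```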
## Testing command shortcuts

- Environment variables

  ```
  KEY_NAME="key-name"
  ID_FILE="file.pem"
  SPARK_EC2_GIT_REPO="https://github.com/tirami-su/spark-ec2"
  SPARK_EC2_GIT_BRANCH="elastic-spark-cluster"
  CLUSTER_NAME="cluster_name"
  ```

- Creating a cluster

  ```
  ./spark-ec2 -k $KEY_NAME -i $ID_FILE --spark-ec2-git-repo=$SPARK_EC2_GIT_REPO --spark-ec2-git-branch=$SPARK_EC2_GIT_BRANCH launch $CLUSTER_NAME
  ```

  slaves = 1

- Trying to remove a slave

  ```
  ./spark-ec2 -k $KEY_NAME -i $ID_FILE remove-slaves $CLUSTER_NAME
  ```

  slaves = 1

- Adding a slave

  ```
  ./spark-ec2 -k $KEY_NAME -i $ID_FILE add-slaves $CLUSTER_NAME
  ```

  slaves = 2

- Adding three more slaves as spot instances

  ```
  ./spark-ec2 -s 3 -k $KEY_NAME -i $ID_FILE --spot-price=0.05 add-slaves $CLUSTER_NAME
  ```

  slaves = 5

- Removing two slaves

  ```
  ./spark-ec2 -s 2 -k $KEY_NAME -i $ID_FILE remove-slaves $CLUSTER_NAME
  ```

  slaves = 3

- Removing all slaves (except one)

  ```
  ./spark-ec2 -s -1 -k $KEY_NAME -i $ID_FILE remove-slaves $CLUSTER_NAME
  ```

  slaves = 1

- Destroying the cluster

  ```
  ./spark-ec2 -k $KEY_NAME -i $ID_FILE destroy $CLUSTER_NAME
  ```
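The walkthrough above can also be run end to end as a single script; a minimal sketch, assuming the placeholder key pair and identity file are replaced with real values:

```
#!/usr/bin/env bash
set -e

# Illustrative values -- replace with your own key pair and identity file.
KEY_NAME="key-name"
ID_FILE="file.pem"
SPARK_EC2_GIT_REPO="https://github.com/tirami-su/spark-ec2"
SPARK_EC2_GIT_BRANCH="elastic-spark-cluster"
CLUSTER_NAME="cluster_name"

# Launch a cluster with one slave.
./spark-ec2 -k $KEY_NAME -i $ID_FILE \
    --spark-ec2-git-repo=$SPARK_EC2_GIT_REPO \
    --spark-ec2-git-branch=$SPARK_EC2_GIT_BRANCH \
    launch $CLUSTER_NAME

# The last slave is never removed: the cluster stays at one slave.
./spark-ec2 -k $KEY_NAME -i $ID_FILE remove-slaves $CLUSTER_NAME

# Grow to two slaves, then to five (three of them spot instances).
./spark-ec2 -k $KEY_NAME -i $ID_FILE add-slaves $CLUSTER_NAME
./spark-ec2 -s 3 -k $KEY_NAME -i $ID_FILE --spot-price=0.05 add-slaves $CLUSTER_NAME

# Shrink back to three slaves, then to one (-s -1 keeps a single slave).
./spark-ec2 -s 2 -k $KEY_NAME -i $ID_FILE remove-slaves $CLUSTER_NAME
./spark-ec2 -s -1 -k $KEY_NAME -i $ID_FILE remove-slaves $CLUSTER_NAME

# Tear everything down.
./spark-ec2 -k $KEY_NAME -i $ID_FILE destroy $CLUSTER_NAME
```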
This is great. Thanks @tirami-su -- I'll take a look at this soon.
@nchammas Would be great if you also took a look.
Thanks @shivaram! Super :D!
I won't be able to review this PR in detail since I'm struggling to find time to complete even my own implementation of this feature for Flintrock.
However, some high-level questions:
- How does `add-slaves` know what version of Spark to install and what other setup to do on the new slaves? Can the user inadvertently create a broken cluster with a bad call to `add-slaves`?
- Can you explain the high-level purpose of `entities.generic` and how it fits in with the current code base?
> How does `add-slaves` know what version of Spark to install and what other setup to do on the new slaves? Can the user inadvertently create a broken cluster with a bad call to `add-slaves`?
There are three types of deploy folders:

- `deploy.generic`:

  ```
  deploy.generic/
  └── root
      └── spark-ec2
          └── ec2-variables.sh
  ```

- `entities.generic`:

  ```
  entities.generic/
  └── root
      └── spark-ec2
          ├── masters
          └── slaves
  ```

- `new_slaves.generic`:

  ```
  new_slaves.generic/
  └── root
      └── spark-ec2
          └── new_slaves
  ```
`deploy.generic`, which contains the EC2 variables (`SPARK_VERSION`, `HADOOP_MAJOR_VERSION`, `MODULES`, ...), is deployed only when the cluster is created. `add-slaves` uses the settings from `ec2-variables.sh` to create the new slaves, so it will not break the cluster: the new slaves are set up with the same versions.
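For illustration, a hypothetical excerpt of what `ec2-variables.sh` might hold (the variable names are the ones listed above; the values are made up):

```
# Illustrative excerpt of ec2-variables.sh -- values are placeholders.
export SPARK_VERSION="1.6.0"
export HADOOP_MAJOR_VERSION="2"
export MODULES="spark hdfs"
```

Because `add-slaves` re-reads this file, new slaves come up with exactly the versions the cluster was launched with.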
There is a problem if you mix instance types, though. For example, suppose you start a cluster with big instances and later add small ones: the memory and CPU settings are evaluated only on the master and the first slave in `deploy_templates.py`, so if the new slaves don't have sufficient memory or CPU, jobs will fail.
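A sketch of the risky sequence, using the documented `--instance-type` flag (the instance type names here are placeholders, not from the PR):

```
# Launch with big instances; deploy_templates.py sizes memory/CPU from these.
./spark-ec2 -k key-name -i file.pem --instance-type=r3.xlarge launch cluster_name

# Later add small instances: they inherit the r3.xlarge-sized settings and
# may not have enough memory or CPU to run the configured executors.
./spark-ec2 -k key-name -i file.pem --instance-type=t2.small add-slaves cluster_name
```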
Another "problem" is that if you modify files present in templates, like core-site.xml
, etc.. directly in a running cluster, the modifications will be overridden by the original version present if you launch new instances, because the setup_new_slaves
launches deploy_templates.py
.
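Concretely (the file path here is illustrative of a typical spark-ec2 layout, not taken from the PR):

```
# Hand-edit a templated file on the running master...
vi /root/ephemeral-hdfs/conf/core-site.xml

# ...then grow the cluster: setup_new_slaves runs deploy_templates.py,
# which regenerates core-site.xml from the template, discarding the edit.
./spark-ec2 -k key-name -i file.pem add-slaves cluster_name
```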
> Can you explain the high-level purpose of `entities.generic` and how it fits in with the current code base?
The purpose of `entities.generic` is to separate the settings that change (masters and, mostly, slaves) from the others. `entities.generic` deploys only the `masters` and `slaves` files; this separation lets us avoid overwriting `ec2-variables.sh` when launching new slaves, and so avoids corrupting the cluster.
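For a concrete picture, the deployed `masters` and `slaves` files are plain host lists; a hypothetical example (the hostnames are invented):

```
# entities.generic/root/spark-ec2/masters
ec2-52-0-0-1.compute-1.amazonaws.com

# entities.generic/root/spark-ec2/slaves
ec2-52-0-0-2.compute-1.amazonaws.com
ec2-52-0-0-3.compute-1.amazonaws.com
```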
Thank you for this review and for the questions. If the answers aren't clear, feel free to ask me for details, and likewise if you have other questions :)!