
Elastic spark cluster

Open ar-ms opened this issue 8 years ago • 4 comments

Elastic Spark Cluster

Prerequisite

A running Spark cluster.

add-slaves

./spark-ec2 -k key-name -i file.pem add-slaves cluster_name

This command adds a new slave to the cluster. To add more slaves, use the -s or --slaves option. You can request that the slaves be started as spot instances with the --spot-price option, and specify the instance type with --instance-type or -t.
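For example, to add three slaves as spot instances of a specific type (the spot price and instance type here are only illustrative):

./spark-ec2 -k key-name -i file.pem -s 3 --spot-price=0.05 --instance-type=m3.large add-slaves cluster_name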

remove-slaves

./spark-ec2 -k key-name -i file.pem remove-slaves cluster_name

This command removes one slave, provided it is not the last slave in the cluster. To remove several slaves, give the number with the -s or --slaves option. If the given number is greater than the number of slaves in the cluster, or equal to -1, all slaves are removed except one, which the cluster always keeps.

Note:

  • The slaves are removed randomly.
  • One slave will be kept.
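For example (the counts are illustrative):

./spark-ec2 -k key-name -i file.pem -s 2 remove-slaves cluster_name
./spark-ec2 -k key-name -i file.pem -s -1 remove-slaves cluster_name

The first command removes two slaves; the second removes all slaves but one.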

Testing command shortcuts

  • Environment variables
KEY_NAME="key-name"
ID_FILE="file.pem"
SPARK_EC2_GIT_REPO="https://github.com/tirami-su/spark-ec2"
SPARK_EC2_GIT_BRANCH="elastic-spark-cluster"
CLUSTER_NAME="cluster_name"
  • Creating a cluster
./spark-ec2 -k $KEY_NAME -i $ID_FILE --spark-ec2-git-repo=$SPARK_EC2_GIT_REPO --spark-ec2-git-branch=$SPARK_EC2_GIT_BRANCH launch $CLUSTER_NAME

    slaves = 1

  • Trying to remove a slave
./spark-ec2 -k $KEY_NAME -i $ID_FILE remove-slaves $CLUSTER_NAME

    slaves = 1

  • Adding a slave
./spark-ec2 -k $KEY_NAME -i $ID_FILE add-slaves $CLUSTER_NAME

    slaves = 2

  • Adding three more slaves as spot instances
./spark-ec2 -s 3 -k $KEY_NAME -i $ID_FILE --spot-price=0.05 add-slaves $CLUSTER_NAME

    slaves = 5

  • Removing two slaves
./spark-ec2 -s 2 -k $KEY_NAME -i $ID_FILE remove-slaves $CLUSTER_NAME

    slaves = 3

  • Removing all slaves (except one)
./spark-ec2 -s -1 -k $KEY_NAME -i $ID_FILE remove-slaves $CLUSTER_NAME

    slaves = 1

  • Destroying the cluster
./spark-ec2 -k $KEY_NAME -i $ID_FILE destroy $CLUSTER_NAME
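To verify the slave count after each step, one hypothetical check (assuming get-master behaves as in upstream spark-ec2, and that the slaves file lives at /root/spark-ec2/slaves, as in the entities.generic tree further down this thread) is to count the entries in the slaves file on the master:

MASTER=$(./spark-ec2 -k $KEY_NAME -i $ID_FILE get-master $CLUSTER_NAME | tail -n 1)
ssh -i $ID_FILE root@$MASTER 'wc -l < /root/spark-ec2/slaves'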

ar-ms · Jul 18 '16 10:07

This is great. Thanks @tirami-su -- I'll take a look at this soon.

@nchammas Would be great if you also took a look.

shivaram · Jul 18 '16 17:07

Thanks @shivaram! Super :D!

ar-ms · Jul 18 '16 18:07

I won't be able to review this PR in detail since I'm struggling to find time to complete even my own implementation of this feature for Flintrock.

However, some high-level questions:

  • How does add-slaves know what version of Spark to install and other setup to do on the new slaves? Can the user inadvertently create a broken cluster with a bad call to add-slaves?
  • Can you explain the high-level purpose of entities.generic and how it fits in with the current code base?

nchammas · Jul 21 '16 17:07

Oh...

> How does add-slaves know what version of Spark to install and other setup to do on the new slaves? Can the user inadvertently create a broken cluster with a bad call to add-slaves?

There are three types of deploy folders:

  • deploy.generic:
deploy.generic/
└── root
    └── spark-ec2
        └── ec2-variables.sh
  • entities.generic:
entities.generic/
└── root
    └── spark-ec2
        ├── masters
        └── slaves
  • new_slaves.generic:
new_slaves.generic/
└── root
    └── spark-ec2
        └── new_slaves

deploy.generic, which contains ec2-variables.sh (SPARK_VERSION, HADOOP_MAJOR_VERSION, MODULES, ...), is deployed only when the cluster is created. add-slaves reuses the settings from ec2-variables.sh to set up the new slaves, so it will not break the cluster: the new slaves get the same versions as the existing ones.
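For illustration, after a launch ec2-variables.sh contains settings along these lines (the variable names are the ones mentioned above; the values here are made up):

export SPARK_VERSION="1.6.2"
export HADOOP_MAJOR_VERSION="2"
export MODULES="spark,ephemeral-hdfs"

add-slaves reads these back instead of recomputing them, which is what keeps new slaves consistent with the rest of the cluster.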

There is a problem if you mix instance types. For example, if you start a cluster with big instances and later add small ones: the memory and CPU settings are computed in deploy_templates.py from the master and the first slave only, so if the new slaves do not have sufficient memory or CPU, jobs will fail.
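A hypothetical sequence that triggers this (the instance types are just examples):

./spark-ec2 -k $KEY_NAME -i $ID_FILE -t m3.2xlarge launch $CLUSTER_NAME
./spark-ec2 -k $KEY_NAME -i $ID_FILE -t m3.medium add-slaves $CLUSTER_NAME

Here deploy_templates.py sized memory and CPU from the m3.2xlarge master and first slave, so executors placed on the m3.medium slaves can be over-committed and jobs will fail.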

Another "problem" is that if you modify files present in templates, like core-site.xml, etc.. directly in a running cluster, the modifications will be overridden by the original version present if you launch new instances, because the setup_new_slaves launches deploy_templates.py.

> Can you explain the high-level purpose of entities.generic and how it fits in with the current code base?

The purpose of entities.generic is to separate the settings that change over the cluster's lifetime (masters and, mostly, slaves) from the rest. entities.generic deploys only the masters and slaves files; this separation lets us avoid overwriting ec2-variables.sh when launching new slaves, and so avoids corrupting the cluster.
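Concretely, masters and slaves are plain lists of hostnames, one per line (the hostnames below are made up):

/root/spark-ec2/masters:
ec2-54-1-2-3.compute-1.amazonaws.com

/root/spark-ec2/slaves:
ec2-54-4-5-6.compute-1.amazonaws.com
ec2-54-7-8-9.compute-1.amazonaws.com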

Thank you for the review and the questions. If the answers aren't clear, feel free to ask me for details, and the same goes for any other questions :)!

ar-ms · Jul 25 '16 11:07