
Elastic spark cluster

Open ar-ms opened this issue 8 years ago • 4 comments

Elastic Spark Cluster

Prerequisite

A running Spark cluster.

add-slaves

./spark-ec2 -k key-name -i file.pem add-slaves cluster_name

This command adds a new slave to the cluster. To add more slaves, use the -s or --slaves option. You can request that the slaves be started as spot instances with the --spot-price option, and specify the instance type with --instance-type or -t.
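For example, to add three slaves as spot instances of a specific type (the spot price and instance type here are only illustrative):

./spark-ec2 -k key-name -i file.pem -s 3 --spot-price=0.05 --instance-type=m3.large add-slaves cluster_name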

remove-slaves

./spark-ec2 -k key-name -i file.pem remove-slaves cluster_name

This command removes one slave, provided it is not the last slave in the cluster. To remove several slaves, give the number with the -s or --slaves option. If the given number is greater than the number of slaves in the cluster, or equal to -1, all slaves are removed except one, which the cluster always keeps.

Note:

  • The slaves are removed randomly.
  • One slave will be kept.
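For example (the counts are illustrative):

./spark-ec2 -k key-name -i file.pem -s 2 remove-slaves cluster_name
./spark-ec2 -k key-name -i file.pem -s -1 remove-slaves cluster_name

The first command removes two slaves; the second removes all slaves but one.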

Testing command shortcuts

  • Environment variables
KEY_NAME="key-name"
ID_FILE="file.pem"
SPARK_EC2_GIT_REPO="https://github.com/tirami-su/spark-ec2"
SPARK_EC2_GIT_BRANCH="elastic-spark-cluster"
CLUSTER_NAME="cluster_name"
  • Creating a cluster
./spark-ec2 -k $KEY_NAME -i $ID_FILE --spark-ec2-git-repo=$SPARK_EC2_GIT_REPO --spark-ec2-git-branch=$SPARK_EC2_GIT_BRANCH launch $CLUSTER_NAME

    slaves = 1

  • Trying to remove a slave
./spark-ec2 -k $KEY_NAME -i $ID_FILE remove-slaves $CLUSTER_NAME

    slaves = 1

  • Adding a slave
./spark-ec2 -k $KEY_NAME -i $ID_FILE add-slaves $CLUSTER_NAME

    slaves = 2

  • Adding three more slaves as spot instances
./spark-ec2 -s 3 -k $KEY_NAME -i $ID_FILE --spot-price=0.05 add-slaves $CLUSTER_NAME

    slaves = 5

  • Removing two slaves
./spark-ec2 -s 2 -k $KEY_NAME -i $ID_FILE remove-slaves $CLUSTER_NAME

    slaves = 3

  • Removing all slaves (except one)
./spark-ec2 -s -1 -k $KEY_NAME -i $ID_FILE remove-slaves $CLUSTER_NAME

    slaves = 1

  • Destroying the cluster
./spark-ec2 -k $KEY_NAME -i $ID_FILE destroy $CLUSTER_NAME
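To verify the slave count after each step, one hypothetical check (assuming get-master behaves as in upstream spark-ec2, and that the slaves file lives at /root/spark-ec2/slaves, as in the entities.generic tree further down this thread) is to count the entries in the slaves file on the master:

MASTER=$(./spark-ec2 -k $KEY_NAME -i $ID_FILE get-master $CLUSTER_NAME | tail -n 1)
ssh -i $ID_FILE root@$MASTER 'wc -l < /root/spark-ec2/slaves'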

ar-ms · Jul 18 '16 10:07

This is great. Thanks @tirami-su -- I'll take a look at this soon.

@nchammas Would be great if you also took a look.

shivaram · Jul 18 '16 17:07

Thanks @shivaram! Super :D!

ar-ms · Jul 18 '16 18:07

I won't be able to review this PR in detail since I'm struggling to find time to complete even my own implementation of this feature for Flintrock.

However, some high-level questions:

  • How does add-slaves know what version of Spark to install and other setup to do on the new slaves? Can the user inadvertently create a broken cluster with a bad call to add-slaves?
  • Can you explain the high-level purpose of entities.generic and how it fits in with the current code base?

nchammas · Jul 21 '16 17:07

Oh...

> How does add-slaves know what version of Spark to install and other setup to do on the new slaves? Can the user inadvertently create a broken cluster with a bad call to add-slaves?

There are three types of deploy folders:

  • deploy.generic:
deploy.generic/
└── root
    └── spark-ec2
        └── ec2-variables.sh
  • entities.generic:
entities.generic/
└── root
    └── spark-ec2
        ├── masters
        └── slaves
  • new_slaves.generic:
new_slaves.generic/
└── root
    └── spark-ec2
        └── new_slaves

deploy.generic, which contains ec2-variables.sh (SPARK_VERSION, HADOOP_MAJOR_VERSION, MODULES, ...), is deployed only when the cluster is created. add-slaves reuses the settings from ec2-variables.sh to set up the new slaves, so it will not break the cluster: the new slaves get the same versions as the existing ones.
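For illustration, after a launch ec2-variables.sh contains settings along these lines (the variable names are the ones mentioned above; the values here are made up):

export SPARK_VERSION="1.6.2"
export HADOOP_MAJOR_VERSION="2"
export MODULES="spark,ephemeral-hdfs"

add-slaves reads these back instead of recomputing them, which is what keeps new slaves consistent with the rest of the cluster.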

There is a problem if you mix instance types. For example, if you start a cluster with big instances and later add small ones: the memory and CPU settings are computed in deploy_templates.py from the master and the first slave only, so if the new slaves do not have sufficient memory or CPU, jobs will fail.
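A hypothetical sequence that triggers this (the instance types are just examples):

./spark-ec2 -k $KEY_NAME -i $ID_FILE -t m3.2xlarge launch $CLUSTER_NAME
./spark-ec2 -k $KEY_NAME -i $ID_FILE -t m3.medium add-slaves $CLUSTER_NAME

Here deploy_templates.py sized memory and CPU from the m3.2xlarge master and first slave, so executors placed on the m3.medium slaves can be over-committed and jobs will fail.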

Another "problem" is that if you modify files present in templates, like core-site.xml, etc.. directly in a running cluster, the modifications will be overridden by the original version present if you launch new instances, because the setup_new_slaves launches deploy_templates.py.

> Can you explain the high-level purpose of entities.generic and how it fits in with the current code base?

The purpose of entities.generic is to separate the settings that change over the cluster's lifetime (masters and, mostly, slaves) from the rest. entities.generic deploys only the masters and slaves files; this separation lets us avoid overwriting ec2-variables.sh when launching new slaves, and so avoids corrupting the cluster.
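Concretely, masters and slaves are plain lists of hostnames, one per line (the hostnames below are made up):

/root/spark-ec2/masters:
ec2-54-1-2-3.compute-1.amazonaws.com

/root/spark-ec2/slaves:
ec2-54-4-5-6.compute-1.amazonaws.com
ec2-54-7-8-9.compute-1.amazonaws.com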

Thank you for the review and the questions. If the answers aren't clear, feel free to ask me for details, and the same goes for any other questions :)!

ar-ms · Jul 25 '16 11:07