spark-ec2
Config file devel
Configuration file
The -c (or --conf-file) option lets you use a YAML configuration file to simplify the way you interact with the cluster.
Launch new cluster
```
./spark-ec2 --conf-file config.yml launch my_cluster
```
Example of a configuration file
```yaml
slaves: 2
instance_type: r3.xlarge
region: eu-west-1
zone: eu-west-1c
ami: ami-xxxxxxxx
spot_price: 0.07
hadoop_major_version: yarn
spark_ec2_git_repo: https://github.com/tirami-su/spark-ec2
spark_ec2_git_branch: config_file-devel
copy_aws_credentials: true
no_ganglia: y
resume: on

## Network and security
key_pair: xxxxx
identity_file: xxxxx.pem
credentials:
  aws_access_key_id: XXXXX
  aws_secret_access_key: XXXXX
```
Note
- In this version, when the -c or --conf-file option is specified, only the parameters present in the configuration file are used; any other options on the command line are ignored.
- The identity file must be located in ~/.ssh/.
Modifications
- JSON to YAML
- Keys must be located in ~/.ssh/
This is a much-requested feature from the early days of spark-ec2, so thank you for working on this.
I wonder, though, if there is a way to implement this feature without using JSON -- which IMO is not appropriate as a configuration format -- and without manually maintaining this mapping of config name to variable name.
Are there no features of argparse that can be used to offer this functionality in a more straightforward way?
I really enjoy using this feature; it lets me make fewer mistakes when managing a cluster. But I didn't know it was a much-requested feature.
I thought JSON was an appropriate format; it seems clear and simple to manipulate. We could use ConfigParser, which uses the INI file format. The YAML format could be nice too, but it requires installing the PyYAML library...
I didn't find a feature of argparse that gives us what we want, but I wrote a function to avoid maintaining the mapping manually:
```python
unneeded_opts = ("version", "help")

def mapping_conf(opt):
    # str(opt) on an optparse Option yields e.g. "-c/--conf-file";
    # keep the long form and derive the config key from it.
    normal_opt = str(opt).split("/")[-1]
    trans_opt = normal_opt.strip("-").replace("-", "_")
    return {trans_opt: normal_opt}

# `parser` is the script's existing optparse.OptionParser.
map_conf = {}
for opt in parser.option_list:
    map_conf.update(mapping_conf(opt))
for unneeded_opt in unneeded_opts:
    map_conf.pop(unneeded_opt, None)
```
map_conf:
```python
{
'D': '-D',
'additional_security_group': '--additional-security-group',
'additional_tags': '--additional-tags',
'ami': '--ami',
'authorized_address': '--authorized-address',
'conf_file': '--conf-file',
'copy_aws_credentials': '--copy-aws-credentials',
'delete_groups': '--delete-groups',
'deploy_root_dir': '--deploy-root-dir',
'ebs_vol_num': '--ebs-vol-num',
'ebs_vol_size': '--ebs-vol-size',
'ebs_vol_type': '--ebs-vol-type',
'ganglia': '--ganglia',
'hadoop_major_version': '--hadoop-major-version',
'identity_file': '--identity-file',
'instance_initiated_shutdown_behavior': '--instance-initiated-shutdown-behavior',
'instance_profile_name': '--instance-profile-name',
'instance_type': '--instance-type',
'key_pair': '--key-pair',
'master_instance_type': '--master-instance-type',
'master_opts': '--master-opts',
'no_ganglia': '--no-ganglia',
'placement_group': '--placement-group',
'private_ips': '--private-ips',
'profile': '--profile',
'region': '--region',
'resume': '--resume',
'slaves': '--slaves',
'spark_ec2_git_branch': '--spark-ec2-git-branch',
'spark_ec2_git_repo': '--spark-ec2-git-repo',
'spark_git_repo': '--spark-git-repo',
'spark_version': '--spark-version',
'spot_price': '--spot-price',
'subnet_id': '--subnet-id',
'swap': '--swap',
'use_existing_master': '--use-existing-master',
'user': '--user',
'user_data': '--user-data',
'vpc_id': '--vpc-id',
'wait': '--wait',
'worker_instances': '--worker-instances',
'zone': '--zone'
}
```
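To sketch how the mapping could then be consumed (the helper name and merge logic below are just an illustration, not the exact patch): each key from the YAML file that matches a known option is applied onto the parsed options, and unknown keys are reported and skipped.

```python
import yaml  # PyYAML; see the discussion below about bundling it

def apply_config_file(opts, conf_path, map_conf):
    """Overwrite parsed option values with those from the YAML file.

    Only keys present in map_conf (i.e. real command-line options)
    are applied; anything else is reported and skipped.
    """
    with open(conf_path) as conf_file:
        config = yaml.safe_load(conf_file) or {}
    for key, value in config.items():
        if key in map_conf:
            setattr(opts, key, value)  # optparse Values supports setattr
        else:
            print("Ignoring unknown config key: %s" % key)
    return opts
```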
Do you think this solution could do the job?
Yes, I think it's a step up to have a function that does the mapping for us.
Regarding JSON vs. something else, I am biased since I chose YAML for Flintrock. 😄 The main problem I have with JSON is that you can't put comments in it. In Flintrock I also used Click to parse command-line arguments, since that lets the user load configs from a file but override individual settings at the command-line.
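For example, the same kind of settings as in the config file above read naturally with inline comments in YAML, something JSON has no syntax for (an illustrative snippet, not from the patch):

```yaml
slaves: 2                 # scale this up before a big job
instance_type: r3.xlarge  # memory-optimized workers
spot_price: 0.07          # max bid in USD/hour
```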
Obviously, we don't have to do that here, as it would be a pretty big change. If this works well enough for @shivaram then it's fine by me too.
I prefer YAML as well compared to JSON -- it's just more human-friendly to read and format. @tirami-su Would it be possible to have a similar function that generates the mapping for YAML?
Also, for YAML parsing, we should download the library similarly to how we download boto -- that should make it transparent to the user?
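A minimal sketch of that idea, assuming the pure-Python PyYAML source tarball is fetched from PyPI and unpacked next to the script, similar to how boto is handled (the version, URL, and helper name here are placeholders, not spark-ec2's actual code):

```python
import os
import sys
import tarfile
import urllib2  # spark-ec2 targets Python 2

# Placeholder version/URL -- pin whichever release gets tested.
PYYAML_VERSION = "3.11"
PYYAML_URL = ("https://pypi.python.org/packages/source/P/PyYAML/"
              "PyYAML-%s.tar.gz" % PYYAML_VERSION)
LIB_DIR = os.path.join(os.path.dirname(os.path.realpath(__file__)), "lib")

def ensure_pyyaml():
    # The pure-Python package lives under PyYAML-x.y/lib/ in the tarball.
    target = os.path.join(LIB_DIR, "PyYAML-%s" % PYYAML_VERSION, "lib")
    if not os.path.isdir(target):
        if not os.path.isdir(LIB_DIR):
            os.makedirs(LIB_DIR)
        tarball = os.path.join(LIB_DIR, "pyyaml.tar.gz")
        with open(tarball, "wb") as out:
            out.write(urllib2.urlopen(PYYAML_URL).read())
        with tarfile.open(tarball) as tar:
            tar.extractall(LIB_DIR)
        os.remove(tarball)
    sys.path.insert(0, target)

ensure_pyyaml()
import yaml  # now importable without the user installing anything
```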
@nchammas haha, you're biased but you're right :smile:!
@shivaram I will take a look and let you know; I think it should be possible. Yep, you're right, we should do this :+1:!
@nchammas, @shivaram I pushed the new version with the YAML configuration file format. You can check it out :)!
@tirami-su you will probably also want to add a note to the README, perhaps along with a brief example config file, to show users how this feature works.
@nchammas No prob, I will add a description of this feature with an example.