spark-ec2
Config file devel
Configuration file
The -c (or --conf-file) option lets you use a YAML configuration file to simplify the way you interact with the cluster.
Launch new cluster
```
./spark-ec2 --conf-file config.yml launch my_cluster
```
Example of a configuration file
```yaml
slaves: 2
instance_type: r3.xlarge
region: eu-west-1
zone: eu-west-1c
ami: ami-xxxxxxxx
spot_price: 0.07
hadoop_major_version: yarn
spark_ec2_git_repo: https://github.com/tirami-su/spark-ec2
spark_ec2_git_branch: config_file-devel
copy_aws_credentials: true
no_ganglia: y
resume: on

## Network and security
key_pair: xxxxx
identity_file: xxxxx.pem
credentials:
  aws_access_key_id: XXXXX
  aws_secret_access_key: XXXXX
```
Note
- In this version, when the -c or --conf-file option is specified, only the parameters present in the configuration file are used; any other options on the command line are ignored.
- The identity file must be located in ~/.ssh/.
Modifications
- JSON to YAML
- Keys must be located in ~/.ssh/
This is a much-requested feature from the early days of spark-ec2, so thank you for working on this.
I wonder, though, if there is a way to implement this feature without using JSON -- which IMO is not appropriate as a configuration format -- and without manually maintaining this mapping of config name to variable name.
Are there no features of argparse that can be used to offer this functionality in a more straightforward way?
I really enjoy using this feature; it lets me make fewer mistakes when managing a cluster. But I didn't know it was a much-requested feature.
I thought JSON was an appropriate format; it seems clear and simple to manipulate. We could use ConfigParser, which uses the INI file format. The YAML format could be nice too, but it requires installing the PyYAML library...
I didn't find a feature of argparse that gives us what we want, but I wrote a function to avoid maintaining the mapping manually:
```python
unneeded_opts = ("version", "help")

def mapping_conf(opt):
    # str(opt) on an optparse Option yields e.g. "-c/--conf-file";
    # keep the long form and derive the config key from it.
    normal_opt = str(opt).split("/")[-1]
    trans_opt = normal_opt.strip("-").replace("-", "_")
    return {trans_opt: normal_opt}

# `parser` is the script's existing optparse.OptionParser.
map_conf = {}
for opt in parser.option_list:
    map_conf.update(mapping_conf(opt))
for unneeded_opt in unneeded_opts:
    map_conf.pop(unneeded_opt, None)
```
map_conf:
```python
{
'D': '-D',
'additional_security_group': '--additional-security-group',
'additional_tags': '--additional-tags',
'ami': '--ami',
'authorized_address': '--authorized-address',
'conf_file': '--conf-file',
'copy_aws_credentials': '--copy-aws-credentials',
'delete_groups': '--delete-groups',
'deploy_root_dir': '--deploy-root-dir',
'ebs_vol_num': '--ebs-vol-num',
'ebs_vol_size': '--ebs-vol-size',
'ebs_vol_type': '--ebs-vol-type',
'ganglia': '--ganglia',
'hadoop_major_version': '--hadoop-major-version',
'identity_file': '--identity-file',
'instance_initiated_shutdown_behavior': '--instance-initiated-shutdown-behavior',
'instance_profile_name': '--instance-profile-name',
'instance_type': '--instance-type',
'key_pair': '--key-pair',
'master_instance_type': '--master-instance-type',
'master_opts': '--master-opts',
'no_ganglia': '--no-ganglia',
'placement_group': '--placement-group',
'private_ips': '--private-ips',
'profile': '--profile',
'region': '--region',
'resume': '--resume',
'slaves': '--slaves',
'spark_ec2_git_branch': '--spark-ec2-git-branch',
'spark_ec2_git_repo': '--spark-ec2-git-repo',
'spark_git_repo': '--spark-git-repo',
'spark_version': '--spark-version',
'spot_price': '--spot-price',
'subnet_id': '--subnet-id',
'swap': '--swap',
'use_existing_master': '--use-existing-master',
'user': '--user',
'user_data': '--user-data',
'vpc_id': '--vpc-id',
'wait': '--wait',
'worker_instances': '--worker-instances',
'zone': '--zone'
}
```
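To sketch how the mapping could then be consumed (the helper name and merge logic below are just an illustration, not the exact patch): each key from the YAML file that matches a known option is applied onto the parsed options, and unknown keys are reported and skipped.

```python
import yaml  # PyYAML; see the discussion below about bundling it

def apply_config_file(opts, conf_path, map_conf):
    """Overwrite parsed option values with those from the YAML file.

    Only keys present in map_conf (i.e. real command-line options)
    are applied; anything else is reported and skipped.
    """
    with open(conf_path) as conf_file:
        config = yaml.safe_load(conf_file) or {}
    for key, value in config.items():
        if key in map_conf:
            setattr(opts, key, value)  # optparse Values supports setattr
        else:
            print("Ignoring unknown config key: %s" % key)
    return opts
```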
Do you think this solution could do the job?
Yes, I think it's a step up to have a function that does the mapping for us.
Regarding JSON vs. something else, I am biased since I chose YAML for Flintrock. 😄 The main problem I have with JSON is that you can't put comments in it. In Flintrock I also used Click to parse command-line arguments, since that lets the user load configs from a file but override individual settings at the command-line.
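For example, the same kind of settings as in the config file above read naturally with inline comments in YAML, something JSON has no syntax for (an illustrative snippet, not from the patch):

```yaml
slaves: 2                 # scale this up before a big job
instance_type: r3.xlarge  # memory-optimized workers
spot_price: 0.07          # max bid in USD/hour
```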
Obviously, we don't have to do that here, as it would be a pretty big change. If this works well enough for @shivaram then it's fine by me too.
I prefer YAML as well compared to JSON -- it's just more human-friendly to read and format. @tirami-su Would it be possible to have a similar function that generates the mapping for YAML?
Also, for YAML parsing, we should download the library similarly to how we download boto -- that should make it transparent to the user?
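A minimal sketch of that idea, assuming the pure-Python PyYAML source tarball is fetched from PyPI and unpacked next to the script, similar to how boto is handled (the version, URL, and helper name here are placeholders, not spark-ec2's actual code):

```python
import os
import sys
import tarfile
import urllib2  # spark-ec2 targets Python 2

# Placeholder version/URL -- pin whichever release gets tested.
PYYAML_VERSION = "3.11"
PYYAML_URL = ("https://pypi.python.org/packages/source/P/PyYAML/"
              "PyYAML-%s.tar.gz" % PYYAML_VERSION)
LIB_DIR = os.path.join(os.path.dirname(os.path.realpath(__file__)), "lib")

def ensure_pyyaml():
    # The pure-Python package lives under PyYAML-x.y/lib/ in the tarball.
    target = os.path.join(LIB_DIR, "PyYAML-%s" % PYYAML_VERSION, "lib")
    if not os.path.isdir(target):
        if not os.path.isdir(LIB_DIR):
            os.makedirs(LIB_DIR)
        tarball = os.path.join(LIB_DIR, "pyyaml.tar.gz")
        with open(tarball, "wb") as out:
            out.write(urllib2.urlopen(PYYAML_URL).read())
        with tarfile.open(tarball) as tar:
            tar.extractall(LIB_DIR)
        os.remove(tarball)
    sys.path.insert(0, target)

ensure_pyyaml()
import yaml  # now importable without the user installing anything
```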
@nchammas haha, you're biased but you're right :smile:!
@shivaram I will take a look and let you know; I think it should be possible. Yep, you're right, we should do this :+1:!
@nchammas, @shivaram I pushed the new version with the YAML configuration file format. You can check it out :)!
@tirami-su you will probably also want to add a note to the README, perhaps along with a brief example config file, to show users how this feature works.
@nchammas No prob, I will add a description of this feature with an example.