dnload_raw operation requires the ability to `sudo -u hdfs`
The eggo CLI tool for downloading a dataset onto a CDH cluster uses Hadoop Streaming essentially as a job scheduler to download files into HDFS.
The CLI here: https://github.com/bigdatagenomics/eggo/blob/c0e980f6581e85d4687de625af2957906d446c22/eggo/cli/datasets.py#L33
The operation code is here: https://github.com/bigdatagenomics/eggo/blob/c0e980f6581e85d4687de625af2957906d446c22/eggo/operations.py#L38-L84
The mapper script is here: https://github.com/bigdatagenomics/eggo/blob/c0e980f6581e85d4687de625af2957906d446c22/eggo/resources/download_mapper.py
The user creates a temporary HDFS directory to receive the data: https://github.com/bigdatagenomics/eggo/blob/c0e980f6581e85d4687de625af2957906d446c22/eggo/operations.py#L40
The MapReduce job then downloads the data into that directory using curl, but its tasks run as user "yarn".
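For context, this is roughly what such a streaming mapper looks like (a minimal sketch, not the actual download_mapper.py; the staging path, env var, and use of `hadoop fs -put` are assumptions for illustration):

```python
#!/usr/bin/env python
# Minimal sketch of a Hadoop Streaming "download" mapper (illustrative only).
# Each input line is assumed to be a source URL; the mapper curls it into the
# task's local working directory and puts it into a staging directory in HDFS.
# Because the streaming task runs as user "yarn", the files it writes to HDFS
# end up owned by "yarn".
import os
import sys
import subprocess

STAGING_DIR = os.environ.get('STAGING_DIR', '/tmp/eggo_staging')  # hypothetical

for line in sys.stdin:
    url = line.strip()
    if not url:
        continue
    local_name = os.path.basename(url)
    subprocess.check_call(['curl', '-s', '-L', '-o', local_name, url])
    subprocess.check_call(['hadoop', 'fs', '-put', local_name, STAGING_DIR])
    os.remove(local_name)
    # Emit something so the job has output to report
    print('%s\tdone' % url)
```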
After the dataset is downloaded, we create the final output directory: https://github.com/bigdatagenomics/eggo/blob/c0e980f6581e85d4687de625af2957906d446c22/eggo/operations.py#L78
And ideally we'd just move all the data there. However, all of the data is owned by user "yarn", which causes lots of permissions problems downstream. Instead, we chown all the data here:
https://github.com/bigdatagenomics/eggo/blob/c0e980f6581e85d4687de625af2957906d446c22/eggo/operations.py#L79-L81
(note: this chowns to user ec2-user, but it's easy to change to whatever the current user is)
which requires the sudo capability.
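Concretely, the fix-up boils down to something like this (a sketch, not the exact operations.py code; the function name and hard-coded user are mine):

```python
# Sketch of the ownership fix-up. The files in the staging directory are owned
# by "yarn", so the chown has to be issued as the HDFS superuser, which is why
# `sudo -u hdfs` is required on the client machine.
import subprocess

def chown_staged_data(hdfs_path, user='ec2-user'):
    # Runs: sudo -u hdfs hadoop fs -chown -R <user> <hdfs_path>
    subprocess.check_call(
        ['sudo', '-u', 'hdfs', 'hadoop', 'fs', '-chown', '-R', user, hdfs_path])
```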
Any way around this? cc @tomwhite
There are a couple of options:
- Set `dfs.permissions.enabled` to false, so that permission checking is disabled.
- Enable the LinuxContainerExecutor so that containers run as the user that submitted the job.
The second is preferable from a security point of view. See the following for more info, and the sketch after these links for a quick way to check whether a cluster already has it enabled:
- https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/SecureContainer.html
- http://www.cloudera.com/content/www/en-us/documentation/archive/cdh/4-x/4-3-0/CDH4-Security-Guide/cdh4sg_topic_18_3.html
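For what it's worth, the relevant setting is `yarn.nodemanager.container-executor.class` in yarn-site.xml. A quick check for whether the LinuxContainerExecutor is already in place might look like this (the config file path is an assumption and varies by distribution):

```python
# Check whether the NodeManager is configured to use the LinuxContainerExecutor.
# The yarn-site.xml location is an assumption (it differs between packaged
# installs, CDH parcels, and vanilla Hadoop).
import xml.etree.ElementTree as ET

YARN_SITE = '/etc/hadoop/conf/yarn-site.xml'  # assumed location
LCE_CLASS = ('org.apache.hadoop.yarn.server.nodemanager.'
             'LinuxContainerExecutor')

def container_executor_class(path=YARN_SITE):
    root = ET.parse(path).getroot()
    for prop in root.findall('property'):
        if prop.findtext('name') == 'yarn.nodemanager.container-executor.class':
            return prop.findtext('value')
    return None  # not set: YARN falls back to the DefaultContainerExecutor

if __name__ == '__main__':
    print(container_executor_class() == LCE_CLASS)
```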
So anyone who wants to use the download tool on their cluster has to set up the LinuxContainerExecutor? Do you think that's an overly restrictive requirement? Should I just not be using MapReduce to download the datasets?