dnload_raw operation requires the ability to `sudo -u hdfs`
The eggo CLI tool for downloading a dataset onto a CDH cluster uses Hadoop Streaming essentially as a job scheduler to download files into HDFS.
The CLI here: https://github.com/bigdatagenomics/eggo/blob/c0e980f6581e85d4687de625af2957906d446c22/eggo/cli/datasets.py#L33
The operation code is here: https://github.com/bigdatagenomics/eggo/blob/c0e980f6581e85d4687de625af2957906d446c22/eggo/operations.py#L38-L84
The mapper script is here: https://github.com/bigdatagenomics/eggo/blob/c0e980f6581e85d4687de625af2957906d446c22/eggo/resources/download_mapper.py
The user creates a temporary HDFS directory to receive the data: https://github.com/bigdatagenomics/eggo/blob/c0e980f6581e85d4687de625af2957906d446c22/eggo/operations.py#L40
The MapReduce job then downloads the data into that directory using curl, but its tasks run as user "yarn".
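For context, this is roughly what such a streaming mapper looks like (a minimal sketch, not the actual download_mapper.py; the staging path, env var, and use of `hadoop fs -put` are assumptions for illustration):

```python
#!/usr/bin/env python
# Minimal sketch of a Hadoop Streaming "download" mapper (illustrative only).
# Each input line is assumed to be a source URL; the mapper curls it into the
# task's local working directory and puts it into a staging directory in HDFS.
# Because the streaming task runs as user "yarn", the files it writes to HDFS
# end up owned by "yarn".
import os
import sys
import subprocess

STAGING_DIR = os.environ.get('STAGING_DIR', '/tmp/eggo_staging')  # hypothetical

for line in sys.stdin:
    url = line.strip()
    if not url:
        continue
    local_name = os.path.basename(url)
    subprocess.check_call(['curl', '-s', '-L', '-o', local_name, url])
    subprocess.check_call(['hadoop', 'fs', '-put', local_name, STAGING_DIR])
    os.remove(local_name)
    # Emit something so the job has output to report
    print('%s\tdone' % url)
```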
After the dataset is downloaded, we create the final output directory: https://github.com/bigdatagenomics/eggo/blob/c0e980f6581e85d4687de625af2957906d446c22/eggo/operations.py#L78
And ideally we'd just move all the data there. However, all of the data is owned by user "yarn", which causes lots of permissions problems downstream. Instead, we chown all the data here:
https://github.com/bigdatagenomics/eggo/blob/c0e980f6581e85d4687de625af2957906d446c22/eggo/operations.py#L79-L81
(note: this chowns to user ec2-user, but it's easy to change to whatever the current user is)
which requires the sudo capability.
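Concretely, the fix-up boils down to something like this (a sketch, not the exact operations.py code; the function name and hard-coded user are mine):

```python
# Sketch of the ownership fix-up. The files in the staging directory are owned
# by "yarn", so the chown has to be issued as the HDFS superuser, which is why
# `sudo -u hdfs` is required on the client machine.
import subprocess

def chown_staged_data(hdfs_path, user='ec2-user'):
    # Runs: sudo -u hdfs hadoop fs -chown -R <user> <hdfs_path>
    subprocess.check_call(
        ['sudo', '-u', 'hdfs', 'hadoop', 'fs', '-chown', '-R', user, hdfs_path])
```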
Any way around this? cc @tomwhite
There are a couple of options:
- Set `dfs.permissions.enabled` to false, so that permission checking is disabled.
- Enable the LinuxContainerExecutor so that containers run as the user that submitted the job.
The second is preferable from a security point of view. See the following for more info, and the sketch after these links for a quick way to check whether a cluster already has it enabled:
- https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/SecureContainer.html
- http://www.cloudera.com/content/www/en-us/documentation/archive/cdh/4-x/4-3-0/CDH4-Security-Guide/cdh4sg_topic_18_3.html
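For what it's worth, the relevant setting is `yarn.nodemanager.container-executor.class` in yarn-site.xml. A quick check for whether the LinuxContainerExecutor is already in place might look like this (the config file path is an assumption and varies by distribution):

```python
# Check whether the NodeManager is configured to use the LinuxContainerExecutor.
# The yarn-site.xml location is an assumption (it differs between packaged
# installs, CDH parcels, and vanilla Hadoop).
import xml.etree.ElementTree as ET

YARN_SITE = '/etc/hadoop/conf/yarn-site.xml'  # assumed location
LCE_CLASS = ('org.apache.hadoop.yarn.server.nodemanager.'
             'LinuxContainerExecutor')

def container_executor_class(path=YARN_SITE):
    root = ET.parse(path).getroot()
    for prop in root.findall('property'):
        if prop.findtext('name') == 'yarn.nodemanager.container-executor.class':
            return prop.findtext('value')
    return None  # not set: YARN falls back to the DefaultContainerExecutor

if __name__ == '__main__':
    print(container_executor_class() == LCE_CLASS)
```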
So anyone who wants to use the download tool on their cluster has to set up the LinuxContainerExecutor? Do you think that's an overly restrictive requirement? Should I just not be using MapReduce to download the datasets?