mrjob
Run MapReduce jobs on Hadoop or Amazon Web Services
Sometimes mrjob is being run in an environment where only the `StepFailedException` gets through. It might be helpful to be able to tag `StepFailedException` with arbitrary information (e.g. `cluster_id`).
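A minimal sketch of what such tagging could look like. It uses a stand-in exception class rather than mrjob's own `StepFailedException`; the class name `TaggedStepFailedException` and the free-form `**tags` keywords are hypothetical and not part of mrjob.

```python
class TaggedStepFailedException(Exception):
    """Hypothetical stand-in for a step-failure exception that carries tags."""

    def __init__(self, reason=None, **tags):
        self.reason = reason
        # Arbitrary key/value context supplied by the caller,
        # e.g. cluster_id='j-123' (assumed usage, not mrjob's API).
        self.tags = dict(tags)
        super().__init__(reason)

    def __str__(self):
        msg = 'Step failed'
        if self.reason:
            msg += ': %s' % self.reason
        if self.tags:
            # Sort keys so the message is stable across runs.
            tag_str = ', '.join(
                '%s=%s' % (k, v) for k, v in sorted(self.tags.items()))
            msg += ' (%s)' % tag_str
        return msg
```

With this shape, the tags survive in the exception message itself, so they still get through even when nothing but the exception does.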
I am trying to run a sample mrjob job on an EMR cluster. I created the EMR cluster manually in the AWS dashboard and started mrjob as follows: python keywords.py -r emr s3://commoncrawl/crawl-data/CC-MAIN-2018-34/wet.paths.gz --cluster-id...
While updating mrjob to support custom AMIs (#1805) is a good start, it's still significant work to roll your own AMI and keep it up-to-date. Instead, mrjob should look to...
mrjob has historically mocked out various AWS services. Currently this code lives in `tests/mock_boto3`. The [moto](https://github.com/spulec/moto) library does basically the same thing. mrjob should probably try to move to moto,...
If you pass Hadoop a directory as input, it reads all non-"hidden" files (files whose names don't start with `_` or `.`) in that directory, but doesn't recurse into subdirectories...
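The file-selection rule described above can be imitated in plain Python. This is only a local-filesystem sketch of the described behavior, not Hadoop's actual implementation, and `hadoop_style_input_files` is a hypothetical helper name.

```python
import os


def hadoop_style_input_files(dir_path):
    """List files directly inside dir_path, skipping "hidden" names
    (those starting with '_' or '.') and not descending into
    subdirectories."""
    paths = []
    for name in sorted(os.listdir(dir_path)):
        if name.startswith(('_', '.')):
            continue  # hidden file or directory: ignored
        path = os.path.join(dir_path, name)
        if os.path.isfile(path):  # subdirectories are not recursed into
            paths.append(path)
    return paths
```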
It's undocumented, but it's valid to write command line options that aren't passthrough. See my comment on #198 to implement an `--add-libjar` option.
When people use the same job flow for several jobs, they like to be able to just leave the same SSH tunnel open. Currently, SSH tunnels are tied to runners,...
This might be solvable another way, but I was wondering whether it would make sense to add connect and read timeout parameters to mrjob.conf since...
Would be nice to have a way to run a script on the master node before running our job. Example applications: - copying jars to the local filesystem to support...
Looks like we should be able to automatically [create key pairs through the EC2 API](http://docs.aws.amazon.com/AWSEC2/latest/APIReference/API_CreateKeyPair.html) so that SSH will always work. Some things to consider: - should be a way...
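As a rough sketch of the key-pair idea, assuming a boto3 EC2 client is passed in: `create_key_pair(KeyName=...)` is the boto3 call for the linked API and returns the private key under `KeyMaterial`. The helper name `create_and_save_key_pair` and the choice to write the key to a local `.pem` file are assumptions for illustration, not mrjob behavior.

```python
import os


def create_and_save_key_pair(ec2_client, key_name, pem_path):
    """Create an EC2 key pair and save its private key to pem_path.

    ec2_client is expected to behave like boto3.client('ec2').
    """
    resp = ec2_client.create_key_pair(KeyName=key_name)
    # Request owner-only permissions (0600) when creating the file,
    # since ssh refuses private keys that other users can read.
    fd = os.open(pem_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, 'w') as f:
        f.write(resp['KeyMaterial'])
    return resp['KeyName']
```

In real use the client would come from `boto3.client('ec2')`, and the private key is only available in the response to this one call, so it has to be stored at creation time.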