mrjob
Run MapReduce jobs on Hadoop or Amazon Web Services
I tried the wordcount example on Hadoop (EMR on AWS). I get the following error from `python3 job.py -r hadoop hdfs:///user/hadoop/input.txt`: [hadoop@ip-172-31-36-214 wordcount]$ python3 job.py -r hadoop hdfs:///user/hadoop/input.txt No configs found;...
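For context, this is what the wordcount job computes. The sketch below simulates the mapper/reducer semantics in plain Python, with the Hadoop shuffle emulated by a dict, so it runs without Hadoop or mrjob installed; it is not the mrjob API itself:

```python
from collections import defaultdict

def mapper(line):
    # emit (word, 1) for every word on the line, as the wordcount mapper does
    for word in line.split():
        yield word, 1

def reducer(word, counts):
    # sum the counts for one word, as the wordcount reducer does
    yield word, sum(counts)

def run_local(lines):
    # emulate Hadoop's shuffle/sort step with a dict; a local sketch of
    # the semantics only, not how mrjob actually executes the job
    grouped = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            grouped[key].append(value)
    return dict(kv for key in grouped for kv in reducer(key, grouped[key]))
```

(The "No configs found" message itself is mrjob reporting that no `mrjob.conf` was loaded, separate from whatever the job then does.)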
Our fs interface now has `put()` but no `get()`. It would be relatively trivial to add one: just open a file and cat chunks into it.
When a Spark job on Dataproc fails, we should be able to parse the logs and find the cause of the error.
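As a starting point, a parser could scan driver output for the last Java `Caused by:` line or the final line of the last Python traceback; the heuristic and log layout assumed below are illustrative, not Dataproc's actual format:

```python
import re

def find_probable_cause(log_text):
    # Heuristic sketch: the last "Caused by: ..." (Java) or the final
    # line of the last Python traceback is usually the most specific error.
    caused_by = re.findall(r'^Caused by: (.+)$', log_text, re.MULTILINE)
    if caused_by:
        return caused_by[-1]
    tracebacks = log_text.split('Traceback (most recent call last):')
    if len(tracebacks) > 1:
        # last non-blank line of the final traceback block
        lines = [l for l in tracebacks[-1].splitlines() if l.strip()]
        if lines:
            return lines[-1].strip()
    return None
```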
This wouldn't be very difficult; you basically run `ssh ... cat ` and pipe in the file contents. But it's also not especially useful.
These would save the user from having to import `pyspark`, and could also set up `SparkConf` for you. Probably mostly matters for the inline runner (see #1965).
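A sketch of what such a helper might look like. The `spark.*` passthrough rule and both function names are assumptions for illustration; only the pyspark builder calls (`appName`, `config`, `getOrCreate`) are the real API:

```python
def jobconf_to_spark_conf(jobconf):
    # Translate jobconf-style keys into (key, value) pairs that could be
    # handed to SparkConf/SparkSession; passing through only spark.* keys
    # is an assumption made for this sketch.
    return sorted((k, str(v)) for k, v in jobconf.items()
                  if k.startswith('spark.'))

def make_spark_session(jobconf, app_name='mrjob'):
    # Hypothetical convenience wrapper: the user never imports pyspark.
    from pyspark.sql import SparkSession
    builder = SparkSession.builder.appName(app_name)
    for key, value in jobconf_to_spark_conf(jobconf):
        builder = builder.config(key, value)
    return builder.getOrCreate()
```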
For complete support of `MRJob`, it would be helpful for the Spark harness to be able to implement e.g. `mapper_cmd()` and `mapper_pre_filter()`. This isn't actually that difficult to...
I am trying to find keywords in the CommonCrawl archive. When I run it with one wet.gz file, it works fine. But if I try to run our script with...
We may want to consider eventually dropping support for Spark 1. At the very least, the Spark harness, which allows running MRJobs on the Spark runner (see #1838), will be...
mockhadoop tests are slow and hard to debug. mockhadoop doesn't support generic options (e.g. `-D`, `-jobconf`).
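Supporting generic options would mostly be argument parsing. A sketch (the function name and exact rules are assumptions) of how mockhadoop could collect `-D`/`-jobconf` pairs before dispatching the rest of the command line:

```python
def parse_generic_opts(args):
    # Collect -D key=value / -jobconf key=value pairs into a jobconf
    # dict and return the remaining args untouched.
    jobconf, rest = {}, []
    it = iter(args)
    for arg in it:
        if arg in ('-D', '-jobconf'):
            key, _, value = next(it).partition('=')
            jobconf[key] = value
        elif arg.startswith('-D'):
            # also accept the fused form, e.g. -Dkey=value
            key, _, value = arg[2:].partition('=')
            jobconf[key] = value
        else:
            rest.append(arg)
    return jobconf, rest
```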