mrjob
Run MapReduce jobs on Hadoop or Amazon Web Services
If you specify an invalid JAR on EMR, it actually fails inside the controller, creating a "controller" log but no syslog or stderr log. I think the logic for the...
The way we patch mrjob.conf in tests is cumbersome and gets underfoot. The point was not to have the user's default mrjob.conf or `$MRJOB_CONF` affect tests. But the framework also...
Just noticed in https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/cluster-properties that you can allow zero core instances by setting the property `dataproc:dataproc.allow.zero.workers` to `true`. Currently, the assumption that at least 2 workers are required is hard-coded...
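As a rough illustration of the property mentioned above, here is a minimal sketch of a cluster config that sets it whenever zero workers are requested. The helper name `build_cluster_config` is hypothetical and not part of mrjob; the dict layout assumes the `workerConfig.numInstances` and `softwareConfig.properties` fields of the Dataproc REST `ClusterConfig` resource.

```python
def build_cluster_config(num_workers, properties=None):
    """Build a ClusterConfig-style dict (hypothetical helper, not mrjob's API).

    If zero workers are requested, set the Dataproc property that
    allows a cluster with no worker instances.
    """
    if num_workers < 0:
        raise ValueError('num_workers must be non-negative')

    # copy so the caller's dict is never mutated
    props = dict(properties or {})

    if num_workers == 0:
        # property name taken from the Dataproc cluster properties docs
        props['dataproc:dataproc.allow.zero.workers'] = 'true'

    return {
        'masterConfig': {'numInstances': 1},
        'workerConfig': {'numInstances': num_workers},
        'softwareConfig': {'properties': props},
    }
```

With `num_workers=0` the property is added; with any positive count the properties are passed through unchanged.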
It looks like this is directly supported by the API; we just have to add our libjars to the `jarFileUris` field of the [HadoopJob](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs#HadoopJob) when we submit a job.
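A minimal sketch of what such a job body might look like, assuming the field names from the Dataproc REST `HadoopJob` reference (`mainJarFileUri`, `args`, `jarFileUris`). The helper `build_hadoop_job` and the example URIs are hypothetical, and this only builds the request dict; it does not submit anything.

```python
def build_hadoop_job(cluster_name, main_jar_uri, args, libjar_uris):
    """Build a Dataproc Job-style dict with libjars attached.

    Hypothetical helper for illustration; not how mrjob builds jobs.
    """
    return {
        'placement': {'clusterName': cluster_name},
        'hadoopJob': {
            'mainJarFileUri': main_jar_uri,
            'args': list(args),
            # extra JARs added to the classpath of the driver and tasks
            'jarFileUris': list(libjar_uris),
        },
    }


# example values below are made up
job = build_hadoop_job(
    'example-cluster',
    'gs://example-bucket/jars/streaming.jar',
    ['-input', 'gs://example-bucket/in', '-output', 'gs://example-bucket/out'],
    ['gs://example-bucket/jars/my-format.jar'],
)
```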
Since we're reading logs from Stackdriver, the cluster no longer has to exist. Targeting a particular step gets a little tricky, though we could filter by application_id.
Currently, we just stream raw lines from the driver output, rather than tagging them with path and line_num. This is a fairly easy fix; just not so important because we...
I've noticed that my jobs frequently run out of disk space on EMR. I'm using an outdated AMI (`2.4.11`) but `mrjob-5.10`. I'm leveraging the great pooling features in `mrjob`...
This is relevant to #754, which I'm in the process of testing. I'd like to use `gsutil` to download input files from Google Cloud Storage to Hadoop nodes, rather than...
If you give EMR an S3 output path that's in another region, your job fails with this error: ``` Exception in thread "main" com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: Moved Permanently (Service: Amazon S3; Status...
Protocols should be allowed to have `HADOOP_*_FORMAT` and `JOBCONF` fields, as well as `hadoop_*_format()` and `jobconf()` methods, which supply defaults if something is not already specified for that step. That...
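To make the "defaults unless already specified" idea concrete, here is a minimal sketch of one way the resolution could work. Everything in it is hypothetical: `ExampleProtocol`, its attributes on a protocol class, and `resolve_step_settings` illustrate the proposal and are not existing mrjob APIs.

```python
class ExampleProtocol:
    """Hypothetical protocol that carries Hadoop defaults."""

    JOBCONF = {'mapreduce.output.fileoutputformat.compress': 'true'}
    HADOOP_OUTPUT_FORMAT = 'org.apache.hadoop.mapred.SequenceFileOutputFormat'

    def jobconf(self):
        # return a copy so callers can't mutate the class attribute
        return dict(self.JOBCONF)

    def hadoop_output_format(self):
        return self.HADOOP_OUTPUT_FORMAT


def resolve_step_settings(protocol, step_jobconf=None, step_output_format=None):
    """Merge protocol defaults with step settings; the step always wins."""
    jobconf = protocol.jobconf()
    jobconf.update(step_jobconf or {})
    output_format = step_output_format or protocol.hadoop_output_format()
    return jobconf, output_format
```

The step's own values take precedence, so a protocol can only fill in settings the step left unset.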