mrjob
Run MapReduce jobs on Hadoop or Amazon Web Services
AWS's own tags start with `aws:`. We should do the same with mrjob tag names, e.g. `mrjob:version` rather than `__mrjob_version` and `mrjob:pool:name` rather than `__mrjob_pool_name`. Old tag names used by...
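A minimal sketch of the renaming proposed above, assuming every legacy tag starts with `__mrjob_` and that the remaining underscores become colons. This matches the two examples given. The helper name `to_colon_tag` is hypothetical and is not part of mrjob.

```python
LEGACY_PREFIX = "__mrjob_"  # prefix used by the old-style tag names quoted above


def to_colon_tag(old_name):
    """Map a legacy tag name to the proposed colon-separated form.

    Hypothetical helper: names without the legacy prefix are returned
    unchanged, so user-defined and `aws:` tags pass through untouched.
    """
    if not old_name.startswith(LEGACY_PREFIX):
        return old_name
    suffix = old_name[len(LEGACY_PREFIX):]
    # Assumption: underscores in the remainder separate name components.
    return "mrjob:" + suffix.replace("_", ":")


if __name__ == "__main__":
    for name in ("__mrjob_version", "__mrjob_pool_name"):
        print(name, "->", to_colon_tag(name))
```

A mapping like this could also be used in reverse to keep recognizing clusters that were tagged under the old names.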
Python 3.8 was released last October. Currently, tests do not pass with Python 3.8. The problem seems to be with `pyspark`.
Am I doing something wrong or is the `--conf-path` / `-c` option not supported when trying to run a PySpark script using `mrjob spark-submit`? For example, I have an EMR...
Deprecation warnings are raised due to invalid escape sequences in Python 3.8. Below is a log of the warnings raised while compiling all the Python files. Using raw strings...
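To illustrate the kind of warning described above, here is a small sketch that compiles two one-line snippets, one with a plain string literal and one with a raw string, and reports whether the compiler complained. The function name `has_invalid_escape` and the sample snippets are illustrative assumptions, not taken from the mrjob code base. The warning category differs between Python versions, so the sketch accepts either `DeprecationWarning` or `SyntaxWarning`.

```python
import warnings


def has_invalid_escape(source):
    """Return True if compiling `source` triggers an escape-sequence warning.

    Illustrative helper. Also returns True on SyntaxError, in case a
    future Python version rejects such literals outright.
    """
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # record every warning, even repeats
        try:
            compile(source, "<snippet>", "exec")
        except SyntaxError:
            return True
    return any(
        issubclass(w.category, (DeprecationWarning, SyntaxWarning))
        for w in caught
    )


# The source text of the first snippet is: pattern = "\d+"
plain_source = 'pattern = "\\d+"'
# The source text of the second snippet is: pattern = r"\d+"
raw_source = 'pattern = r"\\d+"'

if __name__ == "__main__":
    print("plain literal warns:", has_invalid_escape(plain_source))
    print("raw literal warns:", has_invalid_escape(raw_source))
```

Prefixing the literal with `r` leaves the regular expression unchanged while removing the invalid escape sequence from the Python source.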
Sometimes it is useful not to wait for an mrjob job to complete. An important use case is spinning up an EMR job from, e.g., an AWS Lambda function. In...
I have an mrjob job:

```
from mrjob.job import MRJob
from mrjob.step import MRStep
from mrjob.protocol import JSONProtocol, RawValueProtocol, PickleProtocol, JSONValueProtocol
import pandas

class WordCount(MRJob):
    INPUT_PROTOCOL = RawValueProtocol
    INTERNAL_PROTOCOL = PickleProtocol
    ...
```
If you run `python -m mrjob.examples.mr_wc -r dataproc --image-version 1.4`, it fails with:

```
Traceback (most recent call last):
  File "mr_wc.py", line 18, in <module>
    from mrjob.job import MRJob
ImportError: No module...
```
```
from mrjob.job import MRJob
from mrjob.step import MRStep
import re
import numpy as np

WORD_RE = re.compile(r"[\w']+")

class WordCount(MRJob):
    def mapper(self, _, line):
        for word in WORD_RE.findall(line):
            yield (word.lower(),...
```
If an `MRJob` has a very large number of entries associated with the same reducer key, it can be difficult to run through the Spark runner, because all the entries...