mrjob
Run MapReduce jobs on Hadoop or Amazon Web Services
I tried the wordcount example on Hadoop (EMR on AWS). I get the following error from `python3 job.py -r hadoop hdfs:///user/hadoop/input.txt`: [hadoop@ip-172-31-36-214 wordcount]$ python3 job.py -r hadoop hdfs:///user/hadoop/input.txt No configs found;...
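For context, this is what the wordcount job computes. The sketch below simulates the mapper/reducer semantics in plain Python, with the Hadoop shuffle emulated by a dict, so it runs without Hadoop or mrjob installed; it is not the mrjob API itself:

```python
from collections import defaultdict

def mapper(line):
    # emit (word, 1) for every word on the line, as the wordcount mapper does
    for word in line.split():
        yield word, 1

def reducer(word, counts):
    # sum the counts for one word, as the wordcount reducer does
    yield word, sum(counts)

def run_local(lines):
    # emulate Hadoop's shuffle/sort step with a dict; a local sketch of
    # the semantics only, not how mrjob actually executes the job
    grouped = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            grouped[key].append(value)
    return dict(kv for key in grouped for kv in reducer(key, grouped[key]))
```

(The "No configs found" message itself is mrjob reporting that no `mrjob.conf` was loaded, separate from whatever the job then does.)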
Our fs interface now has `put()` but no `get()`. It would be relatively trivial to add one: just open a file and cat chunks into it.
When a Spark job on Dataproc fails, we should be able to parse the logs and find the cause of the error.
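As a starting point, a parser could scan driver output for the last Java `Caused by:` line or the final line of the last Python traceback; the heuristic and log layout assumed below are illustrative, not Dataproc's actual format:

```python
import re

def find_probable_cause(log_text):
    # Heuristic sketch: the last "Caused by: ..." (Java) or the final
    # line of the last Python traceback is usually the most specific error.
    caused_by = re.findall(r'^Caused by: (.+)$', log_text, re.MULTILINE)
    if caused_by:
        return caused_by[-1]
    tracebacks = log_text.split('Traceback (most recent call last):')
    if len(tracebacks) > 1:
        # last non-blank line of the final traceback block
        lines = [l for l in tracebacks[-1].splitlines() if l.strip()]
        if lines:
            return lines[-1].strip()
    return None
```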
This wouldn't be very difficult; you basically run `ssh ... cat ` and pipe in the file contents. But it's also not especially useful.
These would save the user from having to import `pyspark`, and could also set up `SparkConf` for you. Probably mostly matters for the inline runner (see #1965).
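A sketch of what such a helper might look like. The `spark.*` passthrough rule and both function names are assumptions for illustration; only the pyspark builder calls (`appName`, `config`, `getOrCreate`) are the real API:

```python
def jobconf_to_spark_conf(jobconf):
    # Translate jobconf-style keys into (key, value) pairs that could be
    # handed to SparkConf/SparkSession; passing through only spark.* keys
    # is an assumption made for this sketch.
    return sorted((k, str(v)) for k, v in jobconf.items()
                  if k.startswith('spark.'))

def make_spark_session(jobconf, app_name='mrjob'):
    # Hypothetical convenience wrapper: the user never imports pyspark.
    from pyspark.sql import SparkSession
    builder = SparkSession.builder.appName(app_name)
    for key, value in jobconf_to_spark_conf(jobconf):
        builder = builder.config(key, value)
    return builder.getOrCreate()
```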
For complete support of `MRJob`, it would be helpful for the Spark harness to be able to implement e.g. `mapper_cmd()` and `mapper_pre_filter()`. This isn't actually that difficult to...
I am trying to find keywords in the CommonCrawl archive. When I run it with one wet.gz file, it works fine. But if I try to run our script with...
We may want to consider eventually dropping support for Spark 1. At the very least, the Spark harness, which allows running MRJobs on the Spark runner (see #1838), will be...
mockhadoop tests are slow and hard to debug. mockhadoop doesn't support generic options (e.g. `-D`, `-jobconf`).
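Supporting generic options would mostly be argument parsing. A sketch (the function name and exact rules are assumptions) of how mockhadoop could collect `-D`/`-jobconf` pairs before dispatching the rest of the command line:

```python
def parse_generic_opts(args):
    # Collect -D key=value / -jobconf key=value pairs into a jobconf
    # dict and return the remaining args untouched.
    jobconf, rest = {}, []
    it = iter(args)
    for arg in it:
        if arg in ('-D', '-jobconf'):
            key, _, value = next(it).partition('=')
            jobconf[key] = value
        elif arg.startswith('-D'):
            # also accept the fused form, e.g. -Dkey=value
            key, _, value = arg[2:].partition('=')
            jobconf[key] = value
        else:
            rest.append(arg)
    return jobconf, rest
```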