
mrjob error due to unrecognized option: -ex

Open KerenzaDoxolodeo opened this issue 5 years ago • 3 comments

I am running on Windows 10 using the latest mrjob version from conda-forge, with Hadoop 2.8.0. I have a problem running a MapReduce job from mrjob against the Hadoop server.

trial.py

from mrjob.job import MRJob


class Balance(MRJob):

    def mapper(self, _, line):
        # Input lines are space-separated: "<name> <operation> <field> <amount>".
        teks = line.split(" ")
        nama = teks[0]
        value = int(teks[3])
        # Withdrawals and outgoing transfers count against the balance.
        if teks[1] in ["withdraw", "transfer-out"]:
            value *= -1
        yield (nama, value)

    def combiner(self, word, counts):
        # Pre-sum the per-account values on each mapper node.
        yield (word, sum(counts))

    def reducer(self, word, counts):
        # Final balance per account.
        yield word, sum(counts)


if __name__ == '__main__':
    Balance.run()

trial.py works when run with mrjob's built-in MapReduce on a test case from the home directory, i.e. python trial.py input.in
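
For reference, here is a minimal local smoke test of trial.py following the sandbox pattern from the mrjob testing docs (needs a reasonably recent mrjob). The input lines are made up; only fields 0, 1, and 3 are read by the mapper, so the third field is a placeholder.

# Minimal local smoke test for trial.py. The input is hypothetical:
# only fields 0, 1, and 3 are read by the mapper.
from io import BytesIO
from trial import Balance

stdin = BytesIO(b"alice deposit x 100\n"
                b"alice withdraw x 40\n")

mr_job = Balance(args=['-r', 'inline', '-'])  # '-' = read from stdin
mr_job.sandbox(stdin=stdin)

with mr_job.make_runner() as runner:
    runner.run()
    for key, value in mr_job.parse_output(runner.cat_output()):
        print(key, value)  # expect: alice 60

If this prints alice 60, the job logic itself is fine and the failure is in the Hadoop plumbing.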

trialrunner.py

from trial import Balance

# Run against the Hadoop cluster, passing the streaming jar explicitly
# since mrjob's auto-detection fails (see below).
mr_job = Balance(args=['hdfs:///usr/local/testcase.txt', '-v', '-r', 'hadoop',
                       '--hadoop-streaming-jar=hadoop-streaming-2.8.0.jar'])

# mr_job = Balance(args=['testcase.txt'])  # local run, for comparison

with mr_job.make_runner() as runner:
    runner.run()
    for line in runner.stream_output():
        key, value = mr_job.parse_output_line(line)
        print(key, value)

I have verified that testcase.txt is in the HDFS directory, and hadoop-streaming-2.8.0.jar is in the same directory as trialrunner.py. I have a copy of the hadoop-streaming-2.8.0 jar in both the Hadoop directory and the local directory, but mrjob failed to detect either, so I had to specify it manually.
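
As a sketch of working around the jar detection (the share/hadoop/tools/lib layout below is an assumption based on a stock Hadoop 2.8.0 install, not something confirmed here), the jar can be located programmatically and passed explicitly:

# Sketch: find the streaming jar under a stock Hadoop install and pass it
# explicitly, instead of relying on mrjob's auto-detection.
import glob
import os

from trial import Balance

hadoop_home = os.environ.get('HADOOP_HOME', r'C:\hadoop-2.8.0')  # assumed path
jars = glob.glob(os.path.join(hadoop_home, 'share', 'hadoop',
                              'tools', 'lib', 'hadoop-streaming-*.jar'))
if not jars:
    raise RuntimeError('no hadoop-streaming jar found under %s' % hadoop_home)

mr_job = Balance(args=['hdfs:///usr/local/testcase.txt', '-r', 'hadoop',
                       '--hadoop-streaming-jar=' + jars[0]])
# ... then make_runner()/run() as above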

However, running the file fails:

(base) D:\gdp>python trialrunner.py
No configs specified for hadoop runner
Can't fetch history log; missing job ID
No counters found
Can't fetch history log; missing job ID
Can't fetch task logs; missing application ID
Traceback (most recent call last):
  File "trialrunner.py", line 8, in <module>
    runner.run()
  File "C:\Users\User\Anaconda3\lib\site-packages\mrjob\runner.py", line 510, in run
    self._run()
  File "C:\Users\User\Anaconda3\lib\site-packages\mrjob\hadoop.py", line 355, in _run
    self._run_job_in_hadoop()
  File "C:\Users\User\Anaconda3\lib\site-packages\mrjob\hadoop.py", line 485, in _run_job_in_hadoop
    num_steps=self._num_steps())
mrjob.step.StepFailedException: Step 1 of 1 failed: Command '['C:\\hadoop-2.8.0\\bin\\hadoop.CMD', 'jar', 'hadoop-streaming-2.8.0.jar', '-files', 'hdfs:///user/User/tmp/mrjob/trial.User.20190529.040926.484502/files/mrjob.zip#mrjob.zip,hdfs:///user/User/tmp/mrjob/trial.User.20190529.040926.484502/files/setup-wrapper.sh#setup-wrapper.sh,hdfs:///user/User/tmp/mrjob/trial.User.20190529.040926.484502/files/trial.py#trial.py', '-input', 'hdfs:///usr/local/testcase.txt', '-output', 'hdfs:///user/User/tmp/mrjob/trial.User.20190529.040926.484502/output', '-mapper', '/bin/sh -ex setup-wrapper.sh python3 trial.py --step-num=0 --mapper', '-combiner', '/bin/sh -ex setup-wrapper.sh python3 trial.py --step-num=0 --combiner', '-reducer', '/bin/sh -ex setup-wrapper.sh python3 trial.py --step-num=0 --reducer']' returned non-zero exit status 1.

I decided to trigger the command above manually, and got the following error.

(base) D:\gdp>C:\\hadoop-2.8.0\\bin\\hadoop.CMD jar hadoop-streaming-2.8.0.jar -files hdfs:///user/User/tmp/mrjob/trial.User.20190528.113822.387782/files/mrjob.zip#mrjob.zip,hdfs:///user/User/tmp/mrjob/trial.User.20190528.113822.387782/files/setup-wrapper.sh#setup-wrapper.sh,hdfs:///user/User/tmp/mrjob/trial.User.20190528.113822.387782/files/trial.py#trial.py -input hdfs:///usr/local/testcase.txt -output hdfs:///user/User/tmp/mrjob/trial.User.20190528.113822.387782/output -mapper /bin/sh -ex setup-wrapper.sh python3 trial.py --step-num=0 --mapper -combiner /bin/sh -ex setup-wrapper.sh python3 trial.py --step-num=0 --combiner -reducer /bin/sh -ex setup-wrapper.sh python3 trial.py --step-num=0 --reducer

<bunch of warnings about illegal factoring>

19/05/29 11:11:33 ERROR streaming.StreamJob: Unrecognized option: -ex
Usage: $HADOOP_PREFIX/bin/hadoop jar hadoop-streaming.jar [options]
Options:
  -input          <path> DFS input file(s) for the Map step.
  -output         <path> DFS output directory for the Reduce step.
  -mapper         <cmd|JavaClassName> Optional. Command to be run as mapper.
  -combiner       <cmd|JavaClassName> Optional. Command to be run as combiner.
  -reducer        <cmd|JavaClassName> Optional. Command to be run as reducer.
  -file           <file> Optional. File/dir to be shipped in the Job jar file.
                  Deprecated. Use generic option "-files" instead.
  -inputformat    <TextInputFormat(default)|SequenceFileAsTextInputFormat|JavaClassName>
                  Optional. The input format class.
  -outputformat   <TextOutputFormat(default)|JavaClassName>
                  Optional. The output format class.
  -partitioner    <JavaClassName>  Optional. The partitioner class.
  -numReduceTasks <num> Optional. Number of reduce tasks.
  -inputreader    <spec> Optional. Input recordreader spec.
  -cmdenv         <n>=<v> Optional. Pass env.var to streaming commands.
  -mapdebug       <cmd> Optional. To run this script when a map task fails.
  -reducedebug    <cmd> Optional. To run this script when a reduce task fails.
  -io             <identifier> Optional. Format to use for input to and output
                  from mapper/reducer commands
  -lazyOutput     Optional. Lazily create Output.
  -background     Optional. Submit the job and don't wait till it completes.
  -verbose        Optional. Print verbose output.
  -info           Optional. Print detailed usage.
  -help           Optional. Print help message.

Generic options supported are
-conf <configuration file>     specify an application configuration file
-D <property=value>            use value for given property
-fs <file:///|hdfs://namenode:port> specify default filesystem URL to use, overrides 'fs.defaultFS' property from configurations.
-jt <local|resourcemanager:port>    specify a ResourceManager
-files <comma separated list of files>    specify comma separated files to be copied to the map reduce cluster
-libjars <comma separated list of jars>    specify comma separated jar files to include in the classpath.
-archives <comma separated list of archives>    specify comma separated archives to be unarchived on the compute machines.

The general command line syntax is
command [genericOptions] [commandOptions]


For more details about these options:
Use $HADOOP_PREFIX/bin/hadoop jar hadoop-streaming.jar -info

Try -help for more information
Streaming Command Failed!
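
One caveat about the manual rerun: in the original subprocess call, each -mapper/-combiner/-reducer value is a single list element, so '/bin/sh -ex setup-wrapper.sh ...' reaches the streaming jar as one argument. Pasted into cmd.exe without quotes, that value gets split on spaces, and the jar then sees -ex as a standalone option, which is exactly the error above. A small sketch of the difference:

# Sketch: why the unquoted manual rerun reports "Unrecognized option: -ex".
# subprocess passes each list element as ONE argument; pasting the flattened
# command into cmd.exe without quotes splits the -mapper value on spaces.
import subprocess

args = ['hadoop', 'jar', 'hadoop-streaming-2.8.0.jar',
        '-mapper',
        '/bin/sh -ex setup-wrapper.sh python3 trial.py --step-num=0 --mapper']

# list2cmdline is how Python flattens an argument list on Windows; note the
# quotes it adds around the -mapper value to keep it a single argument.
print(subprocess.list2cmdline(args))

So the -ex error from the hand-run command may be a quoting artifact rather than the same failure the mrjob-driven run hit, although if hadoop.CMD itself re-splits its arguments, the original run could be failing the same way.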

KerenzaDoxolodeo avatar May 29 '19 04:05 KerenzaDoxolodeo

I forgot to mention that I downloaded the jar from mvnrepository.com, if it helps.

KerenzaDoxolodeo avatar Jun 10 '19 02:06 KerenzaDoxolodeo

I just tested with streaming jar 3.1.2, from both Apache and Maven, on Ubuntu. Both return the same error.

KerenzaDoxolodeo avatar Jun 10 '19 06:06 KerenzaDoxolodeo

I have the same problem with mrjob and Hadoop 2.8.0. The program works fine locally (in plain Python), but raises the following error when running on Hadoop. PLEASE HELP!

""" C:\hadoop\bin>python C:/MapReduce.py -r hadoop --hadoop-streaming-jar C:/hadoop-streaming.jar C:/words.txt No configs found; falling back on auto-configuration No configs specified for hadoop runner Looking for hadoop binary in C:\hadoop\bin\bin... Found hadoop binary: .\hadoop.CMD Using Hadoop version 2.8.0 Creating temp directory C:\Users\I\AppData\Local\Temp\MapReduce.I.20220226.181157.856795 uploading working dir files to hdfs:///user/I/tmp/mrjob/MapReduce.I.20220226.181157.856795/files/wd... Copying other local files to hdfs:///user/I/tmp/mrjob/MapReduce.I.20220226.181157.856795/files/ Running step 1 of 1... Found 2 unexpected arguments on the command line [hdfs:///user/I/tmp/mrjob/MapReduce.I.20220226.181157.856795/files/wd/mrjob.zip#mrjob.zip, hdfs:///user/I/tmp/mrjob/MapReduce.I.20220226.181157.856795/files/wd/setup-wrapper.sh#setup-wrapper.sh] Try -help for more information Streaming Command Failed! Attempting to fetch counters from logs... Can't fetch history log; missing job ID No counters found Scanning logs for probable cause of failure... Can't fetch history log; missing job ID Can't fetch task logs; missing application ID Step 1 of 1 failed: Command '['.\hadoop.CMD', 'jar', 'C:/hadoop-streaming.jar', '-files', 'hdfs:///user/I/tmp/mrjob/MapReduce.I.20220226.181157.856795/files/wd/MapReduce.py#MapReduce.py,hdfs:///user/I/tmp/mrjob/MapReduce.I.20220226.181157.856795/files/wd/mrjob.zip#mrjob.zip,hdfs:///user/I/tmp/mrjob/MapReduce.I.20220226.181157.856795/files/wd/setup-wrapper.sh#setup-wrapper.sh', '-input', 'hdfs:///user/I/tmp/mrjob/MapReduce.I.20220226.181157.856795/files/words.txt', '-output', 'hdfs:///user/I/tmp/mrjob/MapReduce.I.20220226.181157.856795/output', '-mapper', '/bin/sh -ex setup-wrapper.sh python3 MapReduce.py --step-num=0 --mapper', '-reducer', '/bin/sh -ex setup-wrapper.sh python3 MapReduce.py --step-num=0 --reducer']' returned non-zero exit status 1. """

amitash1 avatar Feb 26 '22 18:02 amitash1