Error Running 'mongo.pig'
08-12 15:58:30,006 [main] ERROR org.apache.pig.tools.grunt.Grunt - java.lang.IncompatibleClassChangeError: Found class org.apache.hadoop.mapreduce.JobContext, but interface was expected
vagrant@vagrant-ubuntu-trusty-64:~/agiledata/book-code/ch03/pig$ pig -l /tmp -x local -v -w mongo.pig
2014-08-12 15:58:25,838 [main] INFO org.apache.pig.Main - Apache Pig version 0.12.1-SNAPSHOT (r: unknown) compiled Aug 10 2014, 13:09:02
2014-08-12 15:58:25,845 [main] INFO org.apache.pig.Main - Logging error messages to: /tmp/pig_1407859105741.log
2014-08-12 15:58:26,885 [main] INFO org.apache.pig.impl.util.Utils - Default bootup file /home/vagrant/.pigbootup not found
2014-08-12 15:58:26,949 [main] INFO org.apache.pig.tools.parameters.PreprocessorContext - Executing command : echo $HOME\mongo-hadoop
2014-08-12 15:58:27,200 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: file:///
2014-08-12 15:58:29,581 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: UNKNOWN
2014-08-12 15:58:29,729 [main] INFO org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer - {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, DuplicateForEachColumnRewrite, GroupByConstParallelSetter, ImplicitSplitInserter, LimitOptimizer, LoadTypeCastInserter, MergeFilter, MergeForEach, NewPartitionFilterOptimizer, PartitionFilterOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter, StreamTypeCastInserter], RULES_DISABLED=[FilterLogicExpressionSimplifier]}
2014-08-12 15:58:29,812 [main] INFO com.mongodb.hadoop.pig.MongoStorage - checking schema from:chararray,to:chararray,total:long
2014-08-12 15:58:29,817 [main] INFO com.mongodb.hadoop.pig.MongoStorage - Store Location Config: Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml For URI: mongodb://localhost/agile_data.sent_counts
2014-08-12 15:58:30,003 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2998: Unhandled internal error. Found class org.apache.hadoop.mapreduce.JobContext, but interface was expected
2014-08-12 15:58:30,006 [main] ERROR org.apache.pig.tools.grunt.Grunt - java.lang.IncompatibleClassChangeError: Found class org.apache.hadoop.mapreduce.JobContext, but interface was expected
at com.mongodb.hadoop.MongoOutputFormat.checkOutputSpecs(MongoOutputFormat.java:35)
at org.apache.pig.newplan.logical.rules.InputOutputFileValidator$InputOutputFileVisitor.visit(InputOutputFileValidator.java:80)
at org.apache.pig.newplan.logical.relational.LOStore.accept(LOStore.java:66)
at org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:64)
at org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:66)
at org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:66)
at org.apache.pig.newplan.DepthFirstWalker.walk(DepthFirstWalker.java:53)
at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:52)
at org.apache.pig.newplan.logical.rules.InputOutputFileValidator.validate(InputOutputFileValidator.java:45)
at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.compile(HExecutionEngine.java:303)
at org.apache.pig.PigServer.compilePp(PigServer.java:1380)
at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1305)
at org.apache.pig.PigServer.execute(PigServer.java:1297)
at org.apache.pig.PigServer.executeBatch(PigServer.java:375)
at org.apache.pig.PigServer.executeBatch(PigServer.java:353)
at org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:140)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:202)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:173)
at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84)
at org.apache.pig.Main.run(Main.java:607)
at org.apache.pig.Main.main(Main.java:156)
Details also at logfile: /tmp/pig_1407859105741.log
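For context, this `IncompatibleClassChangeError` is the classic Hadoop 1.x vs 2.x mismatch: `org.apache.hadoop.mapreduce.JobContext` is a class in Hadoop 1.x but an interface in Hadoop 2.x, so mongo-hadoop jars compiled against one major version fail exactly like this when run against the other. A sketch of the REGISTER block once the connector jars are rebuilt against the Hadoop version Pig actually runs (jar paths are the ones that turn up later in this thread; adjust to your layout):

```pig
/* All three jars must come from a mongo-hadoop build targeting the SAME
   Hadoop major version Pig runs on; registering mismatched builds
   reproduces the JobContext error above. Paths are from this vagrant box. */
REGISTER /home/vagrant/agiledata/software/lib/mongo-java-driver-2.12.3.jar
REGISTER /home/vagrant/agiledata/software/mongo-hadoop/core/build/libs/mongo-hadoop-core-1.4.0-SNAPSHOT.jar
REGISTER /home/vagrant/agiledata/software/mongo-hadoop/pig/build/libs/mongo-hadoop-pig-1.4.0-SNAPSHOT.jar
```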
Mea Culpa!
I installed Pig 0.12.1:

vagrant@vagrant-ubuntu-trusty-64:~/agiledata/software/pig/bin$ which pig
/home/vagrant/pig-0.12.0/bin/pig
vagrant@vagrant-ubuntu-trusty-64:~/agiledata/software/pig/bin$ pig -version
Apache Pig version 0.12.1-SNAPSHOT (r: unknown) compiled Aug 10 2014, 13:09:02
I installed other components. Maybe I should reprovision the box.
I destroyed and recreated the vagrant box. Here's a list of some of the issues:
- missing lepl (I had to do sudo pip install lepl)
- Avro 1.5.3 is referenced in the Pig script, but version 1.7.7 is installed
- REGISTER jar paths in the sent_counts.pig file don't match the file paths on disk:
/* Set Home Directory - where we install software */
%default HOME `echo \$HOME/Software/`

REGISTER $HOME/pig/build/ivy/lib/Pig/avro-1.5.3.jar
REGISTER $HOME/pig/build/ivy/lib/Pig/json-simple-1.1.jar
REGISTER $HOME/pig/contrib/piggybank/java/piggybank.jar
vagrant@vagrant-ubuntu-trusty-64:~/agiledata/software$ find . -name 'avro*.jar' -ls
269088 428 -rwxr--r-- 1 vagrant vagrant 436302 Aug 12 20:50 ./lib/avro-1.7.7.jar
vagrant@vagrant-ubuntu-trusty-64:~/agiledata/software$ find . -name json-simple-1.1.jar -ls
vagrant@vagrant-ubuntu-trusty-64:~/agiledata/software$ find . -name 'json-simple*.jar' -ls
269089 24 -rwxr--r-- 1 vagrant vagrant 23737 Aug 12 20:50 ./lib/json-simple-1.1.1.jar
vagrant@vagrant-ubuntu-trusty-64:~/agiledata/software$ find . -name piggybank.jar -ls
274796 0 lrwxrwxrwx 1 vagrant vagrant 80 Aug 13 15:47 ./lib/piggybank.jar -> /home/vagrant/agiledata/software/pig-0.11.1/contrib/piggybank/java/piggybank.jar
270150 344 -rwxr--r-- 1 vagrant vagrant 351457 Mar 22 2013 ./pig-0.11.1/contrib/piggybank/java/piggybank.jar
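One possible fix, assuming the installed 1.7.7/1.1.1 jars are drop-in replacements for the versions the book script names (an assumption, not verified; Avro's Pig support changed between releases), is to point the REGISTER lines at the jars the find commands actually located:

```pig
/* REGISTER lines adjusted to the jar locations found above.
   Whether these newer versions behave identically to the ones the
   script expects is an assumption, not verified. */
REGISTER /home/vagrant/agiledata/software/lib/avro-1.7.7.jar
REGISTER /home/vagrant/agiledata/software/lib/json-simple-1.1.1.jar
REGISTER /home/vagrant/agiledata/software/lib/piggybank.jar
```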
- Path and version discrepancies:
cat mongo.pig
/* Set Home Directory - where we install software */
%default HOME `echo \$HOME/Software/`

REGISTER $HOME/mongo-hadoop/mongo-2.10.1.jar
REGISTER $HOME/mongo-hadoop/core/target/mongo-hadoop-core-1.1.0-SNAPSHOT.jar
REGISTER $HOME/mongo-hadoop/pig/target/mongo-hadoop-pig-1.1.0-SNAPSHOT.jar
Files:
269090 580 -rwxr--r-- 1 vagrant vagrant 590996 Aug 12 20:50 /home/vagrant/agiledata/software/lib/mongo-java-driver-2.12.3.jar
656099 712 -rwxr--r-- 1 vagrant vagrant 725577 Aug 12 22:26 /home/vagrant/agiledata/software/mongo-hadoop/pig/build/libs/mongo-hadoop-pig-1.4.0-SNAPSHOT.jar
525303 104 -rwxr--r-- 1 vagrant vagrant 106206 Aug 12 22:20 /home/vagrant/agiledata/software/mongo-hadoop/core/build/libs/mongo-hadoop-core-1.4.0-SNAPSHOT.jar
In grunt, the error happens on the STORE line:
vagrant@vagrant-ubuntu-trusty-64:~/agiledata/book-code/ch03/pig$ pig -x local
2014-08-14 22:44:40,741 [main] INFO org.apache.pig.Main - Apache Pig version 0.11.1 (r1459641) compiled Mar 22 2013, 02:13:53
2014-08-14 22:44:40,745 [main] INFO org.apache.pig.Main - Logging error messages to: /home/vagrant/agiledata/book-code/ch03/pig/pig_1408056280683.log
2014-08-14 22:44:40,852 [main] INFO org.apache.pig.impl.util.Utils - Default bootup file /home/vagrant/.pigbootup not found
2014-08-14 22:44:41,420 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: file:///
grunt> REGISTER /home/vagrant/agiledata/software/lib/mongo-java-driver-2.12.3.jar
grunt> REGISTER /home/vagrant/agiledata/software/mongo-hadoop/core/build/libs/mongo-hadoop-core-1.4.0-SNAPSHOT.jar
grunt> REGISTER /home/vagrant/agiledata/software/mongo-hadoop/pig/build/libs/mongo-hadoop-pig-1.4.0-SNAPSHOT.jar
grunt> set mapred.map.tasks.speculative.execution false
grunt> set mapred.reduce.tasks.speculative.execution false
grunt> sent_counts = LOAD '/tmp/sent_counts.txt' AS (from:chararray, to:chararray, total:long);
grunt> STORE sent_counts INTO 'mongodb://localhost/agile_data.sent_counts' USING com.mongodb.hadoop.pig.MongoStorage();
2014-08-14 22:46:13,466 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: UNKNOWN
2014-08-14 22:46:13,644 [main] INFO com.mongodb.hadoop.pig.MongoStorage - checking schema from:chararray,to:chararray,total:long
2014-08-14 22:46:13,647 [main] INFO com.mongodb.hadoop.pig.MongoStorage - Store Location Config: Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml For URI: mongodb://localhost/agile_data.sent_counts
2014-08-14 22:46:13,778 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2998: Unhandled internal error. Found class org.apache.hadoop.mapreduce.JobContext, but interface was expected
Details at logfile: /home/vagrant/agiledata/book-code/ch03/pig/pig_1408056280683.log
grunt>
Mongodb appears to be fine:
vagrant@vagrant-ubuntu-trusty-64:~/agiledata/software/mongodb-linux-x86_64-2.6.3/bin$ ./mongo
MongoDB shell version: 2.6.3
connecting to: test
> help
	db.help()                    help on db methods
	db.mycoll.help()             help on collection methods
	sh.help()                    sharding helpers
	rs.help()                    replica set helpers
	help admin                   administrative help
	help connect                 connecting to a db help
	help keys                    key shortcuts
	help misc                    misc things to know
	help mr                      mapreduce
show dbs show database names
show collections show collections in current database
show users show users in current database
show profile show most recent system.profile entries with time >= 1ms
show logs show the accessible logger names
show log [name] prints out the last segment of log in memory, 'global' is default
use <db_name> set current database
db.foo.find() list objects in collection foo
db.foo.find( { a : 1 } ) list objects in foo where a == 1
it result of the last line evaluated; use to further iterate
DBQuery.shellBatchSize = x set default number of items to display on shell
exit quit the mongo shell
> use agile_data
switched to db agile_data
> e = {from: '[email protected]', to: '[email protected]', subject: 'Grass seed', body: 'Put grass on the lawn...')
2014-08-14T22:51:02.543+0000 SyntaxError: Unexpected token )
> db.email.save(e)
2014-08-14T22:51:10.553+0000 ReferenceError: e is not defined
> e
2014-08-14T22:51:15.565+0000 ReferenceError: e is not defined
> e = {from: '[email protected]', to: '[email protected]', subject: 'Grass seed', body: 'Put grass on the lawn...')
2014-08-14T22:51:30.504+0000 SyntaxError: Unexpected token )
> e = {from: '[email protected]', to: '[email protected]', subject: 'Grass seed', body: 'Put grass on the lawn...'}
{ "from" : "[email protected]", "to" : "[email protected]", "subject" : "Grass seed", "body" : "Put grass on the lawn..." }
> db.email.save(e)
WriteResult({ "nInserted" : 1 })
> db.email.find()
{ "_id" : ObjectId("53ed3d872946d90a123dce10"), "from" : "[email protected]", "to" : "[email protected]", "subject" : "Grass seed", "body" : "Put grass on the lawn..." }
Yikes, I'm having a heck of a work week, and am just now going through personal emails for the past few days. If you've figured stuff out and want to send a pull request, I'll gladly accept. It's probably going to be the weekend at the earliest before I can really look at anything.
— Reply to this email directly or view it on GitHub https://github.com/charlesflynn/agiledata/issues/14#issuecomment-52256031 .
It's a work in progress. After getting stuck in the Vagrant box, I tried to get things working on my Mac; now I'm stuck there too, with what looks like lots of version incompatibilities.
Any idea what this means?
grunt> avros = load '$avros' using AvroStorage();
2014-08-14 23:50:09,478 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: Pig script failed to parse:
<line 3, column 8> pig script failed to validate: java.lang.RuntimeException: could not instantiate 'org.apache.pig.piggybank.storage.avro.AvroStorage' with arguments 'null'
Details at logfile: /home/vagrant/agiledata/book-code/ch05/pig_1408060042628.log
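For what it's worth, "could not instantiate" (as opposed to "could not resolve") usually means the AvroStorage class was found but its constructor blew up, which is often a missing dependency jar (avro, json-simple) rather than a bad script line. A sketch of the registrations a grunt session would need first, using the jar locations found earlier on this box (whether these versions cooperate is an assumption, not verified):

```pig
/* Register AvroStorage's dependencies before defining/instantiating it.
   Paths are the ones found on this vagrant box; the version combination
   is untested. */
REGISTER /home/vagrant/agiledata/software/lib/avro-1.7.7.jar
REGISTER /home/vagrant/agiledata/software/lib/json-simple-1.1.1.jar
REGISTER /home/vagrant/agiledata/software/lib/piggybank.jar
define AvroStorage org.apache.pig.piggybank.storage.avro.AvroStorage();

avros = load '/tmp/enron.avro' using AvroStorage(); /* file path is illustrative */
```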
THANKS FOR CHECKING IN!!!
Hmm, is the script still referring to the older Avro version? If so, that might be the cause.
Also remember that most of the pig scripts are from Russell Jurney's repo, and he may or may not want to update versions. What you can do is update those to match versions until everything is working. Then you can send him a pull request, or just paste the diff output into a ticket for me and I'll do it.
This is a chapter 5 example.
(Is this the older Avro version? If yes, what’s the correct version?)
vagrant@vagrant-ubuntu-trusty-64:~/agiledata/book-code/ch05$ cat avro_to_mongo.pig
/* Set Home Directory - where we install software */
%default HOME `echo \$HOME/Software/`

/* Load Avro jars and define shortcut */
REGISTER $HOME/pig/build/ivy/lib/Pig/avro-1.5.3.jar
REGISTER $HOME/pig/build/ivy/lib/Pig/json-simple-1.1.jar
REGISTER $HOME/pig/contrib/piggybank/java/piggybank.jar
define AvroStorage org.apache.pig.piggybank.storage.avro.AvroStorage();

/* MongoDB libraries and configuration */
REGISTER $HOME/mongo-hadoop/mongo-2.10.1.jar
REGISTER $HOME/mongo-hadoop/core/target/mongo-hadoop-core-1.1.0-SNAPSHOT.jar
REGISTER $HOME/mongo-hadoop/pig/target/mongo-hadoop-pig-1.1.0-SNAPSHOT.jar

/* Set speculative execution off so we don't have the chance of duplicate records in Mongo */
set mapred.map.tasks.speculative.execution false
set mapred.reduce.tasks.speculative.execution false

define MongoStorage com.mongodb.hadoop.pig.MongoStorage(); /* Shortcut */

avros = load '$avros' using AvroStorage(); /* For example, 'enron.avro' */
store avros into '$mongourl' using MongoStorage(); /* For example, 'mongodb://localhost/enron.emails' */
vagrant@vagrant-ubuntu-trusty-64:~/agiledata/book-code/ch05$
Chapter 3 has other issues getting data into Mongodb. This can certainly wait until the weekend …
Also, what’s the correct version of Hadoop with Pig 0.11 (from OSX Mavericks)?
David-Laxers-MacBook-Pro:/ davidlaxer$ hadoop version
Hadoop 0.21.0
Subversion https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.21 -r 985326
Compiled by tomwhite on Tue Aug 17 01:02:28 EDT 2010
From source with checksum a1aeb15b4854808d152989ba76f90fac
David-Laxers-MacBook-Pro:/ davidlaxer$
(Is Russell around?)
Compatibility for 0.11.1 is at http://pig.apache.org/docs/r0.11.1/start.html#Pig+Setup
Keep track (or just open tickets) if you need to upgrade other components to get total version harmony.
On Thu, Aug 14, 2014 at 7:12 PM, dbl001 [email protected] wrote:
This is a chapter 5 example.
(Is this the older Avro version? If yes, what’s the correct version?)
vagrant@vagrant-ubuntu-trusty-64:~/agiledata/book-code/ch05$ cat avro_to_mongo.pig /* Set Home Directory - where we install software */ %default HOME
echo \$HOME/Software//* Load Avro jars and define shortcut */ REGISTER $HOME/pig/build/ivy/lib/Pig/avro-1.5.3.jar REGISTER $HOME/pig/build/ivy/lib/Pig/json-simple-1.1.jar REGISTER $HOME/pig/contrib/piggybank/java/piggybank.jar define AvroStorage org.apache.pig.piggybank.storage.avro.AvroStorage();
/* MongoDB libraries and configuration */ REGISTER $HOME/mongo-hadoop/mongo-2.10.1.jar REGISTER $HOME/mongo-hadoop/core/target/mongo-hadoop-core-1.1.0-SNAPSHOT.jar REGISTER $HOME/mongo-hadoop/pig/target/mongo-hadoop-pig-1.1.0-SNAPSHOT.jar
set mapred.map.tasks.speculative.execution false set mapred.reduce.tasks.speculative.execution false
/* Set speculative execution off so we don't have the chance of duplicate records in Mongo / set mapred.map.tasks.speculative.execution false set mapred.reduce.tasks.speculative.execution false define MongoStorage com.mongodb.hadoop.pig.MongoStorage(); / Shortcut */
avros = load '$avros' using AvroStorage(); /* For example, 'enron.avro' / store avros into '$mongourl' using MongoStorage(); / For example, 'mongodb://localhost/enron.emails' */ vagrant@vagrant-ubuntu-trusty-64:~/agiledata/book-code/ch05$
Chapter 3 has other issues getting data into Mongodb. This can certainly wait until the weekend …
Also, what’s the correct version of Hadoop with Pig 0.11 (from OSX Mavericks)? David-Laxers-MacBook-Pro:/ davidlaxer$ hadoop version Hadoop 0.21.0 Subversion https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.21 -r 985326 Compiled by tomwhite on Tue Aug 17 01:02:28 EDT 2010 From source with checksum a1aeb15b4854808d152989ba76f90fac David-Laxers-MacBook-Pro:/ davidlaxer$
(Is Russell around?) On Aug 14, 2014, at 6:03 PM, Charles Flynn [email protected] wrote:
Hmm is the script still referring to the older Avro version? Might be that if so.
Also remember that most of the pig scripts are from Russell Jurney's repo, and he may or may not want to update versions. What you can do is update those to match versions until everything is working. Then you can send him a pull request, or just paste the diff output into a ticket for me and I'll do it.
On Thu, Aug 14, 2014 at 6:58 PM, dbl001 [email protected] wrote:
It’s a work in progress. I tried to get things working on my Mac when I got stuck in the Vagrant box. I’m stuck on the Mac with what looks like lots of version incompatibilities.
Any idea what this means?
grunt> avros = load '$avros' using AvroStorage();
2014-08-14 23:50:09,478 [main] ERROR org.apache.pig.tools.grunt.Grunt
ERROR 1200: Pig script failed to parse: <line 3, column 8> pig script failed to validate: java.lang.RuntimeException: could not instantiate 'org.apache.pig.piggybank.storage.avro.AvroStorage' with arguments 'null' Details at logfile: /home/vagrant/agiledata/book-code/ch05/pig_1408060042628.log
THANKS FOR CHECKING IN!!!
On Aug 14, 2014, at 5:52 PM, Charles Flynn [email protected] wrote:
Yikes, I'm having a heck of a work week, and am just now going through personal emails for the past few days. If you've figured stuff out and want to send a pull request, I'll gladly accept. It's probably going to be the weekend at the earliest before I can really look at anything.
On Thu, Aug 14, 2014 at 5:58 PM, dbl001 [email protected] wrote:
Mongodb appears to be fine:
vagrant@vagrant-ubuntu-trusty-64:~/agiledata/software/mongodb-linux-x86_64-2.6.3/bin$
./mongo MongoDB shell version: 2.6.3 connecting to: test
help db.help() help on db methods db.mycoll.help() help on collection methods sh.help() sharding helpers rs.help() replica set helpers help admin administrative help help connect connecting to a db help help keys key shortcuts help misc misc things to know help mr mapreduce
show dbs show database names show collections show collections in current database show users show users in current database show profile show most recent system.profile entries with time >= 1ms show logs show the accessible logger names show log [name] prints out the last segment of log in memory, 'global' is default use <db_name> set current database db.foo.find() list objects in collection foo db.foo.find( { a : 1 } ) list objects in foo where a == 1 it result of the last line evaluated; use to further iterate DBQuery.shellBatchSize = x set default number of items to display on shell exit quit the mongo shell
use agile_data switched to db agile_data e = {from: '[email protected]', to: '[email protected]',
subject: 'Grass seed', body: 'Put grass on the lawn...') 2014-08-14T22:51:02.543+0000 SyntaxError: Unexpected token ) db.email.save(e) 2014-08-14T22:51:10.553+0000 ReferenceError: e is not defined e 2014-08-14T22:51:15.565+0000 ReferenceError: e is not defined e = {from: '[email protected]', to: '[email protected]',
subject: 'Grass seed', body: 'Put grass on the lawn...') 2014-08-14T22:51:30.504+0000 SyntaxError: Unexpected token ) e = {from: '[email protected]', to: '[email protected]',
subject: 'Grass seed', body: 'Put grass on the lawn...'} { "from" : "[email protected]", "to" : "[email protected]", "subject" : "Grass seed", "body" : "Put grass on the lawn..." } db.email.save(e) WriteResult({ "nInserted" : 1 }) db.email.find() { "_id" : ObjectId("53ed3d872946d90a123dce10"), "from" : " [email protected]", "to" : "[email protected]", "subject" : "Grass seed", "body" : "Put grass on the lawn..." }
I don't represent Russell at all, I'm just some guy that read his early-release book and built this vagrant box while going through it :-)
Understood! :-) Thanks for your help!
(If you do not set HADOOP_HOME, by default Pig will run with the embedded version, currently Hadoop 1.0.0.) I tried unsetting HADOOP_HOME on OSX Mavericks 10.9.4 (so that Pig uses the embedded version). It still fails with an exception. Others have reported this issue:
http://stackoverflow.com/questions/15609484/apache-pig-unable-to-run-my-own-pig-jar-and-pig-withouthadoop-jar
I tried the following, but it didn’t help:
$ ant clean jar-withouthadoop -Dhadoopversion=23
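For what it’s worth, "Found class org.apache.hadoop.mapreduce.JobContext, but interface was expected" is the classic symptom of Hadoop 1.x and 2.x jars mixed on one classpath: JobContext is a class in Hadoop 1 and an interface in Hadoop 2, so code compiled against one generation breaks when run with the other. A rough sketch of a check for that mix (the jar names and the scratch directory below are made-up stand-ins; on a real install you would point `find` at $PIG_HOME and $HADOOP_HOME):

```shell
# Sketch only: fabricate a lib directory containing one jar from each
# Hadoop generation, then detect the conflict the way you would on a
# real classpath.
workdir=$(mktemp -d)
touch "$workdir/hadoop-core-1.0.0.jar"     # Hadoop 1.x style artifact (what Pig embeds)
touch "$workdir/hadoop-common-2.2.0.jar"   # Hadoop 2.x style artifact
h1=$(find "$workdir" -name 'hadoop-core-1.*.jar' | wc -l)
h2=$(find "$workdir" -name 'hadoop-common-2.*.jar' | wc -l)
# If both generations are present, JobContext class-vs-interface errors follow.
if [ "$h1" -gt 0 ] && [ "$h2" -gt 0 ]; then
  echo "both Hadoop generations on the classpath: expect JobContext errors"
fi
rm -rf "$workdir"
```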
~/pig-0.11.1-src/bin/pig -l /tmp -x local -v -w sent_counts.pig
2014-08-14 18:33:25,763 [main] INFO org.apache.pig.Main - Apache Pig version 0.11.2-SNAPSHOT (rUnversioned directory) compiled Aug 14 2014, 15:31:09
2014-08-14 18:33:25,765 [main] INFO org.apache.pig.Main - Logging error messages to: /private/tmp/pig_1408062805742.log
2014-08-14 18:33:26,043 [main] INFO org.apache.hadoop.security.Groups - Group mapping impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping; cacheTimeout=300000
2014-08-14 18:33:26,132 [main] INFO org.apache.pig.impl.util.Utils - Default bootup file /Users/davidlaxer/.pigbootup not found
2014-08-14 18:33:26,191 [main] INFO org.apache.pig.tools.parameters.PreprocessorContext - Executing command : echo $HOME
2014-08-14 18:33:26,572 [main] WARN org.apache.hadoop.conf.Configuration - mapred.used.genericoptionsparser is deprecated. Instead, use mapreduce.client.genericoptionsparser.used
2014-08-14 18:33:26,574 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: file:///
2014-08-14 18:33:26,606 [main] WARN org.apache.hadoop.conf.Configuration - fs.default.name is deprecated. Instead, use fs.defaultFS
2014-08-14 18:33:26,607 [main] WARN org.apache.hadoop.conf.Configuration - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2014-08-14 18:33:26,645 [main] WARN org.apache.hadoop.conf.Configuration - fs.default.name is deprecated. Instead, use fs.defaultFS
2014-08-14 18:33:26,646 [main] WARN org.apache.hadoop.conf.Configuration - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2014-08-14 18:33:26,750 [main] WARN org.apache.hadoop.conf.Configuration - fs.default.name is deprecated. Instead, use fs.defaultFS
2014-08-14 18:33:26,751 [main] WARN org.apache.hadoop.conf.Configuration - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2014-08-14 18:33:26,787 [main] WARN org.apache.hadoop.conf.Configuration - fs.default.name is deprecated. Instead, use fs.defaultFS
2014-08-14 18:33:26,788 [main] WARN org.apache.hadoop.conf.Configuration - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2014-08-14 18:33:26,851 [main] WARN org.apache.hadoop.conf.Configuration - fs.default.name is deprecated. Instead, use fs.defaultFS
2014-08-14 18:33:26,852 [main] WARN org.apache.hadoop.conf.Configuration - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2014-08-14 18:33:26,917 [main] WARN org.apache.hadoop.conf.Configuration - fs.default.name is deprecated. Instead, use fs.defaultFS
2014-08-14 18:33:26,919 [main] WARN org.apache.hadoop.conf.Configuration - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2014-08-14 18:33:27,963 [main] WARN org.apache.hadoop.conf.Configuration - fs.default.name is deprecated. Instead, use fs.defaultFS
2014-08-14 18:33:27,966 [main] WARN org.apache.hadoop.conf.Configuration - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2014-08-14 18:33:28,035 [main] INFO org.apache.hadoop.metrics.jvm.JvmMetrics - Initializing JVM Metrics with processName=JobTracker, sessionId=
2014-08-14 18:33:28,611 [main] WARN org.apache.hadoop.conf.Configuration - fs.default.name is deprecated. Instead, use fs.defaultFS
2014-08-14 18:33:28,613 [main] WARN org.apache.hadoop.conf.Configuration - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2014-08-14 18:33:28,620 [main] INFO org.apache.hadoop.metrics.jvm.JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
2014-08-14 18:33:28,785 [main] WARN org.apache.hadoop.conf.Configuration - fs.default.name is deprecated. Instead, use fs.defaultFS
2014-08-14 18:33:28,785 [main] WARN org.apache.hadoop.conf.Configuration - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2014-08-14 18:33:28,793 [main] INFO org.apache.hadoop.metrics.jvm.JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
2014-08-14 18:33:28,884 [main] WARN org.apache.hadoop.conf.Configuration - fs.default.name is deprecated. Instead, use fs.defaultFS
2014-08-14 18:33:28,884 [main] WARN org.apache.hadoop.conf.Configuration - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2014-08-14 18:33:28,885 [main] INFO org.apache.hadoop.metrics.jvm.JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
2014-08-14 18:33:29,027 [main] WARN org.apache.hadoop.conf.Configuration - fs.default.name is deprecated. Instead, use fs.defaultFS
2014-08-14 18:33:29,027 [main] WARN org.apache.hadoop.conf.Configuration - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2014-08-14 18:33:29,028 [main] INFO org.apache.hadoop.metrics.jvm.JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
2014-08-14 18:33:29,085 [main] WARN org.apache.hadoop.conf.Configuration - fs.default.name is deprecated. Instead, use fs.defaultFS
2014-08-14 18:33:29,085 [main] WARN org.apache.hadoop.conf.Configuration - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2014-08-14 18:33:29,086 [main] INFO org.apache.hadoop.metrics.jvm.JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
2014-08-14 18:33:29,194 [main] WARN org.apache.hadoop.conf.Configuration - fs.default.name is deprecated. Instead, use fs.defaultFS
2014-08-14 18:33:29,194 [main] WARN org.apache.hadoop.conf.Configuration - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2014-08-14 18:33:29,195 [main] INFO org.apache.hadoop.metrics.jvm.JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
2014-08-14 18:33:29,260 [main] WARN org.apache.hadoop.conf.Configuration - fs.default.name is deprecated. Instead, use fs.defaultFS
2014-08-14 18:33:29,260 [main] WARN org.apache.hadoop.conf.Configuration - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2014-08-14 18:33:29,261 [main] INFO org.apache.hadoop.metrics.jvm.JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
2014-08-14 18:33:29,346 [main] WARN org.apache.hadoop.conf.Configuration - fs.default.name is deprecated. Instead, use fs.defaultFS
2014-08-14 18:33:29,348 [main] WARN org.apache.hadoop.conf.Configuration - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2014-08-14 18:33:29,350 [main] INFO org.apache.hadoop.metrics.jvm.JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
2014-08-14 18:33:29,474 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: GROUP_BY,ORDER_BY,FILTER
2014-08-14 18:33:29,586 [main] INFO org.apache.hadoop.metrics.jvm.JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
2014-08-14 18:33:29,617 [main] INFO org.apache.pig.newplan.logical.rules.ColumnPruneVisitor - Columns pruned for messages: $0, $1, $2, $3, $4, $5, $8, $9, $10
2014-08-14 18:33:29,641 [main] WARN org.apache.hadoop.conf.Configuration - fs.default.name is deprecated. Instead, use fs.defaultFS
2014-08-14 18:33:29,641 [main] WARN org.apache.hadoop.conf.Configuration - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2014-08-14 18:33:29,642 [main] INFO org.apache.hadoop.metrics.jvm.JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
2014-08-14 18:33:29,642 [main] WARN org.apache.hadoop.conf.Configuration - mapred.textoutputformat.separator is deprecated. Instead, use mapreduce.output.textoutputformat.separator
2014-08-14 18:33:29,922 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
2014-08-14 18:33:30,113 [main] WARN org.apache.hadoop.conf.Configuration - fs.default.name is deprecated. Instead, use fs.defaultFS
2014-08-14 18:33:30,113 [main] WARN org.apache.hadoop.conf.Configuration - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2014-08-14 18:33:30,123 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.CombinerOptimizer - Choosing to move algebraic foreach to combiner
2014-08-14 18:33:30,203 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 3
2014-08-14 18:33:30,203 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 3
2014-08-14 18:33:30,212 [main] WARN org.apache.hadoop.conf.Configuration - fs.default.name is deprecated. Instead, use fs.defaultFS
2014-08-14 18:33:30,212 [main] WARN org.apache.hadoop.conf.Configuration - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2014-08-14 18:33:30,222 [main] INFO org.apache.hadoop.metrics.jvm.JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
2014-08-14 18:33:30,261 [main] WARN org.apache.pig.backend.hadoop20.PigJobControl - falling back to default JobControl (not using hadoop 0.20 ?)
java.lang.NoSuchFieldException: runnerState
at java.lang.Class.getDeclaredField(Class.java:2057)
at org.apache.pig.backend.hadoop20.PigJobControl.<init>(PigJobControl.java:51)
at org.apache.pig.backend.hadoop.executionengine.shims.HadoopShims.newJobControl(HadoopShims.java:97)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.compile(JobControlCompiler.java:285)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:177)
at org.apache.pig.PigServer.launchPlan(PigServer.java:1264)
at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1249)
at org.apache.pig.PigServer.execute(PigServer.java:1239)
at org.apache.pig.PigServer.executeBatch(PigServer.java:333)
at org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:137)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:198)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:170)
at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84)
at org.apache.pig.Main.run(Main.java:604)
at org.apache.pig.Main.main(Main.java:157)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at org.apache.hadoop.util.RunJar.main(RunJar.java:192)
2014-08-14 18:33:30,287 [main] INFO org.apache.hadoop.metrics.jvm.JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
2014-08-14 18:33:30,287 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job
2014-08-14 18:33:30,357 [main] WARN org.apache.hadoop.conf.Configuration - mapred.job.reduce.markreset.buffer.percent is deprecated. Instead, use mapreduce.reduce.markreset.buffer.percent
2014-08-14 18:33:30,357 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2014-08-14 18:33:30,357 [main] WARN org.apache.hadoop.conf.Configuration - mapred.output.compress is deprecated. Instead, use mapreduce.output.fileoutputformat.compress
2014-08-14 18:33:30,358 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Using reducer estimator: org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.InputSizeReducerEstimator
2014-08-14 18:33:30,360 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.InputSizeReducerEstimator - BytesPerReducer=1000000000 maxReducers=999 totalInputFileSize=23891952
2014-08-14 18:33:30,361 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting Parallelism to 1
2014-08-14 18:33:30,361 [main] WARN org.apache.hadoop.conf.Configuration - mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
2014-08-14 18:33:30,429 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
2014-08-14 18:33:30,451 [main] INFO org.apache.pig.data.SchemaTupleFrontend - Key [pig.schematuple] is false, will not generate code.
2014-08-14 18:33:30,451 [main] INFO org.apache.pig.data.SchemaTupleFrontend - Starting process to move generated code to distributed cacche
2014-08-14 18:33:30,451 [main] INFO org.apache.pig.data.SchemaTupleFrontend - Distributed cache not supported or needed in local mode. Setting key [pig.schematuple.local.dir] with code temp directory: /var/folders/nj/nphdkhyj6s1dttb0pd9zb2wc0000gn/T/1408062810451-0
2014-08-14 18:33:30,695 [main] INFO org.apache.hadoop.metrics.jvm.JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
2014-08-14 18:33:30,700 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2998: Unhandled internal error. org.apache.hadoop.mapred.jobcontrol.JobControl.addJob(Lorg/apache/hadoop/mapred/jobcontrol/Job;)Ljava/lang/String;
2014-08-14 18:33:30,700 [main] ERROR org.apache.pig.tools.grunt.Grunt - java.lang.NoSuchMethodError: org.apache.hadoop.mapred.jobcontrol.JobControl.addJob(Lorg/apache/hadoop/mapred/jobcontrol/Job;)Ljava/lang/String;
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.compile(JobControlCompiler.java:296)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:177)
at org.apache.pig.PigServer.launchPlan(PigServer.java:1264)
at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1249)
at org.apache.pig.PigServer.execute(PigServer.java:1239)
at org.apache.pig.PigServer.executeBatch(PigServer.java:333)
at org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:137)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:198)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:170)
at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84)
at org.apache.pig.Main.run(Main.java:604)
at org.apache.pig.Main.main(Main.java:157)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at org.apache.hadoop.util.RunJar.main(RunJar.java:192)
Details also at logfile: /private/tmp/pig_1408062805742.log
David-Laxers-MacBook-Pro:pig davidlaxer$
Hi Charles,
Did you have a chance to investigate the issues in #14?
Thanks in advance.
Best, -Dave
Sorry for not addressing this sooner. I'm going to prepare a Docker image for you to use with the book.