SequenceFile with VectorWritable

Open drahcos opened this issue 10 years ago • 15 comments

Hi, I am trying to extract entries from a tfidf-SequenceFile that I created with seq2sparse. I can read and extract the content, but I need to create a new SequenceFile from the entries I extracted. The value needs to be of type VectorWritable (as in the seq2sparse tfidf output). I tried your SequenceFileStorage with '-c com.twitter.elephantbird.pig.mahout.VectorWritableConverter' as the second parameter, but the output always uses the Text class instead. Is there some way to handle this?

Regards, Richard

drahcos avatar May 07 '14 06:05 drahcos

can you post the actual pig statement?

rangadi avatar May 07 '14 22:05 rangadi

STORE newFiles INTO 'new-Vectors' USING SequenceFileStorage('-c com.twitter.elephantbird.pig.util.TextConverter', '-c com.twitter.elephantbird.pig.util.VectorWritableConverter');

drahcos avatar May 08 '14 08:05 drahcos

While using the SequenceFileLoader (I didn't use it before because I can read the entries with seqdumper), I stored some entries with PigStorage to get a sample. I noticed that the loader doesn't get the values, only the keys. It definitely stores two things, since I can see a tab right after the key, but the values are empty. And yes, I noticed that I had the wrong path for VectorWritableConverter, but after changing it the problem remains.

drahcos avatar May 08 '14 11:05 drahcos

Could you also give us the schema of your newFiles relation?

DESCRIBE newFiles;

sagemintblue avatar May 08 '14 15:05 sagemintblue

DESCRIBE newFiles;
newFiles: {key: chararray,dbFiles::value: chararray}

This is without the SequenceFileLoader.

Btw., I changed SequenceFileStorage to always choose VectorWritable (job.setOutputValueClass(VectorWritable.class);), which works, but I need to load it as VectorWritable and I didn't find a way to do this. Could you tell me where the SequenceFileLoader gets this information? I mean the place where I can directly enter "VectorWritable.class", like I did in SequenceFileStorage, so I can force it.

drahcos avatar May 08 '14 16:05 drahcos

With the SequenceFileLoader:

newFiles: {key: chararray,dbFiles::value: chararray}

dbFiles is what I load. newFiles is the result of a join with some keys.

drahcos avatar May 08 '14 16:05 drahcos

Isn't there anything I can do? I really just need to load a tfidf-sequencefile, compare the keys and store some of the entries into a new tfidf-sequencefile. I don't need to manipulate the vectors or anything. Please tell me if there is a way to hardcode it or something else I can do. I'm in real need of this data.

drahcos avatar May 08 '14 20:05 drahcos

If you don't need to do anything in pig with the vector data, please try out GenericWritableConverter:

https://github.com/kevinweil/elephant-bird/blob/master/pig/src/main/java/com/twitter/elephantbird/pig/util/GenericWritableConverter.java

sagemintblue avatar May 08 '14 20:05 sagemintblue

I'm sorry but that also didn't work. It seems like SequenceFileLoader doesn't accept my input since I only get the standard Text class. Is this input correct?

dbFiles = LOAD 'ready-Vectors/tfidf-vectors' USING com.twitter.elephantbird.pig.load.SequenceFileLoader('-c com.twitter.elephantbird.pig.util.TextConverter', '-c com.twitter.elephantbird.pig.util.GenericWritableConverter') AS (key: chararray, value);

drahcos avatar May 08 '14 22:05 drahcos

You may be missing REGISTER statements. All supporting jars must be included in the job, otherwise you'll run into class-not-found errors at runtime.

sagemintblue avatar May 08 '14 22:05 sagemintblue

I registered them all:

REGISTER $ELEPH_LIBS/elephant-bird-core-4.4.jar
REGISTER $ELEPH_LIBS/elephant-bird-cascading2-4.4.jar
REGISTER $ELEPH_LIBS/elephant-bird-lucene-4.4.jar
REGISTER $ELEPH_LIBS/elephant-bird-hive-4.4.jar
REGISTER $ELEPH_LIBS/elephant-bird-hadoop-compat-4.4.jar
REGISTER $ELEPH_LIBS/elephant-bird-rcfile-4.4.jar
REGISTER $ELEPH_LIBS/elephant-bird-mahout-4.4.jar
REGISTER $ELEPH_LIBS/elephant-bird-examples-4.4.jar
REGISTER $ELEPH_LIBS/elephant-bird-pig-lucene-4.4.jar
REGISTER $ELEPH_LIBS/elephant-bird-crunch-4.4.jar

even the ones I don't need. Do you know where in the code SequenceFileLoader sets the classes to load? I already hardcoded VectorWritable for the store function and I know it worked because when I used the hardcoded version I get:

Backend error message

java.io.IOException: java.io.IOException: wrong value class: org.apache.hadoop.io.Text is not class org.apache.mahout.math.VectorWritable
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.runPipeline(PigGenericMapReduce.java:470)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.processOnePackageOutput(PigGenericMapReduce.java:433)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.reduce(PigGenericMapReduce.java:405)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.reduce(PigGenericMapReduce.java:257)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:164)
    at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:610)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:444)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Su

Pig Stack Trace

ERROR 2997: Encountered IOException. java.io.IOException: wrong value class: org.apache.hadoop.io.Text is not class org.apache.mahout.math.VectorWritable

java.io.IOException: java.io.IOException: wrong value class: org.apache.hadoop.io.Text is not class org.apache.mahout.math.VectorWritable
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.runPipeline(PigGenericMapReduce.java:470)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.processOnePackageOutput(PigGenericMapReduce.java:433)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.reduce(PigGenericMapReduce.java:405)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.reduce(PigGenericMapReduce.java:257)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:164)
    at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:610)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:444)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
    at java.security.AccessController.doPrivileged(Native Method)

When I hardcode it to the Text class I get a perfect sequence file but of course with Text for the value.
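For anyone reading along who hits the same exception: the wording of the message suggests that the writer compares the runtime class of every value against the value class declared for the output, and rejects anything that does not match exactly. Below is a minimal Java sketch of that kind of check. It is an illustration only, not Hadoop's or elephant-bird's actual code: `TypedWriterSketch` and its methods are hypothetical names, and `Integer` and `String` merely stand in for `VectorWritable` and `Text`.

```java
import java.io.IOException;

// Hypothetical stand-in for a sequence file writer; NOT the real Hadoop class.
final class TypedWriterSketch {
    private final Class<?> valueClass; // the class declared for output values
    private int written = 0;

    TypedWriterSketch(Class<?> valueClass) {
        this.valueClass = valueClass;
    }

    // Accepts a value only if its runtime class is exactly the declared class.
    void append(Object value) throws IOException {
        if (value.getClass() != valueClass) {
            throw new IOException("wrong value class: "
                    + value.getClass().getName() + " is not " + valueClass);
        }
        written++;
    }

    int written() {
        return written;
    }
}
```

Under this reading, forcing the declared value class to VectorWritable only changes what the writer expects. If the converter still hands over Text objects, every append is rejected, which would be consistent with the trace above.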

drahcos avatar May 08 '14 22:05 drahcos

I think I'm nearing the limit on my ability to help you with this, but let me go over my assumptions here once more:

You have an existing sequence file dataset, whose keys are Text and values VectorWritable.

You'd like to load this data into pig, filter the (key, value) pairs based on keys, then write the remaining (key, value) pairs back out into another sequence file.

You won't touch the values at all, but need to write them through to output.

If these assumptions are correct, you should be able to do this with something resembling the following:

REGISTER '$ELEPH_LIBS/elephant-bird-core-.jar';
REGISTER '$ELEPH_LIBS/elephant-bird-pig-.jar';
REGISTER 'path/to/mahout-math.jar';

%declare SEQFILE_STORAGE 'com.twitter.elephantbird.pig.store.SequenceFileStorage';
%declare SEQFILE_LOADER 'com.twitter.elephantbird.pig.load.SequenceFileLoader';
%declare TEXT_CONVERTER 'com.twitter.elephantbird.pig.util.TextConverter';
%declare GENERIC_CONVERTER 'com.twitter.elephantbird.pig.util.GenericWritableConverter';
%declare VECTOR_WRITABLE 'org.apache.mahout.math.VectorWritable';

-- load existing data, resulting schema is (key: chararray, value: bytearray)
entry = LOAD 'seqfile_data' USING $SEQFILE_LOADER(
  '-c $TEXT_CONVERTER', '-c $GENERIC_CONVERTER'
);

-- filter entries
entry_filtered = FILTER entry BY key == 'something';

-- store remaining entries into new sequence file
STORE entry_filtered INTO 'seqfile_data_filtered' USING $SEQFILE_STORAGE(
  '-c $TEXT_CONVERTER', '-c $GENERIC_CONVERTER -t $VECTOR_WRITABLE'
);

sagemintblue avatar May 08 '14 22:05 sagemintblue

Oh my god! You don't know how happy I am right now xD. Everything works perfectly! I need this data for my thesis and I spent so much time on things that were actually not part of my work because so much stuff went wrong. Honestly! Thank you!

drahcos avatar May 09 '14 00:05 drahcos

Oh! Btw., it needs the mahout-core.jar. Thank you again! :D

drahcos avatar May 09 '14 00:05 drahcos

Glad that things finally worked. Good luck with your thesis. Thanks Andy for helping out.

We need to look into how the error message could have been clearer. If a jar is missing, the actual error should be about a missing class. That would have saved a lot of time.
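One way such a check could look, as a rough sketch only: resolve the configured class name up front and fail with a message that names the missing class. `ClassPresenceCheck`, `requireClass`, and the `jarHint` parameter are hypothetical and not part of elephant-bird; the sketch assumes the class name is available as a string before the job starts.

```java
// Hypothetical helper; not part of elephant-bird or Pig.
final class ClassPresenceCheck {
    private ClassPresenceCheck() {}

    // Resolves className without initializing it, or fails with a message
    // that names the missing class and the jar that probably provides it.
    static Class<?> requireClass(String className, String jarHint) {
        try {
            return Class.forName(className, false,
                    ClassPresenceCheck.class.getClassLoader());
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException("Class " + className
                    + " was not found on the classpath; is " + jarHint
                    + " registered?", e);
        }
    }
}
```

In this thread, a call such as requireClass("org.apache.mahout.math.VectorWritable", "mahout-core.jar") would have pointed directly at the jar that turned out to be missing.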

rangadi avatar May 12 '14 16:05 rangadi