Added Stanford tokenizer, sentence splitter, and part-of-speech tagging to varaha.text

Open rjurney opened this issue 11 years ago • 4 comments

register ../../lib/stanford-postagger-withModel.jar;
register ../../target/varaha-1.0-SNAPSHOT.jar;

reviews = LOAD 'data/ten.avro' USING AvroStorage;
foo = FOREACH reviews GENERATE business_id, varaha.text.StanfordTokenize(text) AS tagged;
DUMP foo;

(41J1FgfIsmsLRCZ3QILG6w,{(truly),(impressive),(facility),(came),(for),(two),(books),(not),(knowing),(this),(location),(-LRB-),(normally),(Appaloosa),(-RRB-),(The),(staff),(was),(very),(helpful),(and),(found),(what),(wanted),(very),(quickly),(was),(there),(minutes),(tops),(would),(highly),(recommend),(this),(Library),(anyone),(interested),('ll),(coming),(back),(very),(soon),(for),(next),(batch)})
(4YX4ZtUqs6xtcc4AdjbpeQ,{(Other),(circle),(are),(much),(cleaner),(than),(this),(one),(The),(best),(thing),(about),(this),(store),(the),(Employees),(are),(friendly),(and),(nice),('ve),(been),(this),(location),(the),(morning),(and),(the),(evening),(and),(there),(must),(point),(where),(the),(shift),(changes),(and),(they),(stop),(cleaning),(the),(bathrooms),(and),(emptying),(the),(trash),(the),(morning),(everything),(clean),(the),(time),(evening),(rolls),(around),(there),(are),(odd),(smells),(all),(over),(the),(store),(shame),(since),(larger),(newer),(looking),(store),(that),(n't),(cleaner),('ll),(back),(hopes),(they),(clean),(little),(more)})
(5kRug3bEienrpovtPRVVwg,{(Went),(with),(husband),(Richardson),(Rokerij),(for),(the),(first),(time),(raved),(about),(this),(place),(went),(Wednesday),(night),(with),(reservation),(The),(wait),(was),(about),(hour),(Luckily),(there),(were),(bar),(seats),(that),(became),(available),(took),(them),(ordered),(the),(cheese),(flatbread),(appetizer),(and),(was),(delicious),(had),(large),(salad),(for),(dinner),(which),(was),(perfect),(was),(not),(very),(hungry),(husband),(had),(the),(chicken),(enchiladas),(that),(tasted),(and),(were),(very),(good),(The),(food),(cooked),(order),(did),(take),(while),(get),(our),(meal),(but),(was),(worth),(the),(wait),(and),(service),(was),(excellent),(While),(waiting),(chatted),(with),(several),(people),(the),(bar),(and),(one),(couple),(offered),(taste),(their),(appetizer),(returned),(the),(favor),(when),(flatbread),(came),(One),(more),(thing),(not),(leave),(without),(getting),(the),(decadent),(truffle),(dessert),(Heavenly),(but),(not),(over),(done),(any),(way),(All),(all),(great),(experience),(recommend),(reservations)})

reviews = LOAD 'data/ten.avro' USING AvroStorage();
reviews = LIMIT reviews 1000;
bar = FOREACH reviews GENERATE business_id, FLATTEN(varaha.text.SentenceTokenize(text)) AS tokenized_sentences;
bar = FOREACH bar GENERATE business_id, varaha.text.StanfordPOSTagger(tokenized_sentences) AS tagged;
DUMP bar;

(6VRbbNQe5ouWmwsMebUMkg,{(My,PRP$),(friend,NN),(added,VBD),(some,DT),(sugar,NN),(to,TO),(it,PRP),(and,CC),(it,PRP),(turned,VBD),(okay/good,NN),(.,.)})
(6VRbbNQe5ouWmwsMebUMkg,{(Entrees,NNS),(average,VBP),(about,IN),($,$),(10,CD),(-,:),($,$),(13,CD),(.,.)})
(6VRbbNQe5ouWmwsMebUMkg,{(Naan,NN),(ranges,NNS),(from,IN),(about,IN),($,$),(1.50,CD),(-,:),($,$),(3,CD),(.,.)})
(6VRbbNQe5ouWmwsMebUMkg,{(Appetizers,NNS),(during,IN),(happy,JJ),(hour,NN),(range,NN),(from,IN),($,$),(3,CD),(-,:),($,$),(8,CD),(+,CC),(.,.)})
(6VRbbNQe5ouWmwsMebUMkg,{(Add,VB),(in,IN),(alcohol,NN),(and,CC),(you,PRP),('re,VBP),(looking,VBG),(at,IN),(a,DT),(not,RB),(inexpensive,JJ),(meal,NN),(but,CC),(definitely,RB),(good,JJ),(quality,NN),(.,.)})
(6oRAC4uyJCsJl1X0WZpVSA,{(love,VB),(the,DT),(gyro,NN),(plate,NN),(.,.)})
(6oRAC4uyJCsJl1X0WZpVSA,{(Rice,NNP),(is,VBZ),(so,RB),(good,JJ),(and,CC),(I,PRP),(also,RB),(dig,VBP),(their,PRP$),(candy,NN),(selection,NN),(:,:),(-RRB-,-RRB-)})

reviews = LOAD 'data/ten.avro' USING AvroStorage();
reviews = LIMIT reviews 1000;
bar = FOREACH reviews GENERATE business_id, varaha.text.StanfordPOSTagger(varaha.text.StanfordTokenize(text)) AS tokens;
DUMP bar;

(-UnYs8XvV1M983xZoREdng,{(have,VB),(say,VB),(loved,NN),(Vino,NNP),(First,NNP),(off,RB),(very,RB),(unpretentious,JJ),(not,RB),(very,RB),(knowledgeable,JJ),(about,IN),(wine,NN),(tend,VBP),(shy,JJ),(away,RB),(from,IN),(places,NNS),(that,WDT),(have,VBP),(attitude,NN),(also,RB),(had,VBD),(one,CD),(the,DT),(1000,CD),(outstanding,JJ),(Groupons,NNS),(about,IN),(expire,VBP),(And,CC),(spite,NN),(the,DT),(fact,NN),(that,IN),(just,RB),(about,IN),(everyone,NN),(coming,VBG),(that,IN),(evening,NN),(had,VBD),(Groupon,NNP),(the,DT),(staff,NN),(was,VBD),(fantastic,JJ),(they,PRP),(not,RB),(have,VBP),(kitchen,NN),(all,DT),(appetizers,NNS),(are,VBP),(cold,JJ),(but,CC),(had,VBD),(nice,JJ),(cheese,NN),(plate,NN),(which,WDT),(included,VBD),(cheeses,NNS),(olives,NNS),(nuts,NNS),(grapes,NNS),(and,CC),(dried,VBD),(fruit,NN),(only,RB),(complaint,NN),(was,VBD),(that,IN),(the,DT),(lahvosh-like,JJ),(crackers,NNS),(were,VBD),(really,RB),(oily,JJ),(and,CC),(not,RB),(good,JJ),(all,DT),(Lose,VB),(those,DT),(and,CC),(would,MD),(have,VB),(been,VBN),(much,RB),(better,RBR),(for,IN),(the,DT),(wine,NN),(was,VBD),(actually,RB),(better,JJR),(than,IN),(expected,VBN),(Although,IN),(n't,RB),(generally,RB),(care,VB),(for,IN),(really,RB),(sweet,JJ),(wines,NNS),(both,CC),(the,DT),(Summer,NN),(Rain,NN),(and,CC),(Peachy,JJ),(Keen,JJ),(were,VBD),(really,RB),(enjoyable,JJ),(just,RB),(think,VB),(them,PRP),(more,RBR),(crisp,JJ),(summer,NN),(beverage,NN),(than,IN),(wine,NN),(was,VBD),(surprised,VBN),(like,IN),(the,DT),(Pinot,NNP),(Grigio,NNP),(much,RB),(did,VBD),(and,CC),(may,MD),(have,VB),(purchased,VBN),(bottle,NN),(but,CC),(was,VBD),(not,RB),(available,JJ),(that,IN),(evening,NN),(The,DT),(Miscela,NNP),(Italian,NNP),(blend,VB),(was,VBD),(miss,VB),(for,IN),(-LRB-,-LRB-),(too,RB),(acidic,JJ),(for,IN),(taste,NN),(-RRB-,-RRB-),(but,CC),(the,DT),(Malbec,NNP),(was,VBD),(better,JJR),(For,IN),(after,IN),(dinner,NN),(wines,NNS),(the,DT),(Grande,NNP),(Finale,NNP),(was,VBD),(over-the-top,JJ),(sweet,JJ),(would,MD),(probably,RB),(not,RB),(drink,VB),(more,JJR),(than,IN),(tasting,NN),(The,DT),(Porto,NNP),(Cocoa,NNP),(however,RB),(was,VBD),(fantastic,JJ),(generally,RB),(stay,VB),(away,RB),(from,IN),(Port,NNP),(because,IN),(dislike,NN),(the,DT),(brandy,NN),(burn,VBP),(But,CC),(one,CD),(whiff,NN),(this,DT),(and,CC),(was,VBD),(hooked,VBN),(before,IN),(tasted,VBN),(While,IN),(not,RB),(like,IN),(terribly,RB),(sweet,JJ),(you,PRP),(definitely,RB),(get,VBP),(the,DT),(essence,NN),(chocolate,NN),(bought,VBD),(bottle,NN),(take,VB),(home,NN),(fact,NN),(but,CC),(only,RB),(saw,VBD),(one,CD),(wee,NN),(little,JJ),(glass,NN),(husband,NN),(apparently,RB),(mistook,VBD),(for,IN),(Yoo-hoo,NN),(and,CC),(drank,VBD),(the,DT),(rest,NN),(Great,JJ),(place,NN),(begin,VB),(your,PRP$),(evening,NN),(And,CC),(because,IN),(many,JJ),(these,DT),(young,JJ),(wines,NNS),(are,VBP),(sweeter,JJR),(even,RB),(non-wine-drinking,JJ),(husband,NN),(enjoyed,VBN)})

Dec 24 '13 18:12 rjurney

@rjurney Would you mind squashing these commits so I can look at a single diff?

Dec 29 '13 15:12 alienrobotwizard

Yeah, I can do that. I think you can also do that in the interface?

Dec 29 '13 17:12 rjurney

Sorry for taking so long to get to this. Overall it looks good. Can we put the UDFs that rely strictly on the Stanford NLP package in their own namespace? varaha.text is getting a little crowded.

Jan 15 '14 00:01 alienrobotwizard
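
For illustration only, here is roughly how a script might call the Stanford-backed UDFs once they live in their own namespace. The package name varaha.text.stanford is a hypothetical placeholder; the thread does not settle on a name. The DEFINE aliases are optional and just keep the longer qualified names out of the FOREACH.

```pig
-- Sketch only: varaha.text.stanford is an assumed package name, not one chosen in this thread.
REGISTER ../../lib/stanford-postagger-withModel.jar;
REGISTER ../../target/varaha-1.0-SNAPSHOT.jar;

-- Short aliases so the script body does not repeat the fully qualified class names.
DEFINE StanfordTokenize  varaha.text.stanford.StanfordTokenize();
DEFINE StanfordPOSTagger varaha.text.stanford.StanfordPOSTagger();

reviews = LOAD 'data/ten.avro' USING AvroStorage();
tagged = FOREACH reviews GENERATE business_id, StanfordPOSTagger(StanfordTokenize(text)) AS tokens;
DUMP tagged;
```

With aliases like these, a later package rename would only touch the DEFINE lines rather than every FOREACH that calls the UDFs.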

Yeah, I'll do that.

Jan 15 '14 02:01 rjurney