kb2vec

Construct entity sense embeddings using DT and DeepWalk

Open alexanderpanchenko opened this issue 7 years ago • 15 comments

Background

  • Sense Embeddings: https://arxiv.org/pdf/1805.04032.pdf

  • Our sense embedding approach: http://aclweb.org/anthology/W16-1620

Data

  1. A Distributional Thesaurus (DT)
  • http://panchenko.me/data/joint/dt/common-crawl-2016/
  • panchenko@ltdata1:/srv/data/depcc/distributional-models
  • use the model 1000-2000: http://panchenko.me/data/joint/dt/common-crawl-2016/dependency_lemz-true_cooc-false_mxln-110_semf-true_sign-LMI_wpf-1000_fpw-2000_minw-5_minf-5_minwf-2_minsign-0.0_nnn-200/SimPruned/
  • For your reference - these are computed from this corpus: panchenko@ltdata1:/srv/data/depcc/corpus/sentences/cc-2016-en-nohtml-nonoise-sort.txt.gz
  2. Training datasets
  • https://docs.google.com/spreadsheets/d/1reP1Lk2UbxTDZtC7K6LmiXdfeEIWKB432hMTCcB1U5c/edit?usp=sharing

  • vocabulary of the entities: https://docs.google.com/spreadsheets/d/1umTW0h8hGKqN1NSEpgds36qfhFZC4VO5dBjQ940dUY4/edit?usp=sharing

Code

  • WSI: https://github.com/uhh-lt/chinese-whispers ; a more memory-efficient WSI implementation: https://github.com/nlpub/watset-java

  • Disambiguate sense clusters: https://github.com/uhh-lt/sensegram/blob/master/pcz/make_closure.py

Steps

  1. Take the DT and compute its coverage of the target entities listed in https://docs.google.com/spreadsheets/d/1umTW0h8hGKqN1NSEpgds36qfhFZC4VO5dBjQ940dUY4/edit?usp=sharing. Report the coverage here.

  2. Build a graph from the DT and compute its graph embeddings using DeepWalk.

  • Prune edges with very small scores (e.g. t < 0.001) from the graph.
  • Alternatively, or additionally, build a graph of the target entities and all related words.
  3. Report some nearest neighbors of a few entities here, such as Michael Jordan.

  4. Create a disambiguated graph of senses using the provided code.

  5. Compute embeddings from the graph of senses using DeepWalk, as before. Report the sense nearest neighbors.
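
A minimal sketch of the coverage computation in step 1. It assumes the DT is stored as tab-separated lines of the form `term<TAB>neighbour<TAB>score` and that the target entities have been exported from the spreadsheet into a plain set of strings; both are assumptions, and the function names are hypothetical.

```python
def covered_targets(dt_lines, targets):
    """Return the target entities that occur in either term column of the DT.

    dt_lines: iterable of 'term<TAB>neighbour<TAB>score' strings (assumed format).
    targets:  set of target entity strings.
    Only the matched targets are kept in memory, so the DT can be streamed.
    """
    found = set()
    for line in dt_lines:
        parts = line.rstrip("\n").split("\t")
        for term in parts[:2]:
            if term in targets:
                found.add(term)
    return found


def coverage_ratio(found, targets):
    """Fraction of target entities that were seen in the DT."""
    return len(found) / len(targets) if targets else 0.0
```

For the real DT, `dt_lines` could be a file object opened with `gzip.open(path, "rt", encoding="utf-8")`, assuming the files are gzip-compressed; the targets missing from the DT are then `targets - found`.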

alexanderpanchenko avatar Dec 17 '18 14:12 alexanderpanchenko

image

alexanderpanchenko avatar Dec 17 '18 14:12 alexanderpanchenko

https://docs.google.com/spreadsheets/d/1umTW0h8hGKqN1NSEpgds36qfhFZC4VO5dBjQ940dUY4/edit?usp=sharing

alexanderpanchenko avatar Jan 24 '19 13:01 alexanderpanchenko

Use OpenNE: https://github.com/thunlp/openne

alexanderpanchenko avatar Jan 31 '19 13:01 alexanderpanchenko

Disambiguation: https://github.com/nlpub/watset-java#watset-1

alexanderpanchenko avatar Jan 31 '19 13:01 alexanderpanchenko

t >= 0.0005

  • number of edges: 2.28B
  • number of nodes: 12.5M
  • number of MWE: 5.8M
  • size: 31G (edgelist)

alexanderpanchenko avatar Jan 31 '19 13:01 alexanderpanchenko

Running DeepWalk on a graph of target entities with score threshold >= 0.001 (an edge is kept only if both words of the DT pair are target entities):

  • number of edges: 27786
  • number of nodes: 583
  • number of MWE: 583
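
The filtering described above could be sketched as follows. The tab-separated `term<TAB>neighbour<TAB>score` layout and the function name `filter_edges` are assumptions made for illustration, not the code actually used for this run. Setting `require_both=False` gives the variant in which at least one word of the pair is a target entity.

```python
def filter_edges(dt_lines, targets, threshold=0.001, require_both=True):
    """Yield (term, neighbour, score) edges that pass the score and entity filters.

    dt_lines: iterable of 'term<TAB>neighbour<TAB>score' strings (assumed format).
    targets:  set of target entity strings.
    """
    for line in dt_lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 3:
            continue  # skip malformed rows
        src, dst = parts[0], parts[1]
        try:
            score = float(parts[2])
        except ValueError:
            continue  # skip rows whose score is not numeric
        if score < threshold:
            continue
        hits = (src in targets) + (dst in targets)
        if hits == 2 or (hits == 1 and not require_both):
            yield src, dst, score
```

The node count of the resulting graph is the size of the set of all `src` and `dst` values that are yielded.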

Here are the top 5 similar words for a random selection of target words:

Target Word | Top Similar | Score
consumer | isdn | 0.3410990238189697
consumer | emotional | 0.3382205367088318
consumer | web developers | 0.3198750913143158
consumer | u.s. government | 0.31877732276916504
consumer | human | 0.30722925066947937
house | colorado house of representatives | 0.3676622807979584
house | streets | 0.3613239526748657
house | walls | 0.36042022705078125
house | room | 0.3583110272884369
house | houses | 0.3436395525932312
currency | trade imbalances | 0.5089859962463379
currency | renminbi | 0.45905500650405884
currency | short-term | 0.4205421209335327
currency | percentage | 0.37381264567375183
currency | cash | 0.3512609899044037
marijuana | medical marijuana | 0.4971558451652527
marijuana | illegal drugs | 0.4261803925037384
marijuana | worries | 0.4189271330833435
marijuana | internet privacy | 0.4140332341194153
marijuana | cancer | 0.41302281618118286
september | los angeles | 0.43724992871284485
september | philadelphia | 0.4330790042877197
september | friday | 0.4258647561073303
september | paris | 0.42521965503692627
september | sunday | 0.4248284101486206
green | carter | 0.3986145853996277
green | armstrong | 0.3940846920013428
green | shaped | 0.38679683208465576
green | garlic | 0.3841942548751831
green | sharp | 0.38037604093551636
japan | switzerland | 0.631550669670105
japan | italy | 0.5903221368789673
japan | russia | 0.5830407738685608
japan | spain | 0.5740169882774353
japan | europe | 0.5568861961364746
team | group of five | 0.5515682101249695
team | coaches | 0.3639141023159027
team | basketball | 0.3395881652832031
team | sponsors | 0.3303707242012024
team | soccer | 0.31407982110977173
husband | father | 0.4575813412666321
husband | mother | 0.4007580280303955
husband | women | 0.3704793453216553
husband | families | 0.35903018712997437
husband | years | 0.34233951568603516
future | far east | 0.31018203496932983
future | economic | 0.3064412772655487
future | popularity | 0.2993215322494507
future | republic | 0.2954373359680176
future | consumer | 0.2902454733848572
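
A sketch of how a top-5 list like the one above could be derived from the trained embeddings. It assumes the reported scores are cosine similarities, which the thread does not state explicitly, and that the embeddings are available as a plain dict from node label to vector; `top_k_similar` is a hypothetical helper.

```python
import math


def top_k_similar(embeddings, word, k=5):
    """Return the k (label, cosine similarity) pairs most similar to `word`.

    embeddings: dict mapping node label -> list of floats (assumed layout).
    """
    query = embeddings[word]
    query_norm = math.sqrt(sum(x * x for x in query))
    scored = []
    for other, vec in embeddings.items():
        if other == word:
            continue  # a word is not its own neighbour
        norm = query_norm * math.sqrt(sum(x * x for x in vec))
        if norm == 0.0:
            continue  # skip all-zero vectors
        dot = sum(a * b for a, b in zip(query, vec))
        scored.append((other, dot / norm))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

This brute-force loop is linear in the number of nodes per query, which is fine for a 583-node graph; larger graphs would normally use a vectorised or approximate nearest-neighbour search instead.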

m-dorgham avatar Feb 06 '19 07:02 m-dorgham

Thanks!

Please also compute the graph and the embeddings for the case where at least ONE word of the pair is a target entity. This graph seems to be too small...


alexanderpanchenko avatar Feb 06 '19 09:02 alexanderpanchenko

I am working on it now.

m-dorgham avatar Feb 06 '19 09:02 m-dorgham

Running DeepWalk on a graph built around the target entities with score threshold >= 0.005 (an edge is kept if either word of the DT pair is a target entity):

  • number of edges: 606272
  • number of nodes: 94599

Here are the top 5 similar words for a random selection of target words:

Target Word | Top Similar | Score
consumer | searcher | 0.8687483668327332
consumer | smbs | 0.8651067018508911
consumer | purchasing | 0.8617939949035645
consumer | filer | 0.861585795879364
consumer | consumer credit | 0.8607973456382751
house | about the house | 0.8604483008384705
house | inside yourself | 0.8564826250076294
house | womb | 0.851717472076416
house | straddle | 0.8488346338272095
house | sat in | 0.8479311466217041
currency | monetary system | 0.9247364401817322
currency | energy prices | 0.9235352277755737
currency | share prices | 0.9232353568077087
currency | the debt | 0.9229246973991394
currency | revaluation | 0.9228223562240601
marijuana | downer | 0.9509781002998352
marijuana | medical use | 0.9482918977737427
marijuana | street drug | 0.9478726387023926
marijuana | amounts of marijuana | 0.9475768804550171
marijuana | laxative | 0.9466262459754944
september | 7 september | 0.9549092650413513
september | 19 march | 0.9534833431243896
september | 9th april | 0.9526151418685913
september | may 13 | 0.9519420862197876
september | august 5 | 0.9516847133636475
green | squishy | 0.8713904619216919
green | green and purple | 0.8638914227485657
green | erskine | 0.8631000518798828
green | par-4 | 0.8625338077545166
green | ham hocks | 0.8622951507568359
team | academic all-district | 0.9194386005401611
team | review committee | 0.9182767868041992
team | most valuable player | 0.9173357486724854
team | breakaway | 0.916580319404602
team | nhl teams | 0.9163451194763184
husband | the widower | 0.9049073457717896
husband | and my family | 0.9009094834327698
husband | allegedly | 0.8996278643608093
husband | her cat | 0.8968359231948853
husband | & i | 0.8958627581596375
future | full scale | 0.861075758934021
future | economic future | 0.8609667420387268
future | 20th-ranked | 0.8600075244903564
future | the pipeline | 0.8599571585655212
future | in the way | 0.8586423993110657

m-dorgham avatar Feb 07 '19 21:02 m-dorgham

Thanks! Please also try the same with a smaller threshold.


alexanderpanchenko avatar Feb 07 '19 22:02 alexanderpanchenko

Running DeepWalk on a graph built around the target entities with score threshold >= 0.001 (an edge is kept if either word of the DT pair is a target entity):

  • number of edges: 4.5M
  • number of nodes: 1M

Here are the top 5 similar words for a random selection of target words:

Target Word | Top Similar | Score
consumer | american police | 0.9119635820388794
consumer | home-office | 0.9119197726249695
consumer | crossborder | 0.9113563299179077
consumer | climate depot | 0.9112616181373596
consumer | application-to-application | 0.9107170104980469
house | cross breeding | 0.8913893103599548
house | hereditaments | 0.8868686556816101
house | cricket pavilion | 0.886616587638855
house | borkman | 0.8865180015563965
house | ancient demesne | 0.885732114315033
currency | non-alcoholic beverage | 0.9314658641815186
currency | foreign exchange controls | 0.9305030107498169
currency | electroweak | 0.9282821416854858
currency | baby step | 0.9277881979942322
currency | perma-recession | 0.9275274872779846
marijuana | mood-altering | 0.9447073936462402
marijuana | semi-automatic handguns | 0.9442516565322876
marijuana | secondary infections | 0.9442481398582458
marijuana | telazol | 0.9437323808670044
marijuana | tylenol 4 | 0.943712592124939
september | 10 november | 0.9426672458648682
september | 4to | 0.9422460794448853
september | 1st of january | 0.9421955347061157
september | novembris | 0.9418485164642334
september | 29th of april | 0.941774845123291
green | regreen | 0.895431399345398
green | lanceolate-ovate | 0.8927381634712219
green | thomas thomas | 0.8921957015991211
green | nerved | 0.8914358615875244
green | eastern amberwing | 0.8913373947143555
japan | free japan | 0.9051642417907715
japan | doretree.ps | 0.9007513523101807
japan | sandai | 0.9007253050804138
japan | deep waters | 0.8995295763015747
japan | honen | 0.8988988399505615
team | spending some time | 0.9124458432197571
team | caretaker cabinet | 0.9120877981185913
team | 16-3a | 0.9115908741950989
team | kirsten olson | 0.9113165140151978
team | never have i ever | 0.9111958146095276
husband | christopher cassidy | 0.927372932434082
husband | phjc | 0.9271283149719238
husband | the beagle | 0.9264905452728271
husband | rick husband | 0.9253169298171997
husband | thomas cornwall | 0.9246828556060791
future | pre-unibody | 0.8833345174789429
future | in combo | 0.8828374147415161
future | dloc | 0.8823624849319458
future | future-style | 0.8822189569473267
future | atv-3 | 0.8817929625511169

m-dorgham avatar Feb 08 '19 16:02 m-dorgham

Running DeepWalk on a graph built around the target entities with score threshold >= 0.05 (an edge is kept if either word of the DT pair is a target entity):

  • number of edges: 98723
  • number of nodes: 18555

Here are the top 5 similar words for a random selection of target words:

Target Word | Top Similar | Score
consumer | grower | 0.9421052932739258
consumer | small businesses | 0.9418331384658813
consumer | local residents | 0.9415016770362854
consumer | legislator | 0.9412452578544617
consumer | traveler | 0.9407768249511719
house | held in | 0.8956223130226135
house | built in | 0.8950538635253906
house | situate | 0.8934398293495178
house | the lodge | 0.8932690620422363
house | old house | 0.8921085596084595
currency | foreign currencies | 0.966582179069519
currency | paper money | 0.9638655185699463
currency | exchange rate | 0.961968183517456
currency | banknote | 0.9612525701522827
currency | local currency | 0.9610164165496826
marijuana | valium | 0.9826380014419556
marijuana | oxycontin | 0.9818234443664551
marijuana | xanax | 0.9816401600837708
marijuana | drug use | 0.9815698266029358
marijuana | prostitution | 0.9814484119415283
september | 11 september | 0.9567168951034546
september | december 2 | 0.9558318257331848
september | december 2006 | 0.9531803131103516
september | august 25 | 0.9528341293334961
september | july 5 | 0.9526382684707642
green | lush | 0.8624919652938843
green | potato salad | 0.8573336601257324
green | milky | 0.8552415370941162
green | big green | 0.8520586490631104
green | wild rice | 0.8512767553329468
japan | latin | 0.814257025718689
japan | pornstar | 0.8112776279449463
japan | pov | 0.8069230318069458
japan | anal | 0.8066928386688232
japan | upskirt | 0.8064553141593933
team | go wrong | 0.9578524827957153
team | the jets | 0.9575368762016296
team | working | 0.9559540152549744
team | partnership | 0.9557927846908569
team | new deal | 0.9550840258598328
husband | great-grandchild | 0.9026515483856201
husband | my partner | 0.8973795771598816
husband | youngest son | 0.8966842889785767
husband | myself | 0.8943066596984863
husband | five brothers | 0.8937571048736572
future | for the future | 0.861904501914978
future | prospects | 0.8586996793746948
future | our future | 0.8576727509498596
future | unauthorized | 0.8575726747512817
future | and future | 0.8547124266624451

m-dorgham avatar Feb 08 '19 16:02 m-dorgham

  1. Generate human-readable graphs (as below).

  2. Apply the new implementation of DW.

  3. Try to run NN.py and integrate some DW embeddings in place of the word embeddings.

alexanderpanchenko avatar Feb 08 '19 17:02 alexanderpanchenko

Running weighted DeepWalk with score threshold >= 0.05 (an edge is kept if either word of the DT pair is a target entity):

  • number of edges: 98723
  • number of nodes: 18555
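
For reference, one way to read "weighted DeepWalk" is that each random-walk step picks the next node with probability proportional to the DT score of the connecting edge. The sketch below illustrates that idea only; it is not the implementation used for this run, it treats the graph as undirected (an assumption), and the helper names are hypothetical.

```python
import random


def build_adjacency(edges):
    """Map each node to parallel lists of neighbours and edge weights.

    edges: iterable of (term, neighbour, score); treated as undirected (assumption).
    """
    adjacency = {}
    for src, dst, weight in edges:
        for a, b in ((src, dst), (dst, src)):
            neighbours, weights = adjacency.setdefault(a, ([], []))
            neighbours.append(b)
            weights.append(weight)
    return adjacency


def weighted_walk(adjacency, start, length, rng):
    """Random walk of up to `length` nodes; steps are sampled by edge weight."""
    walk = [start]
    while len(walk) < length:
        neighbours, weights = adjacency.get(walk[-1], ([], []))
        if not neighbours:
            break  # isolated or unknown node: the walk stops early
        walk.append(rng.choices(neighbours, weights=weights, k=1)[0])
    return walk
```

As in plain DeepWalk, the walks would then be treated as sentences and passed to a skip-gram trainer to obtain the node embeddings. Passing a seeded `random.Random` instance as `rng` keeps the walks reproducible.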

Here are the top 5 similar words for a random selection of target words:

Target Word | Top Similar | Score
consumer | policymaker | 0.8881332874298096
consumer | small businesses | 0.8821403980255127
consumer | the employees | 0.878178060054779
consumer | end user | 0.8779093027114868
consumer | insurer | 0.8755150437355042
house | dates back | 0.7990361452102661
house | confine | 0.7966662049293518
house | guesthouse | 0.7817363739013672
house | brick building | 0.7798301577568054
house | relocate | 0.778267502784729
currency | national currency | 0.9425667524337769
currency | foreign currencies | 0.9253323078155518
currency | foreign currency | 0.9250749945640564
currency | exchange rate | 0.9230431318283081
currency | banknote | 0.9196319580078125
marijuana | opiate | 0.9636061191558838
marijuana | mdma | 0.9632755517959595
marijuana | drug use | 0.9631142020225525
marijuana | paraphernalia | 0.9625576734542847
marijuana | methamphetamine | 0.9620795845985413
september | december 18 | 0.8988827466964722
september | 1 january | 0.8937128186225891
september | 8 may | 0.8925288319587708
september | june 2004 | 0.8919708728790283
september | 20 december | 0.8914341926574707
green | chicory | 0.7948091626167297
green | herbal | 0.7785643339157104
green | white rice | 0.7748297452926636
green | endive | 0.7694216370582581
green | mixed greens | 0.7671592235565186
japan | gangbang | 0.7603057622909546
japan | virgin | 0.7500285506248474
japan | cum | 0.7496156096458435
japan | pornstar | 0.74860018491745
japan | porno | 0.7465837001800537
team | the ravens | 0.8964948654174805
team | bears | 0.8963637351989746
team | collaborate | 0.8928627967834473
team | ink | 0.8916406631469727
team | berth | 0.890237033367157
husband | infant son | 0.8357791304588318
husband | a son | 0.8309561014175415
husband | my husband | 0.8303523063659668
husband | one son | 0.8279650807380676
husband | grandchild | 0.8271342515945435
future | successive | 0.7763746976852417
future | adulthood | 0.7595061659812927
future | prospects | 0.7573720216751099
future | trading | 0.7554386854171753
future | our future | 0.7527661323547363

m-dorgham avatar Feb 15 '19 15:02 m-dorgham

Running weighted DeepWalk with score threshold >= 0.01 (an edge is kept if either word of the DT pair is a target entity):

  • number of edges: 606272
  • number of nodes: 94599

Here are the top 5 similar words for a random selection of target words:

Target Word | Top Similar | Score
consumer | ratepayer | 0.7707870602607727
consumer | privacy advocates | 0.7688286304473877
consumer | refiner | 0.7598727941513062
consumer | senior executives | 0.759567379951477
consumer | american families | 0.7533906698226929
house | the whitehouse | 0.728789746761322
house | the only place | 0.7084541916847229
house | fixed position | 0.7049554586410522
house | mud huts | 0.70444655418396
house | spending the night | 0.7022849321365356
currency | commodities | 0.8456035256385803
currency | fiat money | 0.834477961063385
currency | functional currency | 0.833740770816803
currency | payment system | 0.8276917934417725
currency | inflation rate | 0.8271000981330872
marijuana | anabolic steroids | 0.8885903358459473
marijuana | tylenol | 0.886154294013977
marijuana | illegal drugs | 0.8843708038330078
marijuana | pain killers | 0.8840863108634949
marijuana | addictive drug | 0.8835833668708801
september | september 18th | 0.8800647258758545
september | march 3 | 0.8782463073730469
september | october 1994 | 0.8751147985458374
september | september 22nd | 0.8749075531959534
september | 30 december | 0.8726902604103088
green | unflavored | 0.7157471179962158
green | primeval | 0.7130362391471863
green | glabrescent | 0.7099554538726807
green | swampy | 0.7083736658096313
green | wild green | 0.7005715370178223
japan | pref | 0.7055085897445679
japan | sex anal | 0.6905528903007507
japan | kiddy | 0.6893953680992126
japan | cartoon porn | 0.6866461038589478
japan | tit fuck | 0.6777403354644775
team | relay teams | 0.8330321311950684
team | love-hate relationship | 0.8062129616737366
team | in touch | 0.8061914443969727
team | cutting ties | 0.8052033185958862
team | rugby team | 0.8051971197128296
husband | daughters of | 0.815686047077179
husband | my wifes | 0.8116765022277832
husband | loved mother | 0.7964452505111694
husband | grandkid | 0.7821117043495178
husband | son of james | 0.7787513136863708
future | things to come | 0.713495135307312
future | into the wind | 0.7111729383468628
future | on-again | 0.7105178833007812
future | sort-of | 0.7062144875526428
future | off-again | 0.7049633264541626

m-dorgham avatar Feb 15 '19 15:02 m-dorgham