Construct entity sense embeddings using DT and DeepWalk
Background
- Sense Embeddings: https://arxiv.org/pdf/1805.04032.pdf
- Our sense embedding approach: http://aclweb.org/anthology/W16-1620
Data
- A Distributional Thesaurus (DT)
- http://panchenko.me/data/joint/dt/common-crawl-2016/
- panchenko@ltdata1:/srv/data/depcc/distributional-models
- use the model 1000-2000: http://panchenko.me/data/joint/dt/common-crawl-2016/dependency_lemz-true_cooc-false_mxln-110_semf-true_sign-LMI_wpf-1000_fpw-2000_minw-5_minf-5_minwf-2_minsign-0.0_nnn-200/SimPruned/
- For your reference - these are computed from this corpus: panchenko@ltdata1:/srv/data/depcc/corpus/sentences/cc-2016-en-nohtml-nonoise-sort.txt.gz
- Training datasets
- https://docs.google.com/spreadsheets/d/1reP1Lk2UbxTDZtC7K6LmiXdfeEIWKB432hMTCcB1U5c/edit?usp=sharing
- Vocabulary of the entities: https://docs.google.com/spreadsheets/d/1umTW0h8hGKqN1NSEpgds36qfhFZC4VO5dBjQ940dUY4/edit?usp=sharing
Code
- WSI: https://github.com/uhh-lt/chinese-whispers (a more memory-efficient WSI implementation: https://github.com/nlpub/watset-java )
- Disambiguate sense clusters: https://github.com/uhh-lt/sensegram/blob/master/pcz/make_closure.py
Steps
- Take the DT and compute the coverage of the target entities from the entity vocabulary: https://docs.google.com/spreadsheets/d/1umTW0h8hGKqN1NSEpgds36qfhFZC4VO5dBjQ940dUY4/edit?usp=sharing . Report the coverage here.
- Build a graph from the DT and compute its graph embeddings using DeepWalk.
  - Prune edges with very small scores from the graph (e.g. t < 0.001).
  - Alternatively or additionally, build a graph of the target entities and all related words.
- Report here the nearest neighbors of some entities, such as Michael Jordan.
- Create a disambiguated graph of senses using the provided code.
- Compute embeddings from the graph of senses as before, using DeepWalk. Report the sense nearest neighbors.
https://docs.google.com/spreadsheets/d/1umTW0h8hGKqN1NSEpgds36qfhFZC4VO5dBjQ940dUY4/edit?usp=sharing
USE: https://github.com/thunlp/openne
Disambiguation: https://github.com/nlpub/watset-java#watset-1
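For the coverage step, a minimal sketch could look like the following. It assumes the entity vocabulary has been exported from the spreadsheet as a plain list of strings and that DT entries are lowercased; the function name `coverage` and the toy data are hypothetical, not part of any of the linked repositories.

```python
def coverage(target_entities, dt_vocabulary):
    """Return (covered fraction, sorted list of entities missing from the DT).

    Assumption: matching is done on lowercased, whitespace-stripped strings.
    """
    targets = {e.strip().lower() for e in target_entities if e.strip()}
    vocab = {w.strip().lower() for w in dt_vocabulary}
    missing = sorted(targets - vocab)
    covered = len(targets) - len(missing)
    fraction = covered / len(targets) if targets else 0.0
    return fraction, missing


# Toy data standing in for the entity vocabulary and the DT vocabulary.
targets = ["Michael Jordan", "currency", "marijuana", "perma-bull"]
dt_words = ["michael jordan", "currency", "marijuana", "house"]

fraction, missing = coverage(targets, dt_words)
print(f"coverage: {fraction:.2%}, missing: {missing}")
```

Listing the missing entities next to the fraction makes it easier to see whether the gaps are mostly MWEs or spelling variants.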
With threshold t >= 0.0005:
- number of edges: 2.28B
- number of nodes: 12.5M
- number of MWE: 5.8M
- size: 31G (edgelist)
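Numbers like these can be produced in a single streaming pass over the DT. The sketch below assumes a tab-separated `word1<TAB>word2<TAB>score` layout and counts a node as an MWE when it contains a space; both are assumptions, and `prune_dt` is a hypothetical helper.

```python
import io


def prune_dt(lines, out, threshold):
    """Copy rows with score >= threshold to `out` and return graph statistics.

    Assumptions: rows are `word1<TAB>word2<TAB>score`; a node is an MWE
    if it contains a space. Malformed rows are skipped.
    """
    nodes = set()
    n_edges = 0
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue
        source, target, score_text = parts
        try:
            score = float(score_text)
        except ValueError:
            continue
        if score < threshold:
            continue
        out.write(f"{source}\t{target}\t{score_text}\n")
        n_edges += 1
        nodes.add(source)
        nodes.add(target)
    n_mwe = sum(1 for node in nodes if " " in node)
    return {"edges": n_edges, "nodes": len(nodes), "mwe": n_mwe}


# In-memory stand-ins for the DT file and the pruned edge list.
sample = io.StringIO(
    "currency\tforeign currency\t0.12\n"
    "currency\tcash\t0.0004\n"
    "house\told house\t0.03\n"
    "broken row\n"
)
pruned = io.StringIO()
stats = prune_dt(sample, pruned, threshold=0.0005)
print(stats)
```

Only the node set is kept in memory, so the edge list itself never has to fit in RAM.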
Running DeepWalk on a graph of target entities only, with score threshold >= 0.001 (an edge is kept only if both words of the DT pair are target entities):
- number of edges: 27786
- number of nodes: 583
- number of MWE: 583
- file size: 200MB
Here are the top 5 similar words for a random selection of target words:
| Target Word | Top Similar | Score |
|---|---|---|
| consumer | isdn | 0.3410990238189697 |
| consumer | emotional | 0.3382205367088318 |
| consumer | web developers | 0.3198750913143158 |
| consumer | u.s. government | 0.31877732276916504 |
| consumer | human | 0.30722925066947937 |
| house | colorado house of representatives | 0.3676622807979584 |
| house | streets | 0.3613239526748657 |
| house | walls | 0.36042022705078125 |
| house | room | 0.3583110272884369 |
| house | houses | 0.3436395525932312 |
| currency | trade imbalances | 0.5089859962463379 |
| currency | renminbi | 0.45905500650405884 |
| currency | short-term | 0.4205421209335327 |
| currency | percentage | 0.37381264567375183 |
| currency | cash | 0.3512609899044037 |
| marijuana | medical marijuana | 0.4971558451652527 |
| marijuana | illegal drugs | 0.4261803925037384 |
| marijuana | worries | 0.4189271330833435 |
| marijuana | internet privacy | 0.4140332341194153 |
| marijuana | cancer | 0.41302281618118286 |
| september | los angeles | 0.43724992871284485 |
| september | philadelphia | 0.4330790042877197 |
| september | friday | 0.4258647561073303 |
| september | paris | 0.42521965503692627 |
| september | sunday | 0.4248284101486206 |
| green | carter | 0.3986145853996277 |
| green | armstrong | 0.3940846920013428 |
| green | shaped | 0.38679683208465576 |
| green | garlic | 0.3841942548751831 |
| green | sharp | 0.38037604093551636 |
| japan | switzerland | 0.631550669670105 |
| japan | italy | 0.5903221368789673 |
| japan | russia | 0.5830407738685608 |
| japan | spain | 0.5740169882774353 |
| japan | europe | 0.5568861961364746 |
| team | group of five | 0.5515682101249695 |
| team | coaches | 0.3639141023159027 |
| team | basketball | 0.3395881652832031 |
| team | sponsors | 0.3303707242012024 |
| team | soccer | 0.31407982110977173 |
| husband | father | 0.4575813412666321 |
| husband | mother | 0.4007580280303955 |
| husband | women | 0.3704793453216553 |
| husband | families | 0.35903018712997437 |
| husband | years | 0.34233951568603516 |
| future | far east | 0.31018203496932983 |
| future | economic | 0.3064412772655487 |
| future | popularity | 0.2993215322494507 |
| future | republic | 0.2954373359680176 |
| future | consumer | 0.2902454733848572 |
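
The thread does not say how the scores are computed; assuming they are cosine similarities between embedding vectors, a "top similar" table like the one above could be produced as follows. The toy vectors and the helper names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_similar(word, embeddings, k=5):
    """Return the k (neighbour, score) pairs most similar to `word`."""
    query = embeddings[word]
    scored = [
        (other, cosine(query, vector))
        for other, vector in embeddings.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 3-dimensional embeddings; real DeepWalk vectors would be loaded from disk.
toy = {
    "japan": [1.0, 0.0, 0.2],
    "italy": [0.9, 0.1, 0.3],
    "cash": [0.0, 1.0, 0.0],
}
neighbours = top_similar("japan", toy, k=2)
for neighbour, score in neighbours:
    print(f"| japan | {neighbour} | {score} |")
```

This brute-force scan is linear in the vocabulary size per query, which is fine for spot checks on a handful of target words.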
Thanks!
Please also compute the graph and the embeddings for the case where at least ONE word is a target entity. This graph seems to be too small...
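The difference between the two graph variants comes down to one predicate on each DT pair. A minimal sketch, assuming edges are `(word1, word2, score)` tuples; `filter_edges` and its `mode` parameter are hypothetical names:

```python
def filter_edges(edges, targets, threshold, mode="any"):
    """Yield edges with score >= threshold whose endpoints match the target set.

    mode="both": both words must be target entities (the small graph).
    mode="any":  at least one word must be a target entity (the larger graph).
    """
    if mode not in ("any", "both"):
        raise ValueError(f"unknown mode: {mode!r}")
    match = any if mode == "any" else all
    for source, target, score in edges:
        if score >= threshold and match(node in targets for node in (source, target)):
            yield source, target, score


# Toy data; the real targets come from the entity vocabulary spreadsheet.
targets = {"house", "currency", "consumer"}
edges = [
    ("house", "consumer", 0.01),
    ("house", "walls", 0.02),
    ("walls", "streets", 0.5),
    ("currency", "renminbi", 0.0005),
]

both = list(filter_edges(edges, targets, threshold=0.001, mode="both"))
at_least_one = list(filter_edges(edges, targets, threshold=0.001, mode="any"))
print(len(both), len(at_least_one))
```

With `mode="any"` the graph also pulls in the non-target neighbours of each entity, which is why it grows so much faster as the threshold drops.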
I am working on it now.
Running DeepWalk with score threshold >= 0.005 (an edge is kept if at least one word of the DT pair is a target entity):
- number of edges: 606272
- number of nodes: 94599
Here are the top 5 similar words for a random selection of target words:
| Target Word | Top Similar | Score |
|---|---|---|
| consumer | searcher | 0.8687483668327332 |
| consumer | smbs | 0.8651067018508911 |
| consumer | purchasing | 0.8617939949035645 |
| consumer | filer | 0.861585795879364 |
| consumer | consumer credit | 0.8607973456382751 |
| house | about the house | 0.8604483008384705 |
| house | inside yourself | 0.8564826250076294 |
| house | womb | 0.851717472076416 |
| house | straddle | 0.8488346338272095 |
| house | sat in | 0.8479311466217041 |
| currency | monetary system | 0.9247364401817322 |
| currency | energy prices | 0.9235352277755737 |
| currency | share prices | 0.9232353568077087 |
| currency | the debt | 0.9229246973991394 |
| currency | revaluation | 0.9228223562240601 |
| marijuana | downer | 0.9509781002998352 |
| marijuana | medical use | 0.9482918977737427 |
| marijuana | street drug | 0.9478726387023926 |
| marijuana | amounts of marijuana | 0.9475768804550171 |
| marijuana | laxative | 0.9466262459754944 |
| september | 7 september | 0.9549092650413513 |
| september | 19 march | 0.9534833431243896 |
| september | 9th april | 0.9526151418685913 |
| september | may 13 | 0.9519420862197876 |
| september | august 5 | 0.9516847133636475 |
| green | squishy | 0.8713904619216919 |
| green | green and purple | 0.8638914227485657 |
| green | erskine | 0.8631000518798828 |
| green | par-4 | 0.8625338077545166 |
| green | ham hocks | 0.8622951507568359 |
| team | academic all-district | 0.9194386005401611 |
| team | review committee | 0.9182767868041992 |
| team | most valuable player | 0.9173357486724854 |
| team | breakaway | 0.916580319404602 |
| team | nhl teams | 0.9163451194763184 |
| husband | the widower | 0.9049073457717896 |
| husband | and my family | 0.9009094834327698 |
| husband | allegedly | 0.8996278643608093 |
| husband | her cat | 0.8968359231948853 |
| husband | & i | 0.8958627581596375 |
| future | full scale | 0.861075758934021 |
| future | economic future | 0.8609667420387268 |
| future | 20th-ranked | 0.8600075244903564 |
| future | the pipeline | 0.8599571585655212 |
| future | in the way | 0.8586423993110657 |
Thanks! Please also try the same with a smaller threshold.
Running DeepWalk with score threshold >= 0.001 (an edge is kept if at least one word of the DT pair is a target entity):
- number of edges: 4.5M
- number of nodes: 1M
Here are the top 5 similar words for a random selection of target words:
| Target Word | Top Similar | Score |
|---|---|---|
| consumer | american police | 0.9119635820388794 |
| consumer | home-office | 0.9119197726249695 |
| consumer | crossborder | 0.9113563299179077 |
| consumer | climate depot | 0.9112616181373596 |
| consumer | application-to-application | 0.9107170104980469 |
| house | cross breeding | 0.8913893103599548 |
| house | hereditaments | 0.8868686556816101 |
| house | cricket pavilion | 0.886616587638855 |
| house | borkman | 0.8865180015563965 |
| house | ancient demesne | 0.885732114315033 |
| currency | non-alcoholic beverage | 0.9314658641815186 |
| currency | foreign exchange controls | 0.9305030107498169 |
| currency | electroweak | 0.9282821416854858 |
| currency | baby step | 0.9277881979942322 |
| currency | perma-recession | 0.9275274872779846 |
| marijuana | mood-altering | 0.9447073936462402 |
| marijuana | semi-automatic handguns | 0.9442516565322876 |
| marijuana | secondary infections | 0.9442481398582458 |
| marijuana | telazol | 0.9437323808670044 |
| marijuana | tylenol 4 | 0.943712592124939 |
| september | 10 november | 0.9426672458648682 |
| september | 4to | 0.9422460794448853 |
| september | 1st of january | 0.9421955347061157 |
| september | novembris | 0.9418485164642334 |
| september | 29th of april | 0.941774845123291 |
| green | regreen | 0.895431399345398 |
| green | lanceolate-ovate | 0.8927381634712219 |
| green | thomas thomas | 0.8921957015991211 |
| green | nerved | 0.8914358615875244 |
| green | eastern amberwing | 0.8913373947143555 |
| japan | free japan | 0.9051642417907715 |
| japan | doretree.ps | 0.9007513523101807 |
| japan | sandai | 0.9007253050804138 |
| japan | deep waters | 0.8995295763015747 |
| japan | honen | 0.8988988399505615 |
| team | spending some time | 0.9124458432197571 |
| team | caretaker cabinet | 0.9120877981185913 |
| team | 16-3a | 0.9115908741950989 |
| team | kirsten olson | 0.9113165140151978 |
| team | never have i ever | 0.9111958146095276 |
| husband | christopher cassidy | 0.927372932434082 |
| husband | phjc | 0.9271283149719238 |
| husband | the beagle | 0.9264905452728271 |
| husband | rick husband | 0.9253169298171997 |
| husband | thomas cornwall | 0.9246828556060791 |
| future | pre-unibody | 0.8833345174789429 |
| future | in combo | 0.8828374147415161 |
| future | dloc | 0.8823624849319458 |
| future | future-style | 0.8822189569473267 |
| future | atv-3 | 0.8817929625511169 |
Running DeepWalk with score threshold >= 0.05 (an edge is kept if at least one word of the DT pair is a target entity):
- number of edges: 98723
- number of nodes: 18555
Here are the top 5 similar words for a random selection of target words:
| Target Word | Top Similar | Score |
|---|---|---|
| consumer | grower | 0.9421052932739258 |
| consumer | small businesses | 0.9418331384658813 |
| consumer | local residents | 0.9415016770362854 |
| consumer | legislator | 0.9412452578544617 |
| consumer | traveler | 0.9407768249511719 |
| house | held in | 0.8956223130226135 |
| house | built in | 0.8950538635253906 |
| house | situate | 0.8934398293495178 |
| house | the lodge | 0.8932690620422363 |
| house | old house | 0.8921085596084595 |
| currency | foreign currencies | 0.966582179069519 |
| currency | paper money | 0.9638655185699463 |
| currency | exchange rate | 0.961968183517456 |
| currency | banknote | 0.9612525701522827 |
| currency | local currency | 0.9610164165496826 |
| marijuana | valium | 0.9826380014419556 |
| marijuana | oxycontin | 0.9818234443664551 |
| marijuana | xanax | 0.9816401600837708 |
| marijuana | drug use | 0.9815698266029358 |
| marijuana | prostitution | 0.9814484119415283 |
| september | 11 september | 0.9567168951034546 |
| september | december 2 | 0.9558318257331848 |
| september | december 2006 | 0.9531803131103516 |
| september | august 25 | 0.9528341293334961 |
| september | july 5 | 0.9526382684707642 |
| green | lush | 0.8624919652938843 |
| green | potato salad | 0.8573336601257324 |
| green | milky | 0.8552415370941162 |
| green | big green | 0.8520586490631104 |
| green | wild rice | 0.8512767553329468 |
| japan | latin | 0.814257025718689 |
| japan | pornstar | 0.8112776279449463 |
| japan | pov | 0.8069230318069458 |
| japan | anal | 0.8066928386688232 |
| japan | upskirt | 0.8064553141593933 |
| team | go wrong | 0.9578524827957153 |
| team | the jets | 0.9575368762016296 |
| team | working | 0.9559540152549744 |
| team | partnership | 0.9557927846908569 |
| team | new deal | 0.9550840258598328 |
| husband | great-grandchild | 0.9026515483856201 |
| husband | my partner | 0.8973795771598816 |
| husband | youngest son | 0.8966842889785767 |
| husband | myself | 0.8943066596984863 |
| husband | five brothers | 0.8937571048736572 |
| future | for the future | 0.861904501914978 |
| future | prospects | 0.8586996793746948 |
| future | our future | 0.8576727509498596 |
| future | unauthorized | 0.8575726747512817 |
| future | and future | 0.8547124266624451 |
- Generate human-readable graphs (as below).
- Apply the new implementation of DW.
- Try to run NN.py and integrate some DW embeddings in place of the word embeddings.
Running weighted DeepWalk with score threshold >= 0.05 (an edge is kept if at least one word of the DT pair is a target entity):
- number of edges: 98723
- number of nodes: 18555
Here are the top 5 similar words for a random selection of target words:
| Target Word | Top Similar | Score |
|---|---|---|
| consumer | policymaker | 0.8881332874298096 |
| consumer | small businesses | 0.8821403980255127 |
| consumer | the employees | 0.878178060054779 |
| consumer | end user | 0.8779093027114868 |
| consumer | insurer | 0.8755150437355042 |
| house | dates back | 0.7990361452102661 |
| house | confine | 0.7966662049293518 |
| house | guesthouse | 0.7817363739013672 |
| house | brick building | 0.7798301577568054 |
| house | relocate | 0.778267502784729 |
| currency | national currency | 0.9425667524337769 |
| currency | foreign currencies | 0.9253323078155518 |
| currency | foreign currency | 0.9250749945640564 |
| currency | exchange rate | 0.9230431318283081 |
| currency | banknote | 0.9196319580078125 |
| marijuana | opiate | 0.9636061191558838 |
| marijuana | mdma | 0.9632755517959595 |
| marijuana | drug use | 0.9631142020225525 |
| marijuana | paraphernalia | 0.9625576734542847 |
| marijuana | methamphetamine | 0.9620795845985413 |
| september | december 18 | 0.8988827466964722 |
| september | 1 january | 0.8937128186225891 |
| september | 8 may | 0.8925288319587708 |
| september | june 2004 | 0.8919708728790283 |
| september | 20 december | 0.8914341926574707 |
| green | chicory | 0.7948091626167297 |
| green | herbal | 0.7785643339157104 |
| green | white rice | 0.7748297452926636 |
| green | endive | 0.7694216370582581 |
| green | mixed greens | 0.7671592235565186 |
| japan | gangbang | 0.7603057622909546 |
| japan | virgin | 0.7500285506248474 |
| japan | cum | 0.7496156096458435 |
| japan | pornstar | 0.74860018491745 |
| japan | porno | 0.7465837001800537 |
| team | the ravens | 0.8964948654174805 |
| team | bears | 0.8963637351989746 |
| team | collaborate | 0.8928627967834473 |
| team | ink | 0.8916406631469727 |
| team | berth | 0.890237033367157 |
| husband | infant son | 0.8357791304588318 |
| husband | a son | 0.8309561014175415 |
| husband | my husband | 0.8303523063659668 |
| husband | one son | 0.8279650807380676 |
| husband | grandchild | 0.8271342515945435 |
| future | successive | 0.7763746976852417 |
| future | adulthood | 0.7595061659812927 |
| future | prospects | 0.7573720216751099 |
| future | trading | 0.7554386854171753 |
| future | our future | 0.7527661323547363 |
Running weighted DeepWalk with score threshold >= 0.01 (an edge is kept if at least one word of the DT pair is a target entity):
- number of edges: 606272
- number of nodes: 94599
Here are the top 5 similar words for a random selection of target words:
| Target Word | Top Similar | Score |
|---|---|---|
| consumer | ratepayer | 0.7707870602607727 |
| consumer | privacy advocates | 0.7688286304473877 |
| consumer | refiner | 0.7598727941513062 |
| consumer | senior executives | 0.759567379951477 |
| consumer | american families | 0.7533906698226929 |
| house | the whitehouse | 0.728789746761322 |
| house | the only place | 0.7084541916847229 |
| house | fixed position | 0.7049554586410522 |
| house | mud huts | 0.70444655418396 |
| house | spending the night | 0.7022849321365356 |
| currency | commodities | 0.8456035256385803 |
| currency | fiat money | 0.834477961063385 |
| currency | functional currency | 0.833740770816803 |
| currency | payment system | 0.8276917934417725 |
| currency | inflation rate | 0.8271000981330872 |
| marijuana | anabolic steroids | 0.8885903358459473 |
| marijuana | tylenol | 0.886154294013977 |
| marijuana | illegal drugs | 0.8843708038330078 |
| marijuana | pain killers | 0.8840863108634949 |
| marijuana | addictive drug | 0.8835833668708801 |
| september | september 18th | 0.8800647258758545 |
| september | march 3 | 0.8782463073730469 |
| september | october 1994 | 0.8751147985458374 |
| september | september 22nd | 0.8749075531959534 |
| september | 30 december | 0.8726902604103088 |
| green | unflavored | 0.7157471179962158 |
| green | primeval | 0.7130362391471863 |
| green | glabrescent | 0.7099554538726807 |
| green | swampy | 0.7083736658096313 |
| green | wild green | 0.7005715370178223 |
| japan | pref | 0.7055085897445679 |
| japan | sex anal | 0.6905528903007507 |
| japan | kiddy | 0.6893953680992126 |
| japan | cartoon porn | 0.6866461038589478 |
| japan | tit fuck | 0.6777403354644775 |
| team | relay teams | 0.8330321311950684 |
| team | love-hate relationship | 0.8062129616737366 |
| team | in touch | 0.8061914443969727 |
| team | cutting ties | 0.8052033185958862 |
| team | rugby team | 0.8051971197128296 |
| husband | daughters of | 0.815686047077179 |
| husband | my wifes | 0.8116765022277832 |
| husband | loved mother | 0.7964452505111694 |
| husband | grandkid | 0.7821117043495178 |
| husband | son of james | 0.7787513136863708 |
| future | things to come | 0.713495135307312 |
| future | into the wind | 0.7111729383468628 |
| future | on-again | 0.7105178833007812 |
| future | sort-of | 0.7062144875526428 |
| future | off-again | 0.7049633264541626 |
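
The thread does not describe how the weighted variant was implemented. One common reading is that each random-walk step picks a neighbour with probability proportional to the DT score, instead of uniformly. A sketch under that assumption, treating the graph as undirected (also an assumption); the helper names are hypothetical:

```python
import random
from collections import defaultdict


def build_adjacency(edges):
    """Map each node to parallel lists of neighbours and edge weights (undirected)."""
    adjacency = defaultdict(lambda: ([], []))
    for source, target, weight in edges:
        for a, b in ((source, target), (target, source)):
            neighbours, weights = adjacency[a]
            neighbours.append(b)
            weights.append(weight)
    return adjacency


def weighted_walk(adjacency, start, length, rng):
    """Random walk where the next node is drawn proportionally to edge weight."""
    walk = [start]
    while len(walk) < length:
        neighbours, weights = adjacency.get(walk[-1], ([], []))
        if not neighbours:
            break  # dead end: isolated node
        walk.append(rng.choices(neighbours, weights=weights, k=1)[0])
    return walk


# Toy weighted edge list; scores stand in for DT similarity scores.
edges = [
    ("currency", "foreign currency", 0.9),
    ("currency", "cash", 0.1),
    ("cash", "paper money", 0.5),
]
adjacency = build_adjacency(edges)
rng = random.Random(0)  # fixed seed so runs are repeatable
walks = [weighted_walk(adjacency, "currency", 5, rng) for _ in range(3)]
for walk in walks:
    print(" -> ".join(walk))
```

The resulting walks would then be fed to a skip-gram model as sentences, exactly as in unweighted DeepWalk; only the sampling step changes.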