DKN

how do you implement the procedure of entity linking

Open sewellxie opened this issue 5 years ago • 9 comments

Dr. Wang, thank you so much for your wonderful work, which combines KGs and recommender systems. It seems that disambiguation based on entity linking is not implemented in this code. Does 'kg.txt' represent a KG that has already been disambiguated?

sewellxie avatar Jun 24 '19 09:06 sewellxie

Hi! Entity linking was not done by us but contained in the original dataset. The "kg.txt" is already disambiguated.

hwwang55 avatar Jun 24 '19 17:06 hwwang55

Thank you for your reply! Another question: the paper describes each word in a news title as being associated with one entity in the KG. But in raw_train.txt and raw_test.txt, a news title has only one, two, or a few entities, so the number of entities is not the same as the number of words in the title. Do you simply pad them to the same length, or do you have another way of handling this that I have not noticed? Thanks!

sewellxie avatar Jun 25 '19 13:06 sewellxie

Yes, they are padded with zeros.

hwwang55 avatar Jun 25 '19 16:06 hwwang55
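
For readers following the thread, the zero-padding described above can be sketched roughly as follows. This is a minimal illustration, not the repository's preprocessing code: the function `pad_title`, the `word2entity` mapping, and the toy vocabulary are all hypothetical, and the length of 10 is only an assumption based on the 10-slot rows quoted later in this thread.

```python
from typing import Dict, List, Tuple

MAX_TITLE_LENGTH = 10  # assumption: the encoded rows quoted in this thread have 10 slots


def pad_title(
    words: List[str],
    word2index: Dict[str, int],
    word2entity: Dict[str, int],
    max_len: int = MAX_TITLE_LENGTH,
) -> Tuple[List[int], List[int]]:
    """Return (word_ids, entity_ids), both zero-padded to max_len.

    Position i of entity_ids holds the entity linked to the word at
    position i, or 0 when that word is not part of any entity mention.
    """
    word_ids = [0] * max_len
    entity_ids = [0] * max_len
    for pos, word in enumerate(words[:max_len]):
        word_ids[pos] = word2index.get(word, 0)
        entity_ids[pos] = word2entity.get(word, 0)
    return word_ids, entity_ids


# Hypothetical toy vocabulary, not taken from the dataset.
word2index = {"bruce": 1, "springsteen": 2, "song": 3}
word2entity = {"bruce": 7, "springsteen": 7}  # both words belong to one entity

word_ids, entity_ids = pad_title(["bruce", "springsteen", "song"], word2index, word2entity)
print(word_ids)    # [1, 2, 3, 0, 0, 0, 0, 0, 0, 0]
print(entity_ids)  # [7, 7, 0, 0, 0, 0, 0, 0, 0, 0]
```

Keeping the two lists the same length means the word at position i and its entity (or 0) can be looked up together, which is how a title with only one or two entities still lines up with all of its words.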

Thanks! I have been reading your papers recently, including RippleNet, multi-task learning for recommender systems, and others. I hope to have further discussion.

sewellxie avatar Jun 26 '19 13:06 sewellxie

@hwwang55 Hi, Dr. Wang, thanks for sharing the code! I am curious about the entity linking method you used. Can you share some ideas? It seems to affect the recommendation results very much. Thanks very much!

feng-1985 avatar Jul 09 '19 09:07 feng-1985

@bifeng Hi! I'm afraid I can't help much since I'm not working in the area of entity linking. I suggest searching for "entity linking survey" on Google Scholar; that might be helpful. Thanks!

hwwang55 avatar Jul 09 '19 16:07 hwwang55

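@hwwang55 Hello, Professor Hongwei Wang! I have recently been studying the DKN code in depth. In the news_process step I found a problem: in many cases the words of a news title in raw_train.txt and the encoding of that title in train.txt do not match in either count or position (the same holds for the entity encodings). Here are some examples.

The first five lines of raw_train.txt:

0 tautog bite coming strong 0 36136:Tautog
0 bruce springsteen song magically rejected harry potter 0 331:Bruce Springsteen
0 watch tom cruise recreates iconic movie scenes james corden action packed minutes 0 3410:Tom Cruise
0 chuck says cool hall fame tupac 0 2808:Chuck D
0 big weather changes bethel dramatic change temps windy rain 0 17431:Bethel

The first five lines of train.txt:

0 1,2,3,0,0,0,0,0,0,0 0,0,0,0,0,0,0,0,0,0 0
0 4,5,6,7,8,0,0,0,0,0 2,2,0,0,0,0,0,0,0,0 0
0 9,10,11,12,13,14,15,16,17,18 0,3,3,0,0,0,0,0,0,0 0
0 21,22,23,24,25,26,0,0,0,0 4,0,0,0,0,0,0,0,0,0 0
0 27,28,29,30,31,32,0,0,0,0 0,0,0,0,0,0,0,0,0,0 0

Take the first line, for example: the news title has 4 words, but the encoding has only 3 non-zero codes. Likewise, in the second line the title has 7 words, but the encoding has only 5 non-zero codes. This happens in a sizable share of the rows. Is this normal? Thank you for your time!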

hannlp avatar Aug 07 '19 18:08 hannlp

@1140325971 Hi, this is normal. We set a word-frequency threshold, and only words whose frequency is above this threshold are considered. Thanks!

hwwang55 avatar Aug 07 '19 19:08 hwwang55
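
To illustrate why a frequency threshold produces fewer non-zero codes than title words, here is a small sketch. It is not the repository's code: `build_vocab`, `encode_title`, the toy titles, and the threshold value are all hypothetical, and treating a word as kept when its count is at least `min_freq` is an assumption about how the threshold is applied.

```python
from collections import Counter
from typing import Dict, List


def build_vocab(titles: List[str], min_freq: int) -> Dict[str, int]:
    """Assign ids (starting at 1) to words seen at least min_freq times."""
    counts = Counter(word for title in titles for word in title.split())
    vocab: Dict[str, int] = {}
    for title in titles:
        for word in title.split():
            if counts[word] >= min_freq and word not in vocab:
                vocab[word] = len(vocab) + 1  # id 0 is reserved for padding
    return vocab


def encode_title(title: str, vocab: Dict[str, int], max_len: int = 10) -> List[int]:
    """Skip out-of-vocabulary words, then zero-pad to max_len."""
    ids = [vocab[word] for word in title.split() if word in vocab][:max_len]
    return ids + [0] * (max_len - len(ids))


# Hypothetical toy corpus: "strong" and "soon" each occur only once.
titles = ["tautog bite coming strong", "tautog bite coming soon"]
vocab = build_vocab(titles, min_freq=2)

encoded = encode_title(titles[0], vocab)
print(encoded)  # [1, 2, 3, 0, 0, 0, 0, 0, 0, 0]
```

In this sketch the rare word is skipped rather than replaced by a placeholder, so the remaining ids shift left. That matches the pattern in the rows quoted above, where a 4-word title yields 3 non-zero codes.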

@hwwang55 Thank you, Professor. Wishing you all the best in your work!

hannlp avatar Aug 08 '19 07:08 hannlp