
Dataset with the entity label

Open john012343210 opened this issue 3 years ago • 3 comments

Hello author, in the paper, the following part is mentioned.

"Considering that DBpedia, Wikidata and YAGO collect data from very similar sources (mainly, Wikipedia), the aligned entities usually have identical labels. They would become “tricky” features for entity alignment and influence the evaluation of real performance. According to the suggestion in [95], we delete entity label"

  1. May I know if this label refers to the type of the entity? (For example, the type of Michael_Jordan is Person.)

  2. Do you still have the dataset with all the labels? I would like to see whether these labels could help embedding in some interesting way. If not, I might have to crawl DBpedia and Wikidata myself.

Thanks!

john012343210 avatar Nov 17 '20 08:11 john012343210

Hi,

Sorry for my late reply.

  1. Entity labels mean the names of entities, not the types.
  2. For example, the labels of English DBpedia can be downloaded from http://downloads.dbpedia.org/2016-10/core/labels_en.ttl.bz2.

Note that, as the multilingual versions of DBpedia are extracted from the same source (Wikipedia), most of the aligned entities have the same name. In this case, using names to align these entities may achieve high accuracy. But in real entity alignment scenarios, such as aligning an English KG to a low-resource one, or cases where entity names are not available, methods that rely on entity names may not work well. So, we do not recommend using entity names. More robust features and methods for entity alignment are worth exploring.
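For reference, that dump is in N-Triples format, one `rdfs:label` statement per line. A minimal sketch for extracting (entity, label) pairs from it (the exact line layout assumed here follows the standard DBpedia N-Triples dumps; adjust the regex if your dump differs):

```python
import bz2
import re

# Each line of labels_en.ttl is an N-Triples statement such as:
#   <http://dbpedia.org/resource/Michael_Jordan> <http://www.w3.org/2000/01/rdf-schema#label> "Michael Jordan"@en .
LABEL_LINE = re.compile(
    r'^<([^>]+)> <http://www\.w3\.org/2000/01/rdf-schema#label> "(.*)"@en \.$'
)

def parse_labels(lines):
    """Yield (entity_uri, label) pairs from N-Triples label statements."""
    for line in lines:
        match = LABEL_LINE.match(line.strip())
        if match:
            yield match.group(1), match.group(2)

def load_label_dump(path):
    """Stream a bzip2-compressed label dump such as labels_en.ttl.bz2."""
    with bz2.open(path, mode="rt", encoding="utf-8") as f:
        yield from parse_labels(f)
```

Streaming via `bz2.open` avoids decompressing the whole dump on disk; comment lines and non-English literals are simply skipped by the regex.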

sunzequn avatar Nov 23 '20 11:11 sunzequn

Hi, you have done great work; it can be regarded as a foundation of the entity alignment field, and everyone uses your dataset. As you said, may I think of the new version of the dataset (v2.0) as being constructed by removing entity-name information from the attribute triples? In my experiments, the performance of models that use entity name information (such as RDGCN and MultiKE) drops a lot on it, while models that do not use name information perform similarly to the results in the paper. Is this understanding correct?
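For concreteness, the construction described above could be sketched as a filter over attribute triples. This is an illustration only; the set of name-bearing predicates below is an assumption and may need extending for a particular KG (e.g. `foaf:name`, `skos:altLabel`):

```python
# Predicates assumed to carry entity names; extend as needed for your KG.
NAME_PREDICATES = {
    "http://www.w3.org/2000/01/rdf-schema#label",
    "http://www.w3.org/2004/02/skos/core#prefLabel",
}

def drop_name_triples(attribute_triples):
    """Remove attribute triples whose predicate carries an entity name.

    attribute_triples: iterable of (subject, predicate, value) tuples.
    """
    return [
        (s, p, v)
        for (s, p, v) in attribute_triples
        if p not in NAME_PREDICATES
    ]
```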

MrYxJ avatar Dec 21 '20 03:12 MrYxJ

Hi @MrYxJ ,

Apologies for the late reply again. Indeed, you are right. To elaborate a bit more, we would also point out that there could be issues of fair comparison and test data leakage in cross-lingual EA in some prior studies that incorporate entity names. This is not essentially due to embedding entity names, but due to additional cross-lingual supervision labels/signals. E.g., in the original RDGCN and GCN-JE papers, the authors used Google Translate to translate the surface forms of entities in all other languages into English, and then initialized the entity embeddings in their models with pre-trained word embeddings of the translated entity names. This is problematic in two ways:

  1. Developing the MT system obviously used far more cross-lingual training data, which could have subsumed many of the test labels here. Hence, incorporating such cross-lingual signals, instead of using only the training labels in the benchmarks, could have (indirectly) leaked test labels into training.
  2. In terms of fair comparison, the majority of prior models (partially listed in https://github.com/THU-KEG/Entity_Alignment_Papers) are trained from scratch with only the training labels in the benchmarks. In that case, for the few works that incorporated additional expensive cross-lingual supervision signals (especially Google Translate), it is hard to tell whether their reported "better performance" is due to translation or indeed due to the new techniques claimed in their works. As you have observed, removing entity names and disabling MT caused a significant drop in performance for some of those systems.

Regarding point 2 above, it is unfortunate to see that a few more recent works are (what we believe, erroneously) following such an unfair evaluation protocol, which we strongly advise against. In fact, a few other studies have already recognized this issue and have set good examples by separating w/ and w/o MT into two evaluation settings (e.g., the HMAN and MRAEA papers). Some works have also explicitly pointed out this issue (e.g., the AttrGCN, JEANS, and EVA papers). We will continue to clarify this fair-comparison issue in future publications and releases of OpenEA.

Note: the above issue only applies to cross-lingual EA. For monolingual EA, where training monolingual embeddings or directly comparing entity names requires no cross-lingual training labels, using entity names does not violate fair comparison. That said, it is definitely worth examining how well a system performs without entity names and with only structural information, since in many KBs (especially biomedical ones) there may be no meaningful entity names.
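As an illustration of why names alone are such a strong signal in the monolingual case, here is a trivial baseline that aligns entities by normalized exact name match. This is a sketch for discussion, not a method from the paper or the OpenEA toolkit:

```python
def normalize(name):
    """Case-fold and collapse underscores/whitespace for comparison."""
    return " ".join(name.replace("_", " ").split()).casefold()

def align_by_name(names_kg1, names_kg2):
    """Align entities across two KGs by normalized exact label match.

    names_kg1 / names_kg2 map entity id -> label string.
    Returns a dict of kg1 entity -> kg2 entity, keeping only
    unambiguous matches (exactly one candidate on the kg2 side).
    """
    index = {}
    for ent, label in names_kg2.items():
        index.setdefault(normalize(label), []).append(ent)
    alignment = {}
    for ent, label in names_kg1.items():
        candidates = index.get(normalize(label), [])
        if len(candidates) == 1:
            alignment[ent] = candidates[0]
    return alignment
```

On name-preserving DBpedia/Wikidata/YAGO pairs, a matcher like this can look deceptively strong, which is exactly the "tricky feature" concern quoted at the top of this thread.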

-Muhao

muhaochen avatar Jan 27 '21 20:01 muhaochen