openWordnet-PT

check consistency

Open arademaker opened this issue 8 years ago • 2 comments

  1. Check the PWN data against http://wordnet.princeton.edu/wordnet/man/wnstats.7WN.html to make sure we did not lose anything.
  2. Repeat the consistency check in the RDF.
  3. Apply Francis Bond's patch below and update wn30.ttl.
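A minimal sketch of how item 1 could be checked, assuming the synset identifiers exported from our data end in a part-of-speech suffix such as `00001740-n` (an assumption about the id format, not something confirmed in this issue). The helper names `count_by_pos` and `count_mismatches` are hypothetical; the expected counts would be copied by hand from the wnstats page.

```python
from collections import Counter


def count_by_pos(synset_ids):
    """Count synsets per part of speech from ids such as '00001740-n'."""
    return Counter(sid.rsplit("-", 1)[1] for sid in synset_ids)


def count_mismatches(expected, actual):
    """Return {pos: (expected, actual)} for every POS whose counts differ."""
    return {pos: (expected.get(pos, 0), actual.get(pos, 0))
            for pos in sorted(set(expected) | set(actual))
            if expected.get(pos, 0) != actual.get(pos, 0)}


# Illustrative ids and counts only; real values come from the data and wnstats.
ids = ["00001740-n", "00001930-n", "00001740-v"]
print(count_mismatches({"n": 2, "v": 2}, count_by_pos(ids)))
```

An empty result would mean the per-POS totals agree. Note that the wnstats page reports adjectives as a single class, so ids that mark satellites separately would need to be merged before comparing.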

Just to follow up on this: currently, if you exclude domains, there are 5 entries in PWN 3.0 where two relations link the same pair of synsets, all arguably unnecessary, and one a known bug. These are all fixed in PWN 3.1. We will add a test for this in the Open Multilingual Wordnet.

In three cases there is both an 'also_see' and a 'similar_to', and we should just keep the 'similar_to'.

Synset('inattentive.a.01'):
forgetful.s.03 also_sees
forgetful.s.03 similar_tos

Synset('chromatic.a.03'):
chestnut.s.01 also_sees
chestnut.s.01 similar_tos

Synset('fertile.a.01'):
conceptive.s.01 also_sees
conceptive.s.01 similar_tos

In one case we have both an 'entailment' and a 'hypernym', and we should just keep the 'hypernym'.

Synset('breathe.v.01'):
inhale.v.02 entailments
inhale.v.02 hyponyms
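The two "keep one, drop the other" rules above could be applied mechanically. The following is only a sketch on plain (target, relation) string pairs; `drop_redundant` and `PREFERENCES` are hypothetical names, and the second rule is written as 'hyponyms' over 'entailments' because that is how the hypernym link shows up when listed from the side of breathe.v.01.

```python
def drop_redundant(links, preferences):
    """Remove the less-preferred relation when a target has both.

    links: list of (target, relation) pairs for one synset.
    preferences: list of (keep, drop) relation-name pairs.
    """
    present = set(links)
    redundant = {(t, drop)
                 for (t, r) in links
                 for (keep, drop) in preferences
                 if r == keep and (t, drop) in present}
    return [link for link in links if link not in redundant]


# Assumed encoding of the rules stated above (relation names as in the script).
PREFERENCES = [("similar_tos", "also_sees"), ("hyponyms", "entailments")]

breathe = [("inhale.v.02", "entailments"), ("inhale.v.02", "hyponyms")]
print(drop_redundant(breathe, PREFERENCES))
```

A relation is only dropped when the preferred one is present for the same target, so synsets with an 'also_see' or 'entailment' alone are left untouched.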

And the bug: 'restrain' has 'inhibit' as both its 'hypernym' and its 'hyponym'.

Synset('restrain.v.01'):
inhibit.v.04 hypernyms
inhibit.v.04 hyponyms
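This kind of contradiction is easy to test for separately from the general duplicate check. A small sketch, again on plain (target, relation) string pairs, with the hypothetical helper name `hypernym_and_hyponym`:

```python
def hypernym_and_hyponym(links):
    """Return targets listed as both hypernym and hyponym of one synset."""
    hypers = {t for (t, r) in links
              if r in ("hypernyms", "instance_hypernyms")}
    hypos = {t for (t, r) in links
             if r in ("hyponyms", "instance_hyponyms")}
    return sorted(hypers & hypos)


restrain = [("inhibit.v.04", "hypernyms"), ("inhibit.v.04", "hyponyms")]
print(hypernym_and_hyponym(restrain))
```

Any non-empty result points at a two-node cycle in the hypernym hierarchy, which should never occur.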

If you also allow domains, then there are quite a few more (61), e.g.

Synset('knock_on.n.01'):
play.n.03 hypernyms
rugby.n.01 part_holonyms
rugby.n.01 topic_domains

Synset('ball_game.n.01'):
baseball.n.01 hyponyms
baseball.n.01 topic_domains
field_game.n.01 hypernyms

Synset('bioterrorism.n.01'):
terrorism.n.01 hypernyms
terrorism.n.01 topic_domains

I attach the full list of synsets with duplicates (including domains).

P.S. Here is the script used to detect these:

from nltk.corpus import wordnet as pwn

# relations with domains (uncomment to include domain relations)
# relations = ['also_sees', 'attributes', 'causes', 'entailments',
#              'hypernyms', 'hyponyms', 'instance_hypernyms', 'instance_hyponyms',
#              'member_holonyms', 'member_meronyms', 'part_holonyms',
#              'part_meronyms', 'region_domains', 'similar_tos',
#              'substance_holonyms', 'substance_meronyms', 'topic_domains',
#              'usage_domains']

# relations without domains
relations = ['also_sees', 'attributes', 'causes', 'entailments',
'hypernyms', 'hyponyms', 'instance_hypernyms', 'instance_hyponyms',
'member_holonyms', 'member_meronyms', 'part_holonyms',
'part_meronyms', 'similar_tos', 'substance_holonyms',
'substance_meronyms']

for s in pwn.all_synsets():
    ttt = []  # every (target synset, relation name) pair linked from s
    for r in relations:
        tt = getattr(s, r)()
        ttt += [(t, r) for t in tt]
    # report s if some target synset is reached more than once
    justt = [t for (t, r) in ttt]
    if len(justt) > len(set(justt)):
        print("{}:\n{}\n\n".format(str(s),
                                   "\n".join(["{}\t{}".format(t.name(), r)
                                              for (t, r) in sorted(ttt)])))
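The script prints every link of an offending synset, including the unproblematic ones. A possible variant that reports only the targets reached through more than one distinct relation is sketched below; `duplicated_targets` is a hypothetical helper working on plain (target name, relation name) pairs, which could be built from the `ttt` list in the script via `t.name()`.

```python
from collections import defaultdict


def duplicated_targets(links):
    """Map each target linked by more than one relation to those relations."""
    by_target = defaultdict(set)
    for target, relation in links:
        by_target[target].add(relation)
    return {t: sorted(rs) for t, rs in by_target.items() if len(rs) > 1}


knock_on = [("play.n.03", "hypernyms"),
            ("rugby.n.01", "part_holonyms"),
            ("rugby.n.01", "topic_domains")]
print(duplicated_targets(knock_on))
```

Unlike the length comparison in the script, this groups by distinct relation names, so the same (target, relation) pair listed twice would not be flagged.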

arademaker, May 25 '16 08:05

dupl-rel-pwn30.txt

More at https://lists.princeton.edu/cgi-bin/wa?A2=ind1603&L=wn-users&P=R86&1=wn-users&9=A&J=on&d=No+Match%3BMatch%3BMatches&z=4

arademaker, May 25 '16 08:05

http://www.swi-prolog.org/pldoc/man?section=SyntaxAndSemantics

Can we use this to check the consistency of the RDF? (@fcbr's suggestion)

arademaker, May 08 '17 23:05