concepticon-data Mapping items previously unmapped when a new concept is added

When a new concept is added to Concepticon, we should go through all unmapped glosses in the other concept lists and check whether the new concept covers one or more of them. Examples are large numbers (which are now more frequent, especially with the inclusion of concept lists from psychology and not only historical linguistics) and cultural artifacts/events (e.g., a festival that was deemed too specific when the first list from its geographic area was added to Concepticon, but that turns out to be a recurrent entry in other word lists from the same area).

This is related to semi-automatic mapping and coverage analysis in general, last discussed here.

Apr 20 '18 06:04 tresoldi

Yes, good point. But also a strategy for getting rid of NANs would be useful, i.e., run through all NANs (unlinked glosses) and search for linkable things, etc. This would then be part of our CONTRIBUTING.md where we describe release procedures.

Apr 20 '18 06:04 LinguList

For milestone 2.5, we should make a list of the current NANs and see what we can do here.

May 26 '21 11:05 LinguList

That's what

concepticon notlinked

is for.

May 27 '21 15:05 xrotwang

Among the first results are already some good candidates for linking, I think:

2 Anonby-2018-1500-214: abstain: ABSTAIN FROM FOOD [303]
3 Anonby-2018-1500-244: obstruct: BLOCK (THE WAY) [2762]

May 27 '21 15:05 xrotwang

Btw.: running concepticon notlinked is also part of the release procedure, see https://github.com/concepticon/concepticon-data/blob/master/RELEASING.md

May 27 '21 15:05 xrotwang

Running ... notlinked takes quite some time (hours?). I'll let that run locally, and will post the results here for someone to determine which links should be made, and which ones discarded.

May 27 '21 15:05 xrotwang

Why does it take so long? Is it maybe because of the all vs. all mapping procedure used there?

May 27 '21 21:05 LinguList

Yep, it is because of the search strategy: it uses the lookup command, which loads a lot of data on each run.

May 27 '21 21:05 LinguList

To avoid this, the code can be modified to do the lookup with the data loaded once. I can look into that tomorrow. If this runs fast, one can do it more regularly.

May 27 '21 21:05 LinguList

It isn't horribly slow. Took about 3 hours. That would be ok for running once per release, I'd say. Results attached notlinked.txt

May 28 '21 04:05 xrotwang

I guess the simplest way to speed it up would be to only look at new concept sets, specified via some ID cut-off.

May 28 '21 04:05 xrotwang

Such a Concepticon_ID cut-off may also be the best way to make the result easier to interpret. Looking at it now, the potential matches for new concept sets (i.e., ones with ID > 3000) seem much easier to judge (often as correct matches) than the matches with lower IDs, which are mostly just the aggregation of false positives from the last couple of years.

May 28 '21 04:05 xrotwang

Just implemented the following: For each concept list, I limit the search space of potential matches to concept sets with an ID higher than the highest CONCEPTICON_ID linked in the list (assuming all concept sets with lower IDs were already available when the list was included). This leads to

$ time concepticon notlinked
INFO    concepticon/concepticon-data at /home/robert_forkel/projects/concepticon/concepticon-data
1 BeijingDaxue-1964-905-169: pomelo: GRAPEFRUIT OR POMELO [3804]
2 Blomberg-2020-160-101: love: LOVE (AFFECTION) [3834]
3 Chen-2019-61-59: astringent: ASTRINGENT [3837]
4 Dellert-2017-1016-822: put: PUT (IN A SITTING POSITION) [3832]
5 Desrochers-2010-330-167: drop: DROP (OF A LIQUID) [3748]
6 Desrochers-2010-330-168: pomegranate: POMEGRANATE [3732]
7 Gauchat-1925-480-74: to stumble: STUMBLE [3617]
8 Gauchat-1925-480-102: lost: LOST [3667]
9 Gauchat-1925-480-339: know: KNOW [3626]
10 Gauchat-1925-480-448: saw: SAW (SOMETHING) [3543]
11 Hartmann-2013-162-130: SCREAM: SCREAM (PRODUCE A CRY) [3809]
12 Key-2016-1310-950: calm (of sea): CALM (OF SEA) [3820]
13 Luniewska-2016-299-142: tie: TIE (NOUN) [3711]
14 Luniewska-2019-299-142: tie: TIE (NOUN) [3711]
15 Maciejewski-2016-100-104: pet: PET [3780]
16 Sawka-2019-201-140: wall: WALL [3830]
17 Stoll-1884-259-185: avocado: AVOCADO [3725]
18 Voorhoeve-1971-125-109: hornbill: HORNBILL [3260]
19 Wu-2020-150-122: close: NEAR (IN SPACE) [3735]

real	1m10,527s
user	1m10,336s
sys	0m0,180s
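
For readers who want the idea without reading the commit, here is a minimal sketch of the cut-off. It is not the pyconcepticon implementation: the tuple-based data model, the function names `candidate_conceptsets` and `notlinked_matches`, and the exact-gloss matching are all simplifying assumptions made for illustration.

```python
from typing import List, Optional, Tuple

# Hypothetical, simplified stand-ins for Concepticon data (assumptions):
# a concept is (label, linked concepticon_id or None),
# a concept set is (concepticon_id, gloss).
Concept = Tuple[str, Optional[int]]
ConceptSet = Tuple[int, str]


def candidate_conceptsets(
    concepts: List[Concept], conceptsets: List[ConceptSet]
) -> List[ConceptSet]:
    """Keep only concept sets with an ID above the highest ID linked in the list."""
    linked_ids = [cid for _, cid in concepts if cid is not None]
    cutoff = max(linked_ids, default=0)  # nothing linked yet: search everything
    return [cs for cs in conceptsets if cs[0] > cutoff]


def notlinked_matches(
    concepts: List[Concept], conceptsets: List[ConceptSet]
) -> List[Tuple[str, int]]:
    """Match unlinked labels against the reduced search space (exact gloss only)."""
    by_gloss = {
        gloss.lower(): cid
        for cid, gloss in candidate_conceptsets(concepts, conceptsets)
    }
    return [
        (label, by_gloss[label.lower()])
        for label, cid in concepts
        if cid is None and label.lower() in by_gloss
    ]


# Illustrative data only; the list content is made up.
conceptsets = [(303, "ABSTAIN FROM FOOD"), (1277, "HAND"),
               (3725, "AVOCADO"), (3830, "WALL")]
concepts = [("hand", 1277), ("avocado", None), ("abstain from food", None)]

print(notlinked_matches(concepts, conceptsets))  # [('avocado', 3725)]
```

Note that "abstain from food" is not reported here, because its concept set has a lower ID than the highest one linked in the list. That is the trade-off of the cut-off: older concept sets are only covered by the full search.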

May 28 '21 06:05 xrotwang

I'd say we keep both variants of the lookup, and run the "full" search only per release.

May 28 '21 06:05 xrotwang

Here's the implementation: https://github.com/concepticon/pyconcepticon/commit/8f1b0336aa4f2c9dc5d80b4e9a5471166f20fd3b

May 28 '21 06:05 xrotwang

Thanks for updating the code! I've been running notlinked periodically and picking out the clear cases. I'll add the additional mappings you've provided.

May 28 '21 07:05 chrzyki

I looked into this, but it still takes some time, and I wonder why. The API uses some kind of cache, so once a mapping has been loaded, later calls should reuse it instead of loading it again. It turns out that this does not happen for some reason (my assumption is that the dictionary is somehow not checked), which could explain why it takes so long even when the lookup checks all words of a concept list at once:

    for _, cl in sorted(args.repos.conceptlists.items(), key=lambda p: p[0]):
        print(cl.id)
        concepts = []
        for concept in sorted(
                cl.concepts.values(),
                key=lambda p: int(re.match('([0-9]+)', p.number).groups()[0])):
            if not concept.concepticon_id:
                concepts += [concept]
        print('found {0} concepts'.format(len(concepts)))
        for i, matches in enumerate(args.repos.lookup([concept.label for
                concept in concepts])):
            if matches:
                print("{0} {1.id}: {1.label}: {2[0]} {2[1]}".format(
                    i, concepts[i], list(matches)[0][2:4]))

May 28 '21 10:05 LinguList

My comment was not referring to the new version which limits concepticon IDs. But I think it is still valid, since the bottleneck is the loading of the mapping data, even if you do not use the full search.
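
To make the caching point concrete, here is a minimal sketch of what "load the mapping data once and reuse it" could look like. The loader `load_mapping`, the helper `lookup_one`, and the tiny mapping are hypothetical and only stand in for the real, much larger mapping data; this is not how pyconcepticon is actually structured.

```python
from functools import lru_cache
from typing import Dict, Optional

CALLS = {"disk_reads": 0}  # counts simulated reads of the mapping data


@lru_cache(maxsize=None)
def load_mapping(language: str) -> Dict[str, int]:
    """Hypothetical loader; the body runs only on the first call per language."""
    CALLS["disk_reads"] += 1
    return {"avocado": 3725, "wall": 3830}


def lookup_one(label: str, language: str = "en") -> Optional[int]:
    """Look up a single label; repeated calls reuse the cached mapping."""
    return load_mapping(language).get(label.lower())


results = [lookup_one(label) for label in ["avocado", "wall", "hornbill"]]
print(results, CALLS["disk_reads"])  # [3725, 3830, None] 1
```

If the cache were bypassed, the counter would go up with every call, which is the kind of behaviour that would explain the long runtime.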

May 28 '21 10:05 LinguList

I found the problem, @xrotwang. The functions in pyconcepticon.api confuse concept_map2 (the slow all-to-all search) with concept_map (the fast search). The default is "full_search=False", but the line(s):

        cfunc = concept_map if full_search else concept_map2

should be reversed:

        cfunc = concept_map2 if full_search else concept_map

With this change, the search runs without a problem.

May 28 '21 10:05 LinguList

And one more update:

def run(args):
    i = 0
    concepts = []
    for _, cl in sorted(args.repos.conceptlists.items(), key=lambda p: p[0]):
        for concept in sorted(
                cl.concepts.values(),
                key=lambda p: int(re.match('([0-9]+)', p.number).groups()[0])):
            if not concept.concepticon_id:
                concepts += [concept]
    for j, matches in enumerate(args.repos.lookup([c.label for c in concepts])):
        if matches:
            candidates = sorted(matches, key=lambda x: x[-1])
            cid, cgl = candidates[0][2:4]
            i += 1
            print('{0} {1.id}: {1.label}: {2} [{3}]'.format(i, concepts[j], cid,
                cgl))

This assembles the data only once, which also makes the mapping faster.
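
The effect of assembling everything first can be shown with a toy example. The `lookup` function below is a stand-in written for this illustration, under the assumption that each call pays a fixed cost for loading the mapping data; it is not the pyconcepticon API.

```python
from typing import Dict, Iterable, Iterator, Set

LOADS = {"count": 0}  # counts how often the (simulated) mapping data is loaded


def load_mapping() -> Dict[str, int]:
    """Stand-in for the expensive step: loading the gloss-to-ID mapping data."""
    LOADS["count"] += 1
    return {"avocado": 3725, "wall": 3830, "pet": 3780}


def lookup(labels: Iterable[str]) -> Iterator[Set[int]]:
    """Toy lookup: load the mapping once per call, then yield matches per label."""
    mapping = load_mapping()
    for label in labels:
        key = label.lower()
        yield {mapping[key]} if key in mapping else set()


labels = ["avocado", "unknown gloss", "wall"]

# Variant 1: one lookup call per label, so the mapping is loaded len(labels) times.
LOADS["count"] = 0
per_label = [next(lookup([label])) for label in labels]
loads_per_label = LOADS["count"]

# Variant 2: collect all labels first, then a single call loads the mapping once.
LOADS["count"] = 0
batched = list(lookup(labels))
loads_batched = LOADS["count"]

print(loads_per_label, loads_batched)  # 3 1
print(per_label == batched)            # True
```

Both variants return the same matches; only the number of loads differs, which is why a single big lookup scales better with the number of unlinked glosses.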

May 28 '21 10:05 LinguList

Should I make a PR on this for pyconcepticon?

May 28 '21 10:05 LinguList

Hm. Can't really confirm your findings. Passing full_search=True in the notlinked command should have the same effect, i.e. choosing the other mapping function. But this doesn't result in noticeable speedups.

May 28 '21 10:05 xrotwang

But the functions ARE swapped. the cmap2 is not the full search, the other one is in glosses.py.

May 28 '21 10:05 LinguList

Also, the line you mentioned already looks like you suggest: https://github.com/concepticon/pyconcepticon/blame/8f1b0336aa4f2c9dc5d80b4e9a5471166f20fd3b/src/pyconcepticon/api.py#L285

May 28 '21 10:05 xrotwang

And with the function change above I get:

$ time concepticon notlinked
real	0m9,610s
user	0m9,462s
sys	0m0,134s

And I get some 4000 entries:

$ concepticon notlinked | wc
   4094   25834  215324

May 28 '21 10:05 LinguList

Ah, what happened in my version then? Sorry!

May 28 '21 10:05 LinguList

But the strategy of assembling all concepts first, and then comparing against concepticon with one big lookup should be indisputable, right?

May 28 '21 10:05 LinguList

Yes, I guess that makes sense. Let me try to fold that into the code.

May 28 '21 11:05 xrotwang

Super! One more idea: if we allowed specifying a concept list, we could even use this to check, before a new list is submitted, whether all possible cases have been accounted for. Users could then post the output along with their submission. Or is that going too far?

May 28 '21 11:05 LinguList

How long does the command (with your changes) take on your machine?

May 28 '21 11:05 xrotwang