
NEL: embedding-based & fuzzy lexical candidate selection

Open rmitsch opened this issue 3 years ago • 13 comments

Goals

Implement a working draft that picks entity candidates based on:

  • similarities in the embedding space between entity description and text
  • fuzzy lexical similarity between entity name and text

This description is WIP and will be updated.

Description

Selects entity candidates based on similarities to the entity vectors in the embedding space, with some postprocessing.
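
A hedged sketch of the two signals named in the goals (the function names and inputs are illustrative, not this PR's actual API):

```python
# Illustrative sketch of the two candidate-selection signals described above;
# function names and inputs are assumptions, not this PR's actual API.
from difflib import SequenceMatcher

import numpy


def embedding_score(context_vector: numpy.ndarray, desc_vector: numpy.ndarray) -> float:
    """Cosine similarity between the mention context and an entity description vector."""
    denom = numpy.linalg.norm(context_vector) * numpy.linalg.norm(desc_vector)
    return float(context_vector @ desc_vector / denom) if denom else 0.0


def lexical_score(mention: str, entity_name: str) -> float:
    """Fuzzy lexical similarity between the mention surface form and the entity name."""
    return SequenceMatcher(None, mention.lower(), entity_name.lower()).ratio()
```

Candidates could then, for instance, be ranked by a weighted combination of the two scores before postprocessing.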

Types of change

New feature.

Checklist

  • [x] I confirm that I have the right to submit this contribution under the project's MIT license.
  • [ ] I ran the tests, and all new and existing tests passed.
  • [ ] My changes don't require a change to the documentation, or if they do, I've added all required information.

rmitsch · Jul 07 '22

> Still running this, but ran into a few minor issues that need fixing.

Thanks for the feedback! I'm currently working on the Mewsli-9 dataset - will clean up when that's done (aiming to merge afterwards).

rmitsch · Jul 07 '22

Couple of small things:

The dataset parsing (mewsli, wiki parsing) doesn't seem to have outputs defined. Is that hard because they're large collections of files? Is there some placeholder we can use to get some basic caching?
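
In case it helps, one placeholder pattern (a hedged sketch; the marker filename is made up) is to have the parsing script write a small sentinel file on success and declare that file as the step's output in project.yml, so the step can be skipped on re-runs:

```python
# Hypothetical sentinel-file pattern for caching a step whose real output is a
# large collection of files: write one small marker file when parsing finishes
# and list it as the step's output in project.yml. The marker name is made up.
from pathlib import Path


def mark_parsing_done(output_dir: Path) -> None:
    output_dir.mkdir(parents=True, exist_ok=True)
    (output_dir / ".parsed_ok").write_text("done")
```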

delete_wiki_db deletes a file in the assets directory. It would help if there was a note about where that file was generated, or if it was listed as an output somewhere.

The README seems to be out of date and needs to be regenerated. It currently still includes a note about SVN that's really aimed at us; it should be rephrased so that users are clear it doesn't apply to them.

For some of the tasks it could help to note "this can take a long time" or similar.

One thing I'm not sure about is that one of your tasks calls `pip install -r requirements.txt`. I think generally we expect that to happen outside of the project flow, since you can't run the project file without spaCy installed. On the other hand, I can see how having it in the project file can make updates simpler.

polm · Jul 19 '22

> Couple of small things:

Sorry I didn't get back to you earlier! Totally missed this last comment. Cleaned up project.yml and the README.

  • Outputs for the wiki parsing have been added.
  • Added a comment to delete_wiki_db. Also, the corresponding file is now listed as an output of a previous step and as a dependency for delete_wiki_db.
  • The README was updated: the SVN comment was rephrased as a note that SVN is required to download Mewsli-9.
  • Added some "can take a long time" comments.
  • Removed the dependency installation step.

rmitsch · Jul 26 '22

On a different note: running the entire project has obviously been infeasible since the addition of the Wiki dump download and parsing steps. Other steps could be limited in their runtime rather easily, but we'd still need the complete downloaded Wiki dump, unfortunately. I'm unsure how to proceed here. Any suggestions?

rmitsch · Jul 26 '22

Can we directly provide the output of processing the dataset somehow? If hosting isn't difficult that would be easiest to set up.

A more principled thing would be to pick a subset, like "people born after 1900" or something. Picking a reasonable subset could be hard though.

polm · Jul 27 '22

> Can we directly provide the output of processing the dataset somehow? If hosting isn't difficult that would be easiest to set up.
>
> A more principled thing would be to pick a subset, like "people born after 1900" or something. Picking a reasonable subset could be hard though.

Yes, we could host the .sqlite file with the parsed data. In total it's around 14 GB. I'll look into figuring out a reasonable subset.
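
For reference, inspecting such a file would be straightforward; a minimal sketch (the path and table name are assumptions, not this project's actual schema):

```python
# Hypothetical peek at the parsed-dump SQLite file; the path and table name
# below are assumptions, not this project's actual schema.
import sqlite3

with sqlite3.connect("assets/wiki.sqlite") as conn:
    (n_entities,) = conn.execute("SELECT count(*) FROM entities").fetchone()
    print(f"{n_entities} parsed entities")
```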

rmitsch · Jul 27 '22

On second thought, this would bypass testing the parsing script. It'd be better if we started with small Wiki dumps and went from there. One option is to

  • locally run through both dumps,
  • pick a limited number of entries - either randomly or in a more principled fashion (sketched below),
  • copy them into a new archive,
  • upload the archive so it's publicly available,
  • in the test suite: download the smaller test dumps and run all steps (I guess I'd have to adjust project.yml for the test case).

This seems like a bother, but is perhaps the only way to test the entire project. Unless @svlandeg has a better idea?
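
For the "pick a limited number of entries" step, a rough sketch, assuming roughly one JSON entry per line (as in Wikidata JSON dumps); paths and the entry count are illustrative:

```python
# Rough sketch of extracting a small test dump: stream the compressed dump and
# copy the first N entries into a new, much smaller archive. Assumes roughly
# one JSON entry per line, as in Wikidata JSON dumps; paths are illustrative.
import bz2
from itertools import islice

N_ENTRIES = 1_000

with bz2.open("assets/wikidata.json.bz2", "rt") as src, \
        bz2.open("assets/wikidata_test.json.bz2", "wt") as dst:
    dst.writelines(islice(src, N_ENTRIES))
```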

rmitsch · Jul 27 '22

I'm quite fond of having an alternative, much smaller dataset for testing purposes, even if it's perhaps unrealistically small. Not just for the CI, but also to allow people to have a quick look and run through the project without first having to sit around for hours waiting for the download to finish.

We could pick a very narrow domain like Paul mentioned, and then your outline, Raphael, does sound sensible to me!

svlandeg · Jul 27 '22

Suggestion:

  • Pick every Wiki entity mentioning at least one of a set of seed terms related to astronomy (moon, star, galaxy, ...) in its description (a filter sketch follows below).
  • Parse the Wiki dump again (:smiling_face_with_tear:) and move these entities + their articles into two separate entities/articles archives.
  • Upload the archives (where to?) and make them publicly available.
  • Comment out the original Wiki dump download links and replace them (with an appropriate notice) with links to the smaller test sets.

We could do something similar for the Mewsli dataset, but IIRC the legal situation around redistributing the data was unclear. We could download it in full and then filter it with similar seed terms, though; otherwise we'd have a large non-astronomy corpus and only astronomical entities to link.
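
A minimal sketch of the seed-term filter (the seed list and the tokenization are illustrative):

```python
# Minimal sketch of the seed-term filter: keep a Wiki entity if its
# description mentions at least one seed term. The seed list is illustrative.
SEED_TERMS = {"moon", "star", "galaxy", "planet", "comet", "nebula"}


def keep_entity(description: str) -> bool:
    tokens = {token.strip(".,;:()").lower() for token in description.split()}
    return bool(tokens & SEED_TERMS)
```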

rmitsch · Jul 27 '22

I think filtering based on seed terms like that makes sense.

> Upload the archives (where to?) and make them publicly available.

I think the default for this would usually be GitHub Releases, unless the data is too large or something.
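
The test suite could then fetch the asset directly; a hedged sketch (the URL is a placeholder, not a real release):

```python
# Hedged sketch of downloading a test dump published as a GitHub release
# asset; the URL below is a placeholder, not a real release.
from urllib.request import urlretrieve

URL = "https://github.com/<org>/<repo>/releases/download/v0.1/wikidata_test.json.bz2"
urlretrieve(URL, "assets/wikidata_test.json.bz2")
```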

polm · Aug 01 '22

I filtered by a list of European capitals now (testing the internally discussed sports teams/city names idea). The filtered Wikidata and Wikipedia dumps are 74 and 58 MB respectively. Are we ok with that or do we want a smaller set?

[Edit] European cities are a bad match for our US-centric corpora :upside_down_face: Going with US cities now. The question regarding an acceptable dataset size remains.

rmitsch · Aug 01 '22

Git LFS is another option, I guess? Feels kinda weird to me if the only "release" in the repo were the Wiki dump.

rmitsch · Aug 01 '22

Sorry for the late reply.

> I filtered by a list of European capitals now (testing the internally discussed sports teams/city names idea). The filtered Wikidata and Wikipedia dumps are 74 and 58 MB respectively. Are we ok with that or do we want a smaller set?

Those sizes sound great, easy to work with.

> Git LFS is another option, I guess? Feels kinda weird to me if the only "release" in the repo were the Wiki dump.

Git LFS would also work, but I think using it is pretty awkward, especially for data a user is only likely to download once and for which we aren't very concerned about diffs.

For files under 100 MB you can just put them in git, though GitHub will warn that you should use LFS. We wouldn't want to do that in the main branch here, though. One other thing I've heard of people doing is having a "data" branch, so normal checkouts are fast but the data is still in git; even so, I think releases are preferable since this won't be updated often.

polm · Aug 10 '22

Closing and reopening to trigger CI.

rmitsch · Aug 29 '22

> Have you run this on the full unfiltered datasets? What were the results? How do the different candidate generators compare?

Not in the current version, but it's on my todo list.

> I suppose the high Random F-score (88%) is because the filtered data has much less ambiguity than a realistic dump would have?

I'd assume so too, but haven't looked into it yet.

rmitsch · Sep 02 '22