Patrice Lopez
For instance, using French Wikipedia (1B words) + the FrWac corpus (1.6B words) as a reference, it will require 2-3 GeForce GTX 1080Ti GPUs.
For ELMo, which uses a reduced batch size because of memory constraints, it might be necessary to review how the batches are created to ensure that rare classes are well...
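A minimal sketch of what such a batch-creation review could look like: oversample examples of under-represented classes when building the sampling pool, so that even small batches see rare classes regularly. The function name and the `rare_boost` parameter are hypothetical, not part of the current code.

```python
import random
from collections import defaultdict

def balanced_batches(examples, labels, batch_size, rare_boost=3, seed=42):
    """Build batches in which examples of rare classes are oversampled.
    `rare_boost` (hypothetical parameter) is how many times an example of
    a rare class is duplicated in the sampling pool."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for ex, lab in zip(examples, labels):
        by_label[lab].append(ex)
    # classes with fewer examples than the average are considered rare
    avg = len(examples) / len(by_label)
    pool = []
    for lab, exs in by_label.items():
        boost = rare_boost if len(exs) < avg else 1
        for ex in exs:
            pool.extend([(ex, lab)] * boost)
    rng.shuffle(pool)
    return [pool[i:i + batch_size] for i in range(0, len(pool), batch_size)]
```

A proper stratified sampler or class-weighted loss would be alternatives; the sketch only illustrates the simplest oversampling option.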
Branch 0.0.3 contains a corpus-based evaluation together with most of the usual NED corpora (ACE, AQUAINT, AIDA-CONLL, MSNBC, ...). However, it would be good to plug the tool into GERBIL for...
lmdbjava is apparently better maintained (more features & more OS builds) and faster... We also never got the zero-copy mode working reliably with lmdbjni, so it is worth trying lmdbjava...
Wikipedia redirects and anchors cover most of the frequent morphosyntactic variants (e.g. plurals), but not in an exhaustive manner - we could add a process (or pre-process) to support them.
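Such a pre-process could be sketched as a simple variant generator applied to the mention lexicon. The rules below are a naive English-only illustration (the function name is hypothetical); real coverage would need a language-specific lemmatizer or inflection tables.

```python
def morph_variants(term):
    """Generate simple English plural variants for a mention, to
    complement the variants already covered by Wikipedia redirects
    and anchors. Naive rule-based sketch."""
    variants = {term}
    if term.endswith('y') and len(term) > 2 and term[-2] not in 'aeiou':
        variants.add(term[:-1] + 'ies')        # city -> cities
    elif term.endswith(('s', 'x', 'z', 'ch', 'sh')):
        variants.add(term + 'es')              # box -> boxes
    else:
        variants.add(term + 's')               # cat -> cats
    return variants
```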
See the Java client written in anHALytics-core as a starting point (the multithreaded version): https://github.com/anHALytics/anhalytics-core/blob/master/anhalytics-annotate/src/main/java/fr/inria/anhalytics/annotate/services/NerdService.java https://github.com/anHALytics/anhalytics-core/blob/master/anhalytics-annotate/src/main/java/fr/inria/anhalytics/annotate/Annotator.java
The Wikidata dump became very big, with 1.2 billion statements, which makes the initial loading of the bz2 dump into lmdb particularly slow. To speed up this step, we could try:...
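One direction worth trying is to amortize transaction overhead by streaming the bz2 dump and writing in large batches, one lmdb transaction per batch instead of per statement. A sketch, where `write_batch` is a placeholder for the actual lmdb writer:

```python
import bz2
import itertools

def load_dump(path, write_batch, batch_size=10000):
    """Stream a bz2-compressed dump line by line and hand the lines to
    `write_batch` (placeholder for the lmdb writer) in large chunks, so
    each write transaction covers many statements instead of one."""
    with bz2.open(path, 'rt', encoding='utf-8') as f:
        while True:
            batch = list(itertools.islice(f, batch_size))
            if not batch:
                break
            write_batch(batch)
```

Decompression itself could further be parallelized (e.g. a decompression worker feeding parser workers), but batched writes are the simplest first step.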
Some disambiguation fails for terms present in Wikidata (as labels) because there is no usage information in the Wikipedia of this target language. The difficulty is that without any statistical...
It would be good to make the lmdb map a bit more dynamic, selecting the right weight encoding based on the actual value range, so that the mechanism can...
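The selection could look like the sketch below: inspect the observed range of the weights and pick the smallest encoding that fits. The function names and the choice of `struct` formats are assumptions for illustration; the actual entity-fishing encoding may differ.

```python
import struct

def pick_weight_format(values):
    """Pick the smallest struct format able to encode the observed
    weight range, so the lmdb map stores compact values."""
    lo, hi = min(values), max(values)
    if all(float(v).is_integer() for v in values):
        for fmt, (fmin, fmax) in (('b', (-128, 127)),
                                  ('h', (-32768, 32767)),
                                  ('i', (-2**31, 2**31 - 1))):
            if fmin <= lo and hi <= fmax:
                return fmt
        return 'q'
    return 'f'  # fall back to 32-bit float for fractional weights

def encode_weights(values):
    """Encode a list of weights with the format chosen above."""
    fmt = pick_weight_format(values)
    return fmt, struct.pack('<%d%s' % (len(values), fmt),
                            *(int(v) if fmt != 'f' else v for v in values))
```

The chosen format character would need to be stored alongside the value (or in the map metadata) so reads can decode it back.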
The production of the training data is currently single-threaded and really slow for the selection model. A straightforward improvement is to use several workers for this, as this task...
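A minimal sketch of the worker fan-out, assuming the per-article work can be isolated in one function (`build_example` here is a hypothetical stand-in for the real feature extraction):

```python
from concurrent.futures import ThreadPoolExecutor

def build_example(article):
    """Placeholder for the per-article work producing one training
    example for the selection model."""
    return article.lower()  # stand-in for the real feature extraction

def produce_training_data(articles, workers=4):
    """Fan the per-article work out to several workers; `map` preserves
    the order of the produced examples."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(build_example, articles))
```

If the extraction is CPU-bound pure Python, a `ProcessPoolExecutor` would be the better fit; the thread version keeps the sketch simple and suits work dominated by I/O or native code.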