Adrien Barbaresi
I agree, the point is not to extract this bunch of data correctly. It would be best not to start an endless loop on it though, tying up at least one CPU...
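For context, one way to keep a single pathological document from tying up a core indefinitely is to wrap the processing call in a timeout. A minimal sketch, assuming a Unix system (SIGALRM is not available on Windows); the function and exception names here are made up for illustration:

```python
# A minimal sketch of guarding a potentially looping call with a timeout;
# Unix-only (relies on signal.SIGALRM). Names are hypothetical.
import signal

class ExtractionTimeout(Exception):
    """Raised when processing exceeds the allotted time."""

def _handle_alarm(signum, frame):
    raise ExtractionTimeout

def run_with_timeout(func, arg, seconds=10):
    """Run func(arg), aborting after the given number of seconds."""
    signal.signal(signal.SIGALRM, _handle_alarm)
    signal.alarm(seconds)
    try:
        return func(arg)
    except ExtractionTimeout:
        return None  # give up on this document instead of spinning forever
    finally:
        signal.alarm(0)  # always disarm the timer
```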
That sounds good, please work on it if you have the time!
OK, I understand, I'll see what I can do.
Hi, just a quick evaluation on my side:
- [WiLI-2018](https://zenodo.org/record/841984) dataset (Wikipedia sentences, so pretty regular, rather short input, noisy with named entities)
- A few Germanic languages not too...
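For reference, a minimal sketch of how such an evaluation could be run on the WiLI-2018 test split, assuming the usual file layout (one text per line in `x_test.txt`, one matching label per line in `y_test.txt`) and using `langid` as a stand-in for whichever detector is being evaluated:

```python
# A minimal per-language evaluation sketch on WiLI-2018.
# Note: WiLI labels are ISO 639-3 codes, so a mapping to the
# detector's own tag set may be needed before comparing.
from collections import Counter

import langid  # pip install langid; stand-in classifier

correct, total = Counter(), Counter()

with open("x_test.txt", encoding="utf-8") as texts, \
     open("y_test.txt", encoding="utf-8") as labels:
    for text, gold in zip(texts, labels):
        gold = gold.strip()
        predicted, _score = langid.classify(text.strip())
        total[gold] += 1
        if predicted == gold:
            correct[gold] += 1

# per-language accuracy, worst first
for lang in sorted(total, key=lambda l: correct[l] / total[l]):
    print(f"{lang}\t{correct[lang] / total[lang]:.3f}")
```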
@adulau I assume osma's comment answered your question. @osma As you say, mypyc can be used locally, but I didn't enable it in the package release. I confirm the open...
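For the record, a minimal sketch of enabling mypyc compilation locally in a setuptools-based build; the package and module names are placeholders:

```python
# setup.py -- a minimal sketch of local mypyc compilation;
# "mypackage/core.py" is a placeholder module path.
from setuptools import setup
from mypyc.build import mypycify

setup(
    name="mypackage",
    ext_modules=mypycify(["mypackage/core.py"]),
)
```

Running `python setup.py build_ext --inplace` then produces a compiled extension module that is picked up instead of the pure-Python one.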
Thanks @pemistahl for the detailed evaluation! I also like the bar plots you made to compare the results by language. A quick remark on the methodology: you write that "a...
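As an aside, per-language bar plots of this kind take only a few lines with matplotlib; a minimal sketch, where the languages and scores are placeholders rather than evaluation results:

```python
# A minimal per-language accuracy bar plot with matplotlib;
# the values below are placeholders, not measured results.
import matplotlib.pyplot as plt

languages = ["de", "en", "nl", "sv", "da"]
accuracy = [0.95, 0.97, 0.91, 0.89, 0.86]  # hypothetical figures

plt.bar(languages, accuracy)
plt.ylim(0, 1)
plt.ylabel("Accuracy")
plt.title("Detection accuracy by language")
plt.tight_layout()
plt.savefig("accuracy_by_language.png")
```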
Thanks for your answer. I've added JSON to [trafilatura](https://github.com/adbar/trafilatura) and will check whether I can write a straightforward PR.
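For illustration, the JSON output can be requested directly from the Python API in recent trafilatura versions; a minimal sketch:

```python
# A minimal sketch of trafilatura's JSON output (recent versions).
import trafilatura

downloaded = trafilatura.fetch_url("https://example.org/article")
if downloaded is not None:
    result = trafilatura.extract(downloaded, output_format="json")
    print(result)  # a JSON string with the extracted text and metadata fields
```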
Hi @lopuhin, here is another tool that could be added: [Mercury Parser](https://github.com/postlight/mercury-parser). (source: https://github.com/adbar/trafilatura/issues/114)
Hi @lopuhin, just a quick follow-up: the benchmark could also be updated using the latest versions of the tools, see for instance the issue https://github.com/adbar/trafilatura/issues/156.
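To make the suggestion concrete, updating such a benchmark mostly amounts to re-running each tool on the same pages and scoring against the gold text; a minimal sketch using a token-overlap F1, where `load_pages()` is a hypothetical helper yielding (html, gold_text) pairs from the benchmark data:

```python
# A minimal sketch of scoring an extractor with a token-overlap F1;
# load_pages() is a hypothetical loader for (html, gold_text) pairs.
from collections import Counter

import trafilatura

def token_f1(predicted: str, gold: str) -> float:
    """F1 over the multiset token overlap between prediction and gold."""
    pred_tokens, gold_tokens = Counter(predicted.split()), Counter(gold.split())
    overlap = sum((pred_tokens & gold_tokens).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_tokens.values())
    recall = overlap / sum(gold_tokens.values())
    return 2 * precision * recall / (precision + recall)

scores = []
for html, gold_text in load_pages():  # hypothetical data loader
    extracted = trafilatura.extract(html) or ""
    scores.append(token_f1(extracted, gold_text))

print(f"mean F1: {sum(scores) / len(scores):.3f}")
```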
Hi again, thanks for the suggestions! I could derive lemmatization data from the Universal Dependencies treebanks. The first corpus doesn't look as good, as it could lead to wrong word...
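For illustration, deriving (form, lemma) pairs from a Universal Dependencies treebank can be done with the `conllu` package; a minimal sketch, with the file path as a placeholder:

```python
# A minimal sketch of extracting form/lemma pairs from a CoNLL-U file
# with the conllu package; the file path is a placeholder.
from conllu import parse_incr

pairs = set()
with open("de_gsd-ud-train.conllu", encoding="utf-8") as infile:
    for sentence in parse_incr(infile):
        for token in sentence:
            # skip multiword tokens, whose ids are ranges like (1, '-', 2)
            if isinstance(token["id"], int) and token["lemma"] != "_":
                pairs.add((token["form"], token["lemma"]))

print(len(pairs), "form/lemma pairs")
```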