Adrien Barbaresi
I agree, the point is not to extract this bunch of data correctly. It would be best not to start an endless loop on it though, tying up at least one CPU...
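For context, one way to keep a single pathological document from tying up a core indefinitely is to wrap the processing call in a timeout. A minimal sketch, assuming a Unix system (SIGALRM is not available on Windows); the function and exception names here are made up for illustration:

```python
# A minimal sketch of guarding a potentially looping call with a timeout;
# Unix-only (relies on signal.SIGALRM). Names are hypothetical.
import signal

class ExtractionTimeout(Exception):
    """Raised when processing exceeds the allotted time."""

def _handle_alarm(signum, frame):
    raise ExtractionTimeout

def run_with_timeout(func, arg, seconds=10):
    """Run func(arg), aborting after the given number of seconds."""
    signal.signal(signal.SIGALRM, _handle_alarm)
    signal.alarm(seconds)
    try:
        return func(arg)
    except ExtractionTimeout:
        return None  # give up on this document instead of spinning forever
    finally:
        signal.alarm(0)  # always disarm the timer
```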
That sounds good, please work on it if you have the time!
OK, I understand, I'll see what I can do.
Hi, just a quick evaluation on my side:
- [WiLI-2018](https://zenodo.org/record/841984) dataset (Wikipedia sentences, so pretty regular, rather short input, noisy with named entities)
- A few Germanic languages not too...
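For reference, a minimal sketch of how such an evaluation could be run on the WiLI-2018 test split, assuming the usual file layout (one text per line in `x_test.txt`, one matching label per line in `y_test.txt`) and using `langid` as a stand-in for whichever detector is being evaluated:

```python
# A minimal per-language evaluation sketch on WiLI-2018.
# Note: WiLI labels are ISO 639-3 codes, so a mapping to the
# detector's own tag set may be needed before comparing.
from collections import Counter

import langid  # pip install langid; stand-in classifier

correct, total = Counter(), Counter()

with open("x_test.txt", encoding="utf-8") as texts, \
     open("y_test.txt", encoding="utf-8") as labels:
    for text, gold in zip(texts, labels):
        gold = gold.strip()
        predicted, _score = langid.classify(text.strip())
        total[gold] += 1
        if predicted == gold:
            correct[gold] += 1

# per-language accuracy, worst first
for lang in sorted(total, key=lambda l: correct[l] / total[l]):
    print(f"{lang}\t{correct[lang] / total[lang]:.3f}")
```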
@adulau I assume osma's comment answered your question. @osma As you say, mypyc can be used locally, but I didn't enable it in the package release. I confirm the open...
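For the record, a minimal sketch of enabling mypyc compilation locally in a setuptools-based build; the package and module names are placeholders:

```python
# setup.py -- a minimal sketch of local mypyc compilation;
# "mypackage/core.py" is a placeholder module path.
from setuptools import setup
from mypyc.build import mypycify

setup(
    name="mypackage",
    ext_modules=mypycify(["mypackage/core.py"]),
)
```

Running `python setup.py build_ext --inplace` then produces a compiled extension module that is picked up instead of the pure-Python one.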
Thanks @pemistahl for the detailed evaluation! I also like the bar plots you made to compare the results by language. A quick remark on the methodology: you write that "a...
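As an aside, per-language bar plots of this kind take only a few lines with matplotlib; a minimal sketch, where the languages and scores are placeholders rather than evaluation results:

```python
# A minimal per-language accuracy bar plot with matplotlib;
# the values below are placeholders, not measured results.
import matplotlib.pyplot as plt

languages = ["de", "en", "nl", "sv", "da"]
accuracy = [0.95, 0.97, 0.91, 0.89, 0.86]  # hypothetical figures

plt.bar(languages, accuracy)
plt.ylim(0, 1)
plt.ylabel("Accuracy")
plt.title("Detection accuracy by language")
plt.tight_layout()
plt.savefig("accuracy_by_language.png")
```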
Thanks for your answer. I've added JSON to [trafilatura](https://github.com/adbar/trafilatura) and will check whether I can write a straightforward PR.
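For illustration, the JSON output can be requested directly from the Python API in recent trafilatura versions; a minimal sketch:

```python
# A minimal sketch of trafilatura's JSON output (recent versions).
import trafilatura

downloaded = trafilatura.fetch_url("https://example.org/article")
if downloaded is not None:
    result = trafilatura.extract(downloaded, output_format="json")
    print(result)  # a JSON string with the extracted text and metadata fields
```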
Hi @lopuhin, here is another tool that could be added: [Mercury Parser](https://github.com/postlight/mercury-parser). (source: https://github.com/adbar/trafilatura/issues/114)
Hi @lopuhin, just a quick follow-up: the benchmark could also be updated using the latest versions of the tools, see for instance the issue https://github.com/adbar/trafilatura/issues/156.
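To make the suggestion concrete, updating such a benchmark mostly amounts to re-running each tool on the same pages and scoring against the gold text; a minimal sketch using a token-overlap F1, where `load_pages()` is a hypothetical helper yielding (html, gold_text) pairs from the benchmark data:

```python
# A minimal sketch of scoring an extractor with a token-overlap F1;
# load_pages() is a hypothetical loader for (html, gold_text) pairs.
from collections import Counter

import trafilatura

def token_f1(predicted: str, gold: str) -> float:
    """F1 over the multiset token overlap between prediction and gold."""
    pred_tokens, gold_tokens = Counter(predicted.split()), Counter(gold.split())
    overlap = sum((pred_tokens & gold_tokens).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_tokens.values())
    recall = overlap / sum(gold_tokens.values())
    return 2 * precision * recall / (precision + recall)

scores = []
for html, gold_text in load_pages():  # hypothetical data loader
    extracted = trafilatura.extract(html) or ""
    scores.append(token_f1(extracted, gold_text))

print(f"mean F1: {sum(scores) / len(scores):.3f}")
```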
Hi again, thanks for the suggestions! I could derive lemmatization data from the Universal Dependencies treebanks. The first corpus doesn't look as good, as it could lead to wrong word...
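For illustration, deriving (form, lemma) pairs from a Universal Dependencies treebank can be done with the `conllu` package; a minimal sketch, with the file path as a placeholder:

```python
# A minimal sketch of extracting form/lemma pairs from a CoNLL-U file
# with the conllu package; the file path is a placeholder.
from conllu import parse_incr

pairs = set()
with open("de_gsd-ud-train.conllu", encoding="utf-8") as infile:
    for sentence in parse_incr(infile):
        for token in sentence:
            # skip multiword tokens, whose ids are ranges like (1, '-', 2)
            if isinstance(token["id"], int) and token["lemma"] != "_":
                pairs.add((token["form"], token["lemma"]))

print(len(pairs), "form/lemma pairs")
```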