Adrien Barbaresi
Thanks for the suggestions! I don't understand everything I change for Russian and Ukrainian, but I tried to adapt suffix lists from the English Wiktionary. I think I took the...
Interesting thoughts, thanks. Sadly, I lack the time to perform in-depth analyses of what's happening here; I look at the lemmatization accuracy and try to strike a balance. The newest...
Hi @1over137, thank you very much for the deep dive into the data and the build process! I cannot address the topic right now, will come back to it later,...
@szhengac Nothing new at the moment, but the `alt` issue you're mentioning is only loosely related to this one. Can you give an example?
Hi @kinoute, I think it could be because of a tag mismatch (malformed HTML) just before the text segments: `به زبانهای دیگر` It implies that all that follows is a...
@sepsi77 There are LXML-related issues on MacOS M1, M2 etc. (see also https://github.com/adbar/trafilatura/issues/166). Is it the platform you're using or can you provide more details?
Did you try building [LXML from source](https://lxml.de/build.html)?
Hi @kinoute, there must be something wrong in the way you encode or decode the HTML response. I cannot reproduce the bug: `trafilatura -u "http://sport.kurganobl.ru/8980.html" --json` works on my computer.
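To illustrate the kind of encoding mismatch hinted at above, here is a minimal sketch using only the Python standard library. The sample text and the `windows-1251` charset are assumptions made for illustration; the thread does not state which encoding the page actually uses.

```python
# Hypothetical sample text; the real page content is not shown in the thread.
original = "Новости спорта"

# Assumption: the server sends bytes encoded as windows-1251.
raw_bytes = original.encode("windows-1251")

# Decoding with a mismatched codec does not raise an error here;
# it silently produces garbled text (mojibake).
wrong = raw_bytes.decode("latin-1")

# Decoding with the codec that matches the bytes restores the text.
right = raw_bytes.decode("windows-1251")

print(wrong == original)
print(right == original)
```

If the decoded string handed to the extractor is already garbled, the extracted output will be garbled too, so checking the charset used at the decoding step is a reasonable first thing to rule out.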
@sepsi77 Please note that brew can now be used to install Trafilatura on MacOS in a seamless way: https://formulae.brew.sh/formula/trafilatura
Hi @hugoobauer, this problem is also mentioned in #432. The problem with taking all article elements is that sometimes they are related content and not main content (e.g. a list...