Adrien Barbaresi
Thanks for the further details. There is a mismatch in the way formatting and links are handled here. At first sight I'm not sure which part of the code to...
I tried to isolate the problem. Does that replicate it well enough? The title of the novel is misplaced (as you say) and the paragraph gets broken in two. Input:...
I cannot reproduce the bug. I just tried on the command line, and both the basic extraction and your options work for me: - `trafilatura -u "https://www.enpass.io/privacy-notice/"` - `trafilatura -u "https://www.enpass.io/privacy-notice/"...
I'm now against it: Trafilatura can focus on extraction and navigation on live web pages and leave the rest to the users. There are nice packages to interact with web...
It turns out I actually use `is_known()` so it was a mistake to alter its functioning. It can be useful to know if a token is a dictionary word or...
Let's work on your PR now (I just had one comment) and break the other points apart in new issue threads (changes in dictionary pickler and tests for binary string).
Yes, it would be better, I'll work on this.
I'll document it later in a readme in `training/` but here are a few conditions: - The data is present in two columns: `lemma TAB word form` - Redundant and...
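To make the first condition concrete, here is a minimal sketch of reading the two-column `lemma TAB word form` layout in Python. The function name `read_lemma_pairs` and the sample data are hypothetical and only illustrate the format. This is not the actual training code, and the other conditions are not covered.

```python
import io


def read_lemma_pairs(lines):
    """Yield (lemma, word_form) tuples from tab-separated lines.

    Hypothetical helper: assumes exactly two columns per line,
    `lemma TAB word form`, and silently skips anything else.
    """
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 2:
            continue  # blank line or wrong number of columns
        lemma, form = parts[0].strip(), parts[1].strip()
        if lemma and form:
            yield lemma, form


# Made-up sample: three valid rows, one blank line, one row without a tab.
sample = "be\tis\nbe\twas\n\nmalformed line\ngo\twent\n"
pairs = list(read_lemma_pairs(io.StringIO(sample)))
print(pairs)
```

Skipping malformed rows rather than raising is just a choice made for this sketch. A stricter loader could report the offending line numbers instead.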
@juanjoDiaz Yes, I'll work on the repository in October.
Hi, I'm working on something else at the moment but I still plan to work on it.