pieterhartel
@adbar wrote "The reason is that trailing titles at the bottom of articles are discarded during extraction". I don't think that this is the case. There is text following the...
I now see that this is the same as issue #7. Would it be possible to give this some priority? --pieter
Does this mean that with the current system scanning .onion sites is impossible?
Brilliant workaround, thanks!
Well, I tried to summarise what trafilatura is currently doing. Could you confirm or correct it, please?
I have a dataset with over 10M home pages, of which 82K (less than 1%) contain a `@class="entry-title"`. WP is not popular in this dataset. 75K pages contain a ``,...
I suppose the question is why this is happening, what percentage of the certificates is missed, and how to avoid the error... I would be very much interested in the...
I am tempted, but how do I debug existing or new REs?
To explain my question, I have added a parser class for Belgium as follows:
```
class WhoisBe(WhoisEntry):
    """Whois parser for .be domains"""

    regex: dict[str, str] = {
        "domain_name": r"Domain: *(.+)",...
```