pieterhartel

Results: 13 comments by pieterhartel

@adbar wrote "The reason is that trailing titles at the bottom of articles are discarded during extraction". I don't think that this is the case. There is text following the...

I now see that this is the same as issue #7. Would it be possible to give this some priority? --pieter

Does this mean that, with the current system, scanning .onion sites is impossible?

Well, I tried to summarise what trafilatura is currently doing. Could you confirm or correct it, please?

I have a dataset with over 10M home pages, of which 82K (less than 1%) contain a `@class="entry-title"`. WP is not popular in this dataset. 75K pages contain a ``,...

I suppose the questions are why this is happening, what percentage of the certificates is missed, and how to avoid the error... I would be very much interested in the...

I am tempted, but how do I debug existing or new REs?
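One way to approach this, sketched with nothing but Python's standard `re` module: run each pattern against a saved sample response and print what it captures, so that a pattern that matches nothing stands out. The sample text, the `registrar` pattern and the `debug_patterns` helper below are hypothetical and only for illustration; only the `domain_name` pattern appears in this thread, and this is not necessarily how the library itself applies its regexes.

```python
import re

# Hypothetical sample of a whois response, made up for illustration.
SAMPLE = """\
Domain: example.be
Status: NOT AVAILABLE
"""

# "domain_name" is the pattern quoted in this thread; "registrar" is a
# hypothetical pattern that deliberately fails on the sample above.
PATTERNS = {
    "domain_name": r"Domain: *(.+)",
    "registrar": r"Registrar Name: *(.+)",
}


def debug_patterns(patterns, text):
    """Return, for each key, the list of captured values found in text."""
    results = {}
    for key, pattern in patterns.items():
        compiled = re.compile(pattern, re.IGNORECASE)
        # findall returns the content of the single capture group per match.
        results[key] = [value.strip() for value in compiled.findall(text)]
    return results


if __name__ == "__main__":
    for key, matches in debug_patterns(PATTERNS, SAMPLE).items():
        print(f"{key}: {matches if matches else 'NO MATCH'}")
```

A key that reports `NO MATCH` points to a pattern worth inspecting, for example for a label or whitespace that differs from the real response.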

To explain my question, I have added a parser class for Belgium as follows:

```
class WhoisBe(WhoisEntry):
    """Whois parser for .be domains"""
    regex: dict[str, str] = {
        "domain_name": r"Domain: *(.+)",
        ...
```