Adrien Barbaresi
Hi @Lucabenj, could you please share an example?
Extraction bugs in text and metadata can be listed here, as in https://github.com/adbar/htmldate/issues/8, where issues specifically related to dates should be reported.

- [ ] https://web.archive.org/web/20211229181034/https://research.checkpoint.com/2021/a-deep-dive-into-doublefeature-equation-groups-post-exploitation-dashboard/
- [ ] https://web.archive.org/web/20220217023331/https://thehill.com/homenews/senate/594044-sen-lujan-to-return-to-senate-in-time-to-vote-for-supreme-court-nominee...
Yes, I think the issues in the document you mention are related to deleted `` sections.
Hi @karlkovaciny, the cutting-edge version from the repository is slightly better: it outputs the article but still includes garbled JavaScript. That's definitely a case to watch for. EDIT: for the...
Suggested in #208:

- Text below the article: https://web.archive.org/web/20220513095359/https://www.vivereancona.it/2022/05/13/ubriaco-cammina-lungo-la-flaminia-quando-vede-arrivare-i-soccorsi-si-getta-nei-cespugli-e-fugge/2100180067/
- Some text lost in false positives (other articles): https://web.archive.org/web/20220513100231/https://www.vallesabbianews.it/notizie-it/%28Ro%C3%A8-Volciano%29-Davide-Nedrotti-campione-del-mondo-60595.html
- Text above and below: https://web.archive.org/web/20220513100330/https://www.quinewsfirenze.it/firenze-terremoto-chianti.htm
Hi @felipehertzer, I don't think I can reproduce the bug. Which metadata fields do you mean exactly?
Hi @naftalibeder, all tests pass except for the dev version of Python 3.11, which is experimental: https://github.com/adbar/trafilatura/actions/runs/1730002541 I don't understand what is happening here. Are you using Python 3.9.9? On which...
This is really strange. It could have something to do with lxml and its underlying XML library, but I'm not sure. Please keep me posted if you find an explanation.
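To narrow this down, it may help to compare the versions involved. The following is only a sketch, and the helper name `version_report` is made up for illustration. It collects the Python version, the platform, and the lxml and libxml2 versions that lxml itself reports:

```python
import platform
import sys

from lxml import etree


def version_report():
    """Collect interpreter, platform, and lxml/libxml2 version strings."""
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        # lxml exposes its versions as tuples of integers
        "lxml": ".".join(map(str, etree.LXML_VERSION)),
        "libxml2_runtime": ".".join(map(str, etree.LIBXML_VERSION)),
        "libxml2_compiled": ".".join(map(str, etree.LIBXML_COMPILED_VERSION)),
    }


report = version_report()
for name, value in report.items():
    print(f"{name}: {value}")
```

If the runtime and compiled libxml2 versions differ between two machines, that could be a lead, although it would not prove anything by itself.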
Could you try the underlying library lxml alone on the problem at hand? You open the file, load it, and try to perform an operation on the tree, here is...
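A minimal sketch of such a check, assuming the page is available as a string or as bytes. The helper name `inspect_tree` and the sample markup are hypothetical and only serve as an illustration. It parses the input with lxml alone and runs a few simple operations on the resulting tree:

```python
from lxml import html


def inspect_tree(raw_html):
    """Parse HTML with lxml only and run basic operations on the tree."""
    tree = html.fromstring(raw_html)
    paragraphs = tree.xpath("//p")
    return {
        "root_tag": tree.tag,
        "paragraph_count": len(paragraphs),
        "text_length": len(tree.text_content()),
    }


# Hypothetical sample input; replace it with the content of the problematic file
sample = "<html><body><h1>Title</h1><p>First.</p><p>Second.</p></body></html>"
result = inspect_tree(sample)
print(result)
```

To try it on a saved page, read the file in binary mode (`open(path, "rb").read()`) and pass the bytes to the function. If parsing or the XPath query already fails at this stage, the problem probably lies in lxml or libxml2 rather than in the code built on top of them.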
Thanks, let's follow the resolution of the issue there.