Adrien Barbaresi comments

Results: 412 comments of Adrien Barbaresi

List of smaller extraction bugs (text & metadata)

@hugoobauer Your idea looks good. The length heuristic would have to run on whole `` elements and I'm not sure how. In any case, feel free to draft a pull...

Bypass captchas/cookies/consent windows?

This goes beyond the scope of the software, closing the issue for now.

Are there any settings that allow us to make sure that the full article is scraped instead of just the initial part of it?

I can confirm that the issue appears to be fixed.

Test htmldate on further web pages and report bugs

@dideler Thanks, the year is detected correctly but not the month, which is contained in a `` tag. I'll see what I can do.

Test htmldate on further web pages and report bugs

@stevesong It already works:

```
$ htmldate -u "https://web.archive.org/web/20240111084001/https://www.capacitymedia.com/article/2b9xz33jdzmp2r77gr280/news/tizeti-expansion-to-boost-nigerian-connectivity"
2023-02-13
```

Test htmldate on further web pages and report bugs

It's not supposed to normalize anything, I'm just using archived versions to be able to replicate issues at some point in the future.

Test htmldate on further web pages and report bugs

I guess it's because the download fails: some websites restrict access to the download utility, see https://trafilatura.readthedocs.io/en/latest/troubleshooting.html

Extend test coverage for json_metadata functions

It changed again, here is the new URL. We're nearly there in terms of coverage: https://app.codecov.io/gh/adbar/trafilatura/blob/master/trafilatura%2Fjson_metadata.py

Add document language to metadata

@getorca Thanks for the hint, your function looks interesting. Unfortunately, HTML meta tags don't always correspond to the content; I believe it would be better to just apply language detection...
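
To illustrate the mismatch mentioned in the comment above, here is a minimal sketch using only the Python standard library. It reads the `lang` attribute declared on the `<html>` element and compares it with a guess based on the text itself. The names `LangProbe`, `guess_language`, `compare_languages` and the tiny stopword lists are hypothetical and only for illustration. They are not part of trafilatura, and a real setup would presumably rely on a proper language detection library instead of stopword counting.

```python
from html.parser import HTMLParser

# Hypothetical, deliberately tiny stopword lists (illustration only).
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to", "in"},
    "de": {"der", "die", "und", "ist", "das", "nicht"},
    "fr": {"le", "la", "et", "est", "les", "des"},
}


class LangProbe(HTMLParser):
    """Collect the lang attribute of <html> and all text chunks."""

    def __init__(self):
        super().__init__()
        self.declared_lang = None
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        # Only the first <html> element is considered.
        if tag == "html" and self.declared_lang is None:
            self.declared_lang = dict(attrs).get("lang")

    def handle_data(self, data):
        # Naive: this would also collect script/style content if present.
        self.chunks.append(data)


def guess_language(text):
    """Return the language whose stopwords occur most often, or None."""
    tokens = text.lower().split()
    scores = {
        lang: sum(token in words for token in tokens)
        for lang, words in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def compare_languages(html):
    """Return (declared language, guessed language) for an HTML string."""
    probe = LangProbe()
    probe.feed(html)
    return probe.declared_lang, guess_language(" ".join(probe.chunks))


# Made-up sample: the markup claims English, the text is German.
SAMPLE = (
    '<html lang="en"><body>'
    "<p>Der Hund ist nicht in der Stadt und das ist gut.</p>"
    "</body></html>"
)

if __name__ == "__main__":
    declared, guessed = compare_languages(SAMPLE)
    print(f"declared: {declared}, guessed: {guessed}")
```

On the made-up sample the declared and guessed languages disagree, which is the kind of case where relying on the markup alone would put the wrong language into the metadata.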

Add document language to metadata

I'm not aware of such a benchmark (HTML lang vs. actual language) but I'd also be curious. Please keep me updated with the extraction benchmark, I'm interested! I can also...