Adrien Barbaresi
Adrien Barbaresi
@hugoobauer Your idea looks good. The length heuristic would have to run on whole `` elements and I'm not sure how. In any case, feel free to draft a pull...
This goes beyond the scope of the software, closing the issue for now.
I can confirm that the issue appears to be fixed.
@dideler Thanks, the year is detected correctly but not the month which is contained in a `` tag. I'll see what I can do.
@stevesong It already works: ``` $ htmldate -u "https://web.archive.org/web/20240111084001/https://www.capacitymedia.com/article/2b9xz33jdzmp2r77gr280/news/tizeti-expansion-to-boost-nigerian-connectivity" 2023-02-13 ```
It's not supposed to normalize anything, I'm just using archived versions to be able to replicate issues at some point in the future.
I guess it's because the download fails, there are websites which restrict access to the download utility, see https://trafilatura.readthedocs.io/en/latest/troubleshooting.html
It changed again, here is the new URL, we're nearly there in terms of coverage: https://app.codecov.io/gh/adbar/trafilatura/blob/master/trafilatura%2Fjson_metadata.py
@getorca Thanks for the hint, your function looks interesting, unfortunately HTML meta tags don't always correspond to the content, I believe it would be better to just apply language detection...
I'm not aware of such a benchmark (HTML lang vs. actual language) but I'd also be curious. Please keep me updated with the extraction benchmark, I'm interested! I can also...