web-monitoring-processing Determine title from content if `<title>` is missing

Determine title from content if `<title>` is missing

Open Mr0grog opened this issue 8 months ago • 0 comments

trafficstars

When importing new versions of HTML pages (either from Wayback’s Memento API or from WARCs), we look for the page’s <title> element or use the empty string: https://github.com/edgi-govdata-archiving/web-monitoring-processing/blob/9c6a2cfed53c32e886ae16ce287878beffbf9622/web_monitoring/utils.py#L169-L180

There are a bunch of pages that turn out to be missing <title> elements, so it would probably be good to fall back to looking for the <h1> or some other title-like information in the page body.

Where present, the first <h1> seems like a reasonable fallback. Examples:
- https://api.monitoring.envirodatagov.org/api/v0/versions/05cd42f1-d995-446e-a33f-04ee2f96b6ad?different=false
- https://monitoring.envirodatagov.org/page/d1620a7d-557c-4517-89f7-53577d5d4e34/31ffa13d-b1ab-410b-b531-b8198db171bc..540fc862-3220-43b5-8a24-28f08a86554f
EPA’s LASSO adds some complexity here. The title is in an <h1>, but another <h1> (the first one) is a link back to the EPA home page. Maybe look for <h1> that doesn’t contain a link to a different URL?
Argonne Nat’l Labs has similar issues with the first <h1> being a link to the home page: https://monitoring.envirodatagov.org/page/d617a0c4-27b7-4bad-a190-983e25cc1819/0230138d-361b-40aa-a65b-d610f5fbe3e5..d431a180-aa91-44fc-aec9-c8a3f2706aa9
“National Flood Hazard Layer (NFHL)” has no heading elements at all (in the HTML; it does after scripts run, which is… not great). However, it does have: <span class="title">National Flood Hazard Layer (NFHL)</span>. So maybe looking for //*[contains(concat(' ',normalize-space(@class),' '),' title ')] is good?
For plain text, maybe the first sentence of the first line? (example)
A lot of error pages have no title. Maybe use <status code> <status text> (e.g. “404 Not Found”) in this case?

Mar 05 '25 03:03 Mr0grog

web-monitoring-processing web-monitoring-processing copied to clipboard

Determine title from content if `<title>` is missing

web-monitoring-processing
web-monitoring-processing copied to clipboard