web-monitoring-processing
Determine title from content if `<title>` is missing
When importing new versions of HTML pages (either from Wayback’s Memento API or from WARCs), we look for the page’s `<title>` element or use the empty string: https://github.com/edgi-govdata-archiving/web-monitoring-processing/blob/9c6a2cfed53c32e886ae16ce287878beffbf9622/web_monitoring/utils.py#L169-L180
There are a bunch of pages that turn out to be missing `<title>` elements, so it would probably be good to fall back to looking for the `<h1>` or some other title-like information in the page body.
- Where present, the first `<h1>` seems like a reasonable fallback. Examples:
  - https://api.monitoring.envirodatagov.org/api/v0/versions/05cd42f1-d995-446e-a33f-04ee2f96b6ad?different=false
  - https://monitoring.envirodatagov.org/page/d1620a7d-557c-4517-89f7-53577d5d4e34/31ffa13d-b1ab-410b-b531-b8198db171bc..540fc862-3220-43b5-8a24-28f08a86554f
- EPA’s LASSO adds some complexity here. The title is in an `<h1>`, but another `<h1>` (the first one) is a link back to the EPA home page. Maybe look for an `<h1>` that doesn’t contain a link to a different URL?
- Argonne Nat’l Labs has similar issues with the first `<h1>` being a link to the home page: https://monitoring.envirodatagov.org/page/d617a0c4-27b7-4bad-a190-983e25cc1819/0230138d-361b-40aa-a65b-d610f5fbe3e5..d431a180-aa91-44fc-aec9-c8a3f2706aa9
- “National Flood Hazard Layer (NFHL)” has no heading elements at all (in the HTML; it does after scripts run, which is… not great). However, it does have: `<span class="title">National Flood Hazard Layer (NFHL)</span>`. So maybe looking for `//*[contains(concat(' ',normalize-space(@class),' '),' title ')]` is good?
- For plain text, maybe the first sentence of the first line? (example)
- A lot of error pages have no title. Maybe use `<status code> <status text>` (e.g. “404 Not Found”) in this case?
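
To make the ideas above concrete, here is a rough sketch of how the fallback chain might look. It is not the current `utils.py` code: `guess_title`, `_TitleCandidates`, and all parameter names are hypothetical, and it uses the standard library’s `html.parser` so it runs on its own (the real importer may use a different parser). The `class="title"` check approximates the XPath above by splitting the `class` attribute on whitespace, and the plain-text branch uses the first non-blank line, which is an assumption.

```python
import re
from html.parser import HTMLParser
from http import HTTPStatus
from urllib.parse import urldefrag, urljoin

# Elements that never have a closing tag, so they cannot wrap title text.
VOID_TAGS = {'area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input',
             'link', 'meta', 'source', 'track', 'wbr'}


class _TitleCandidates(HTMLParser):
    """Collect text from <title>, <h1>, and class="title" elements."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.found = []    # (kind, text, hrefs) tuples in document order
        self._open = None  # the candidate element currently being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._open is not None:
            # Inside a candidate: track same-name nesting and any links.
            if tag == self._open['tag']:
                self._open['depth'] += 1
            if tag == 'a' and attrs.get('href'):
                self._open['hrefs'].append(attrs['href'])
            return
        if tag in VOID_TAGS:
            return
        classes = (attrs.get('class') or '').split()
        if tag in ('title', 'h1'):
            kind = tag
        elif 'title' in classes:
            kind = 'class'
        else:
            return
        self._open = {'kind': kind, 'tag': tag, 'depth': 1,
                      'text': [], 'hrefs': []}

    def handle_data(self, data):
        if self._open is not None:
            self._open['text'].append(data)

    def handle_endtag(self, tag):
        # Note: this sketch assumes candidates are explicitly closed.
        if self._open is None or tag != self._open['tag']:
            return
        self._open['depth'] -= 1
        if self._open['depth'] == 0:
            text = ' '.join(''.join(self._open['text']).split())
            self.found.append((self._open['kind'], text, self._open['hrefs']))
            self._open = None


def _links_elsewhere(hrefs, page_url):
    """True if any link resolves to a URL other than the page itself."""
    here = urldefrag(page_url).url
    return any(urldefrag(urljoin(page_url, h)).url != here for h in hrefs)


def guess_title(body, media_type='text/html', url='', status=None):
    """Best-effort title: <title>, then <h1>, then class="title", then status."""
    title = ''
    if media_type == 'text/plain':
        # First sentence of the first non-blank line.
        line = next((l.strip() for l in body.splitlines() if l.strip()), '')
        title = re.split(r'(?<=[.!?])\s+', line, maxsplit=1)[0]
    else:
        parser = _TitleCandidates()
        parser.feed(body)
        parser.close()
        for wanted in ('title', 'h1', 'class'):
            for kind, text, hrefs in parser.found:
                # Skip empty candidates and <h1>s that link to another page.
                if kind != wanted or not text:
                    continue
                if kind == 'h1' and _links_elsewhere(hrefs, url):
                    continue
                title = text
                break
            if title:
                break
    if not title and status is not None and status >= 400:
        # Untitled error page: fall back to e.g. "404 Not Found".
        try:
            title = f'{status} {HTTPStatus(status).phrase}'
        except ValueError:
            title = str(status)
    return title
```

For a LASSO-style page, `guess_title('<h1><a href="/">EPA</a></h1><h1>LASSO</h1>', url='https://example.gov/lasso')` skips the first `<h1>` because its link points away from the page and picks the second one. The priority order (`<title>`, then `<h1>`, then `class="title"`) is only one possible choice and would need checking against the real examples above.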