BuiltWith: add API-based scraper (no Selenium)
Status (WIP)
New script 311-data/webscraping/builtwith_api_scrape.py (BuiltWith/RapidAPI path).
Reads URLs from the wide NCsurvey.csv (“NC URL (if avail)” row) and iterates all 97 domains.
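A minimal sketch of the URL-reading step, assuming the wide CSV keeps the row label "NC URL (if avail)" in its first column and one council per remaining column. The function name `read_urls_from_wide_csv` and the sample layout are hypothetical; the real script may read the file differently.

```python
import csv
import io

# Row label quoted in the description; its position in column 0 is an assumption.
URL_ROW_LABEL = "NC URL (if avail)"


def read_urls_from_wide_csv(handle, label=URL_ROW_LABEL):
    """Return the non-empty cells of the row whose first cell equals `label`."""
    for row in csv.reader(handle):
        if row and row[0].strip() == label:
            # Skip the label cell, drop blanks (councils without a site).
            return [cell.strip() for cell in row[1:] if cell.strip()]
    return []  # label row not found


# Tiny in-memory stand-in for NCsurvey.csv (made-up layout, not real data).
sample = io.StringIO(
    "Question,Council A,Council B,Council C\n"
    "NC URL (if avail),https://example.org,,example.net\n"
)
urls = read_urls_from_wide_csv(sample)
print(urls)
```

Returning an empty list when the label row is missing lets the caller decide whether that is an error, rather than failing inside the reader.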
Current stop point: completes the lookups but crashes when building the CSVs → KeyError: ['technology'].
Why: the free BuiltWith/RapidAPI response doesn’t always include technologies in the fields our parser expected.
Next steps:
- Make the parser tolerant of multiple response shapes.
- Guard the output step so it writes empty CSVs when no rows are returned.
- (If org has a full BuiltWith API key, run again to populate tech tables.)
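The first two next steps could look roughly like the sketch below. It is not the code in builtwith_api_scrape.py: the helper names, the output columns, and the sample payloads are all hypothetical, and the only assumption about the API is that technologies, when present, appear as a list of objects under a key spelled "technologies" in some letter case, at some depth.

```python
import csv
import io

FIELDNAMES = ["domain", "technology"]  # assumed output columns


def find_technologies(payload):
    """Recursively collect dicts stored under any 'technologies' key."""
    found = []
    if isinstance(payload, dict):
        for key, value in payload.items():
            if key.lower() == "technologies" and isinstance(value, list):
                found.extend(t for t in value if isinstance(t, dict))
            else:
                found.extend(find_technologies(value))
    elif isinstance(payload, list):
        for item in payload:
            found.extend(find_technologies(item))
    return found


def to_rows(domain, payload):
    """Build CSV rows; a payload with no technologies yields no rows."""
    rows = []
    for tech in find_technologies(payload):
        name = tech.get("name") or tech.get("Name")  # assumed field names
        if name:
            rows.append({"domain": domain, "technology": name})
    return rows


def write_rows(handle, rows):
    """Always write the header, so an empty result is still a valid CSV."""
    writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)


# Made-up payloads illustrating three shapes, including one with no technologies.
nested = {"Results": [{"Result": {"Paths": [{"Technologies": [{"Name": "jQuery"}]}]}}]}
flat = {"technologies": [{"name": "WordPress"}]}
empty = {"groups": []}

rows = (
    to_rows("a.example", nested)
    + to_rows("b.example", flat)
    + to_rows("c.example", empty)
)
out = io.StringIO()
write_rows(out, rows)

empty_out = io.StringIO()
write_rows(empty_out, [])  # header only, no crash
```

Because the column names come from `FIELDNAMES` rather than from the data, the output step never has to look up a column that an empty result failed to create.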
Update (Oct 20, 2025): HTML heuristic refresh
Re-ran widget_probe_min.py (no API) across all 97 NC sites.
Current widget counts:
- has_calendar: 63
- has_chatbot: 2
- has_search: 19
- has_translation: 13
Spot checks (first few domains per category):
- Calendar: atwatervillage.org, babcnc.org, bhnc.net, canndunc.org, chnc.org, centralsanpedro.org, chatsworthcouncil.org, cspnc.org
- Chatbot: ncwpdr.org, whcouncil.org
- Search: chnc.org, dlanc.com, echoparknc.com, glassellparknc.org, cypressparknc.com, greaterwilshire.org, hcnnc.org, hhwnc.org
- Translation: babcnc.org, myevrnc.com, cypressparknc.com, hcnnc.org, marvista.org, nohowest.org, prnc.org, soronc.org
Notes:
- The heuristic looks for common calendar/chat/search/translation markers in each site’s HTML.
- The script doesn’t need API keys and doesn’t depend on third-party services. Anyone can rerun it quickly on their own machine to reproduce these numbers.
- Future work: compare what our HTML rules detect against BuiltWith’s widgets group/categories for the same domains. That tells us how complete and accurate our heuristic is and where it misses.
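To illustrate the notes above, here is a rough sketch of a marker-based probe and of the proposed comparison. The regex patterns are illustrative guesses, not the rules in widget_probe_min.py, and `probe_html` and `compare` are hypothetical names; the comparison assumes each source has already been reduced to a set of domains flagged for one widget type.

```python
import re

# Hypothetical markers; the actual script's rules may differ.
MARKERS = {
    "has_calendar": re.compile(r"calendar|tribe-events", re.I),
    "has_chatbot": re.compile(r"tawk\.to|intercom|drift\.com|chatbot", re.I),
    "has_search": re.compile(r"type=[\"']search[\"']|role=[\"']search[\"']", re.I),
    "has_translation": re.compile(r"translate\.google|gtranslate|weglot", re.I),
}


def probe_html(html):
    """Flag each widget type whose marker pattern appears in the HTML."""
    return {flag: bool(pattern.search(html)) for flag, pattern in MARKERS.items()}


def compare(heuristic_hits, reference_hits):
    """Split domains into agreement and one-sided detections."""
    return {
        "agree": sorted(heuristic_hits & reference_hits),
        "heuristic_only": sorted(heuristic_hits - reference_hits),
        "reference_only": sorted(reference_hits - heuristic_hits),
    }


# Made-up page fragment and made-up domain sets, for illustration only.
html = (
    '<form role="search"><input type="search"></form>'
    '<script src="https://translate.google.com/translate_a/element.js"></script>'
)
flags = probe_html(html)

report = compare({"a.example", "b.example"}, {"b.example", "c.example"})
```

Domains under `reference_only` are candidates for heuristic misses, while `heuristic_only` may be either false positives or widgets that the reference source does not list.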