BuiltWith: add API-based scraper (no Selenium)

Open Adaakal opened this issue 2 months ago • 1 comments

Status (WIP)

New script 311-data/webscraping/builtwith_api_scrape.py (BuiltWith/RapidAPI path).

Reads URLs from the wide NCsurvey.csv (“NC URL (if avail)” row) and iterates all 97 domains.
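For anyone picking this up, here is a minimal sketch of how the URL row could be pulled out of a wide CSV. It assumes the row label sits in the first column and the URLs fill the remaining cells; that layout, the function name `read_urls_from_wide_csv`, and the sample data are illustrative assumptions, not the actual contents of NCsurvey.csv or the script.

```python
import csv
import io

URL_ROW_LABEL = "NC URL (if avail)"  # row label quoted in this issue


def read_urls_from_wide_csv(handle, row_label=URL_ROW_LABEL):
    """Return the non-empty cells of the row whose first cell equals row_label.

    Assumes a wide layout: one row per survey question, one column per NC.
    """
    for row in csv.reader(handle):
        if row and row[0].strip() == row_label:
            return [cell.strip() for cell in row[1:] if cell.strip()]
    return []  # label not found


# Tiny in-memory stand-in for NCsurvey.csv (made-up layout and values).
sample = io.StringIO(
    "Question,NC A,NC B,NC C\n"
    "NC URL (if avail),https://example.org,,https://example.net\n"
)
urls = read_urls_from_wide_csv(sample)
print(urls)
```

Blank cells are skipped, so councils without a URL do not produce empty lookups.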

Current stop point: the script completes the lookups but crashes when building the CSVs → KeyError: ['technology'].

Why: the free BuiltWith/RapidAPI response doesn’t always include technologies in the fields our parser expected.

Next steps:

Make parser tolerant of multiple response shapes.

Guard output step so it writes empty CSVs when no rows are returned.

(If org has a full BuiltWith API key, run again to populate tech tables.)
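A rough sketch of the first two next steps, in case it helps whoever continues this. The candidate key names (`technology`, `technologies`, `Technologies`, `name`, `Name`) are guesses at possible response shapes, not the documented BuiltWith/RapidAPI schema, and the helper names are hypothetical. "Empty CSV" is interpreted here as a header-only file.

```python
import csv
import io

# Hypothetical key names; the real BuiltWith/RapidAPI payload may differ.
CANDIDATE_KEYS = ("technology", "technologies", "Technologies")
FIELDNAMES = ["domain", "technology"]


def extract_technologies(payload):
    """Return a list of technology names, or [] if none can be found."""
    if not isinstance(payload, dict):
        return []
    for key in CANDIDATE_KEYS:
        value = payload.get(key)  # .get() avoids the KeyError
        if isinstance(value, str):
            return [value]
        if isinstance(value, list):
            names = []
            for item in value:
                if isinstance(item, str):
                    names.append(item)
                elif isinstance(item, dict):
                    name = item.get("name") or item.get("Name")
                    if name:
                        names.append(name)
            return names
    return []


def build_rows(responses):
    """Flatten {domain: payload} into one row per (domain, technology)."""
    return [
        {"domain": domain, "technology": tech}
        for domain, payload in responses.items()
        for tech in extract_technologies(payload)
    ]


def write_rows(handle, rows):
    """Always write the header, so zero rows yields a header-only CSV."""
    writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)


# Made-up responses: one with data, one without the expected field.
responses = {
    "example.org": {"technologies": [{"name": "WordPress"}, "jQuery"]},
    "example.net": {"status": "limited"},
}
out = io.StringIO()
write_rows(out, build_rows(responses))
print(out.getvalue())
```

Passing an open file handle instead of `io.StringIO` would write the same content to disk.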

Adaakal, Sep 16 '25 05:09

Update (Oct 20, 2025): HTML heuristic refresh

Re-ran widget_probe_min.py (no API) across all 97 NC sites.

Current widget counts:

  • has_calendar: 63
  • has_chatbot: 2
  • has_search: 19
  • has_translation: 13

Spot checks (first few domains per category):

  • Calendar: atwatervillage.org, babcnc.org, bhnc.net, canndunc.org, chnc.org, centralsanpedro.org, chatsworthcouncil.org, cspnc.org
  • Chatbot: ncwpdr.org, whcouncil.org
  • Search: chnc.org, dlanc.com, echoparknc.com, glassellparknc.org, cypressparknc.com, greaterwilshire.org, hcnnc.org, hhwnc.org
  • Translation: babcnc.org, myevrnc.com, cypressparknc.com, hcnnc.org, marvista.org, nohowest.org, prnc.org, soronc.org

Notes:

  • Heuristic looks for common calendar/chat/search/translation markers in HTML.

  • The script doesn’t need API keys and doesn’t depend on third-party services. Anyone can rerun it quickly on their own machine to reproduce these numbers.

  • Future work: compare what our HTML rules detect against BuiltWith’s widgets group/categories for the same domains. That tells us how complete/accurate our heuristic is and where it misses.
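To make the notes above concrete, here is a small sketch of a marker-based check plus the set comparison described under future work. The regex markers are illustrative guesses and may not match the rules in widget_probe_min.py; `detect_widgets` and `compare_flags` are hypothetical names, and the BuiltWith side is represented only as a plain list of domains.

```python
import re

# Illustrative markers only; the actual script may use different rules.
MARKERS = {
    "has_calendar": [r"calendar\.google\.com", r"fullcalendar", r"tribe-events"],
    "has_chatbot": [r"tawk\.to", r"intercom", r"drift\.com"],
    "has_search": [r"type=.?search\b", r"role=.?search\b"],
    "has_translation": [r"translate\.google\.com", r"google_translate_element"],
}


def detect_widgets(html):
    """Return one boolean per widget flag, based on regex markers in the HTML."""
    return {
        flag: any(re.search(p, html, re.IGNORECASE) for p in patterns)
        for flag, patterns in MARKERS.items()
    }


def compare_flags(heuristic_domains, reference_domains):
    """Split domains into agreement and disagreement between two sources."""
    h, r = set(heuristic_domains), set(reference_domains)
    return {
        "both": sorted(h & r),
        "heuristic_only": sorted(h - r),
        "reference_only": sorted(r - h),
    }


page = (
    '<form role="search"><input type="search"></form>'
    '<div id="google_translate_element"></div>'
)
flags = detect_widgets(page)
diff = compare_flags(["a.org", "b.org"], ["b.org", "c.org"])
print(flags)
print(diff)
```

Domains in `reference_only` would be candidates for new heuristic markers; domains in `heuristic_only` would be worth a manual look for false positives.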

Adaakal, Oct 21 '25 01:10