Stagecoach scraper

Open JackGilmore opened this issue 2 years ago • 1 comment

[!IMPORTANT]
Still in progress. Sharing the draft with the community.

Outstanding tasks

  • [ ] Contact Stagecoach to confirm the dataset licence and ask whether they can fix a malformed HTML tag on their website
  • [x] Add datasets into merge_data.py
  • [ ] Test full pipeline and check how datasets appear on frontend

Description

  • Stagecoach scraper using new JSON scraper format
  • Also adds new common methods to processor.py for HTML scraping:
    • get_html
    • get_html_head
    • get_http_content_length
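
For readers who have not opened the diff, here is a rough sketch of what helpers with these names might look like. Only the three names come from this PR. The signatures, return types, the timeout value, and the use of the standard library's urllib are assumptions made for illustration, and the real methods in processor.py may differ. In particular, it is assumed here that get_html_head issues an HTTP HEAD request rather than extracting the page's `<head>` element.

```python
from typing import Dict, Optional
from urllib.request import Request, urlopen

# Assumed default; the PR does not state a timeout.
TIMEOUT_SECONDS = 30


def get_html(url: str) -> str:
    """Fetch a page and return its body decoded as text."""
    with urlopen(Request(url), timeout=TIMEOUT_SECONDS) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def get_html_head(url: str) -> Dict[str, str]:
    """Send a HEAD request and return the response headers (lower-cased names)."""
    request = Request(url, method="HEAD")
    with urlopen(request, timeout=TIMEOUT_SECONDS) as response:
        return {name.lower(): value for name, value in response.headers.items()}


def get_http_content_length(url: str) -> Optional[int]:
    """Return the Content-Length reported for a URL, or None if it is absent."""
    value = get_html_head(url).get("content-length")
    if value is None or not value.strip().isdigit():
        return None
    return int(value.strip())
```

A scraper could call get_http_content_length to record a file size for each dataset without downloading the file itself. Because urllib also accepts `data:` URLs, the helpers can be tried offline, for example with `get_html("data:text/html,<p>hi</p>")`.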

Motivation and Context

Closes #120

How Has This Been Tested?

Ran the script locally; it produces the expected dataset file at data\bespoke_Stagecoach\Stagecoach.json

Screenshots (if appropriate):

Types of changes

  • [ ] Bug fix (non-breaking change which fixes an issue)
  • [X] New feature (non-breaking change which adds functionality)
  • [ ] Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • [X] My code follows the code style of this project.
  • [X] My change requires a change to the documentation.
  • [ ] I have updated the documentation accordingly.
  • [X] I have read the CONTRIBUTING document.
  • [ ] I have added tests to cover my changes.
  • [ ] All new and existing tests passed.

JackGilmore, Oct 29 '23 15:10

Emailed Stagecoach's open data address ([email protected]) to query the dataset licence and metadata

JackGilmore, Nov 03 '23 20:11