the_od_bods icon indicating copy to clipboard operation
the_od_bods copied to clipboard

Replace .csv in usmart api

Open KarenJewell opened this issue 2 years ago • 13 comments

Is your feature request related to a problem? Please describe. Currently, the usmart api script (usmart.py) outputs to a .csv which is then ingested into merge_data.py. This should be converted to a .json output to keep consistent with the rest of the project.

Describe the solution you'd like

  • [ ] Change the output of usmart.py to be a .json file
  • [ ] Update merge_data.py to ingest the new .json file
  • [ ] format code with Black before PR
  • [ ] update docs (architecture.md and datapipeline.png)

Describe alternatives you've considered None

Additional context Original ticket triggering this change is #163

KarenJewell avatar Feb 18 '23 15:02 KarenJewell

@KarenJewell please assign

paul-bradbeer avatar Oct 15 '23 16:10 paul-bradbeer

@paul-bradbeer It's all yours! Don't forget to check out our docs at https://docs.opendata.scot and don't hesitate to give us a shout here or on Slack if you have any questions.

JackGilmore avatar Oct 15 '23 17:10 JackGilmore

@JackGilmore @KarenJewell Good docs - thanks. Also the walkthrough video crystallised things clearly. I've been preoccupied of late, but I will have more time in November. In the meantime, it strikes me that there are a number of these csv to json type conversions and so it seems likely the first one done will serve as a pattern for imitation to some degree - perhaps some common machinery will benefit. JSON will be slightly bulkier than CSV, but I don't think the data volume is sufficient for this to be a concern - let me know if you disagree. Standardising on JSON will also allow for easier representation of hierarchical structure if needed.

paul-bradbeer avatar Oct 27 '23 07:10 paul-bradbeer

@paul-bradbeer Good shout. I did create a write_json method for processor.py when I was writing the scraper for the Scottish Parliament portal (scottish_parliament.py) which is probably an OK start but it really just takes the list of dict objects I created and json.dumps them.

Ideally once I get time (or if someone gets there before me!) I'd like to declare some proper classes for the JSON schema we want to use with the aim of saving people from having to write their own all the time.

JackGilmore avatar Oct 29 '23 13:10 JackGilmore

@JackGilmore JSON schema good if they all look the same or are very similar; just checking that now.

paul-bradbeer avatar Nov 04 '23 08:11 paul-bradbeer

@paul-bradbeer Can't find your comment anymore but I've updated the Contributor Guide with better guidance for forking and creating a branch. When we wrote the docs we must've been viewing the pages as members of the OpenDataScotland GitHub org meaning we've got slightly more elevated permissions to write directly to the repo compared to contributors that will need to fork instead 🤦‍♂️ https://docs.opendata.scot/contribute/contribution-guide/

JackGilmore avatar Nov 05 '23 13:11 JackGilmore

@JackGilmore removed it once I figured out that what I was commenting on was out of date. Forking in the usual way is fine.

paul-bradbeer avatar Nov 05 '23 18:11 paul-bradbeer

@paul-bradbeer Kept going one some work with the common schema stuff and built some common classes for datasets and file resources in the branch I've got at the moment for building a Stagecoach scraper as part of #120

Can see more info here if you fancy a look:

  • Classes file in branch: https://github.com/OpenDataScotland/the_od_bods/tree/120_stagecoach_scraper/classes
  • Open PR for new scraper: https://github.com/OpenDataScotland/the_od_bods/pull/258

JackGilmore avatar Nov 11 '23 21:11 JackGilmore

@JackGilmore Today I am submitting my pull request for the completed works, including a dataclass object as a sort of schema.

I can't see how this process works if issues are allocated to people and then others go off and work on the same thing. It seems to be a very inefficient way to do things resulting in duplication.

I will submit my pull request; do with it as you please but my spare time is nearly up.

The solution is a pattern for immitation with a Json and Csv decorator class - so folks can see which are json and which are csv It allows a mix and match approach.

paul-bradbeer avatar Nov 12 '23 16:11 paul-bradbeer

@JackGilmore Draft pull request #260 created. See what you make of it.

paul-bradbeer avatar Nov 12 '23 16:11 paul-bradbeer

@JackGilmore the pull request doesn't seem to be linked. Are you able to link pull request 260 to this issue; I don't seem to be able to do that.

paul-bradbeer avatar Nov 14 '23 13:11 paul-bradbeer

@paul-bradbeer Thanks for your PR and apologies for the delay in getting back to this. The last couple of weeks have been a tad busy.

Please also accept my apologies if you thought I was duplicating your work. I had been prototyping a few things for the pipeline and hadn't realised that you were working on a common decorator class as part of your work. Nevertheless, your classes actually look a lot more organised and make better sense than what I was playing about with, so thank you! Your contribution is greatly appreciated and we're always looking for feedback on how we can improve the contribution process so I'll take this on board and see how we can update the way we coordinate issues and PRs.

I'm aiming to try to review this by the end of next week.

JackGilmore avatar Nov 25 '23 16:11 JackGilmore

@JackGilmore Thanks. I wasn't sure how much overlap there would be and I probably should have checked your PR to see but I was running out of spare time. Hopefully you can make some use of it.

paul-bradbeer avatar Nov 26 '23 19:11 paul-bradbeer