nowcasting_dataset
Move our data download & conversion scripts into separate repos (preparing for production)
Our code for downloading and converting satellite data lives in the Satip repo.
Our code for downloading and converting other data sources (NWP, GSP PV, etc.) lives in this repo.
Are we happy with this situation?
Or should we have separate repos for downloading and preparing each data source? Or maybe just for the "big" data sources (e.g. NWPs? And GSP?).
Advantages of splitting out the downloading-and-converting scripts into separate repos:
- Reduces number of dependencies required by nowcasting_dataset
- Makes it a little easier for other people to re-use our downloading-and-converting code for other purposes
Disadvantages:
- Managing multiple repos can get time-consuming!
- Some of the data sources literally just require a single conversion script (e.g. converting the PassivSystems PV data). Do we really want a whole repo for a single conversion script?
What do you guys think, @jacobbieker and @peterdudfield?
Possibly a repo for all the download scripts, but I wouldn't really separate them all out into separate repos, as yeah, it gets a bit confusing keeping everything in sync. Would it reduce a lot of dependencies? I would think a decent amount of the dependencies would still need to be used for nowcasting-dataset.
If we look ahead to the live system, we will have a service for each data source; in that case I think it will be good to have separate repos. There will be Docker files, separate tests, and infrastructure code, which are handled much more easily in smaller chunks. Versioning is also a bit easier then, i.e. we can update one data source at a time.
I would be tempted to leave them here for the moment, until we start building out these data_source 'consumers', and then that'll be a good time to separate things?
Yeah, that's a good point. So maybe leave this for later then? Also, if each one is then bundled with infra code, Docker files, etc., it makes more sense as separate repos than it does currently, where a lot of them would just be single files.
If we look ahead to the live system, we will have a service for each data source; in that case I think it will be good to have separate repos.
Ooh, that's an excellent point!
Cool. As you've both suggested, I agree that we should leave our data conversion scripts where they are for the remainder of 2021.
I've moved this issue into the "WP2" project.
Good use of https://github.com/orgs/openclimatefix/projects/6