Add automated pipeline for pulling, processing, and saving to HF/GCP/Wherever

Open jacobbieker opened this issue 3 years ago • 1 comments

Detailed Description

Instead of relying on someone to manually run the data pipeline to get more satellite data, we should try automating it, so the dataset just grows on its own.

This could be built with something like Prefect or Airflow, by reusing the current app, or with something else. Adding support for it would probably be fairly self-contained compared to the rest of the codebase.
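To make the "self-contained" point concrete, here is a rough sketch of what a single entry point could look like, so that whichever scheduler we pick (Prefect, Airflow, or the current app) only has to call one function. Everything here is hypothetical: `run_pipeline`, `PipelineResult`, and the `pull`/`process`/`save` callables are placeholder names, not anything that exists in the codebase today.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Any, Callable, List


@dataclass
class PipelineResult:
    """Summary of one pipeline run (hypothetical structure)."""
    start: datetime
    end: datetime
    n_pulled: int
    n_saved: int


def run_pipeline(
    last_saved: datetime,
    now: datetime,
    pull: Callable[[datetime, datetime], List[Any]],
    process: Callable[[Any], Any],
    save: Callable[[List[Any]], int],
) -> PipelineResult:
    """Pull data newer than `last_saved`, process it, and save it.

    The three callables are injected so the scheduler and the storage
    target (HF, GCP, ...) can be swapped without touching this function.
    """
    raw = pull(last_saved, now)
    if not raw:
        # Nothing new since the last run, so skip the save step entirely.
        return PipelineResult(last_saved, now, 0, 0)
    processed = [process(item) for item in raw]
    n_saved = save(processed)
    return PipelineResult(last_saved, now, len(raw), n_saved)
```

A scheduled job would then just call `run_pipeline` with the timestamp of the last stored sample and the current time, which keeps the orchestration tool choice separate from the pipeline logic.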

Context

I manually pull new data every once in a while, but a standardized, scheduled pipeline would probably keep the data more up to date. It would also make it easier to add stricter checks than my current approach of looking at random examples and running simple checks such as no NaNs.

Possible Implementation

We would want to make sure the data is sensible, so maybe add something like https://github.com/great-expectations/great_expectations to check the data before it's added to the Zarr store.
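As a rough idea of the kind of checks that could run before an append, here is a minimal plain-Python sketch. It does not use the great_expectations API; it only illustrates the checks themselves. `validate_batch`, the flat list-of-floats frame layout, and the `(0.0, 1.0)` valid range are all assumptions for illustration, and a real version would more likely operate on the arrays going into the Zarr store.

```python
import math
from datetime import datetime
from typing import List, Optional, Sequence, Tuple


def validate_batch(
    timestamps: Sequence[datetime],
    frames: Sequence[Sequence[float]],
    last_stored: Optional[datetime] = None,
    valid_range: Tuple[float, float] = (0.0, 1.0),  # assumed range, not from the repo
) -> List[str]:
    """Return a list of problems; an empty list means the batch looks sane."""
    problems: List[str] = []

    if len(timestamps) != len(frames):
        problems.append(
            f"{len(timestamps)} timestamps but {len(frames)} frames"
        )

    # Timestamps must be strictly increasing and newer than what is stored.
    previous = last_stored
    for t in timestamps:
        if previous is not None and t <= previous:
            problems.append(
                f"timestamp {t.isoformat()} is not after {previous.isoformat()}"
            )
        previous = t

    # Report at most one problem per frame: the first NaN or out-of-range value.
    low, high = valid_range
    for t, frame in zip(timestamps, frames):
        for value in frame:
            if math.isnan(value):
                problems.append(f"NaN in frame at {t.isoformat()}")
                break
            if not low <= value <= high:
                problems.append(
                    f"value {value} outside [{low}, {high}] at {t.isoformat()}"
                )
                break

    return problems
```

The pipeline would only append a batch when the returned list is empty, and could log or alert on the problems otherwise.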

jacobbieker, Oct 27 '22 08:10

Another great option to try would be Pangeo Forge (https://pangeo-forge.org/), which is already used for a lot of NWP and climate data and would seemingly work for all our datasets, other than maybe PV.

jacobbieker, Nov 03 '22 15:11