Collect PVOutput.org worldwide data
Detailed Description
We need to download PV data for each country from PVOutput.org.
Context
Metadata has already been downloaded to /mnt/storage_ssd_4tb/PVOutput/ on Leonardo.
Possible Implementation
- Update the notebook download_pv_timeseries.ipynb (https://github.com/openclimatefix/pvoutput/blob/main/examples/download_pv_timeseries.ipynb)
- Need to get credentials for the website
- Make a script with inputs: metadata file (CSV) and output filename (HDF) - see the sketch after this list
- Save the data to /mnt/storage_b/data/ocf/solar_pv_nowcasting/nowcasting_dataset_pipeline/PV/PVOutput.org, maybe in a directory per country
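A rough sketch of what that script could look like. The download_country_timeseries helper below is a hypothetical placeholder for the download logic in download_pv_timeseries.ipynb, and the argument and column names are illustrative rather than settled:

```python
"""Sketch of a per-country CLI for downloading PVOutput.org timeseries.

Hypothetical placeholder code: `download_country_timeseries` stands in for the
download logic in download_pv_timeseries.ipynb, and the `system_id` column name
is an assumption about the metadata CSV.
"""
import argparse

import pandas as pd


def download_country_timeseries(system_ids: pd.Series) -> pd.DataFrame:
    """Placeholder for the pvoutput-library download logic (see the notebook)."""
    raise NotImplementedError


def main() -> None:
    parser = argparse.ArgumentParser(description="Download PVOutput.org PV timeseries")
    parser.add_argument("--metadata", required=True, help="path to system metadata CSV")
    parser.add_argument("--output", required=True, help="path to output HDF file")
    args = parser.parse_args()

    # One row per PV system in the metadata file (assumed keyed by system_id).
    metadata = pd.read_csv(args.metadata)
    timeseries = download_country_timeseries(metadata["system_id"])

    # Write the combined timeseries for this country to a single HDF file.
    timeseries.to_hdf(args.output, key="timeseries", mode="w")


if __name__ == "__main__":
    main()
```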
@jacobbieker do you know where the metadata is on Leonardo?
The script from download_pv_timeseries.ipynb might be useful.
/mnt/storage_ssd_4tb/PVOutput/
A couple of things I've run into are leaning me in favour of separating the base library code from the import scripts that use it:
- I feel uncomfortable adding package requirements to the environment file for the scripts when the library itself doesn't need them
- Likewise, I don't want to put the scripts' function tests alongside the library's tests
- Python relative imports might make it a pain to run the script locally / on Leonardo
Have you any thoughts on these?
Also, I noticed the documented command to install the library doesn't actually work - I've modified it on my branch. On a related note, is there a reason we aren't building the library as a wheel and distributing it via PyPI?
Not saying any of this needs doing imminently or at all, just some thoughts!
Hey @devsjc,
- Yeah, if you want to do a separate repo with scripts then that's all good. A slight advantage of putting them here is that other people could then see how to use the library, but it's totally up to you.
- Yeah, scripts should be very high level, so their tests should not be here - perhaps another reason to move them to a different repo.
- Relative imports are tricky; I've always installed the library and then run scripts against that (a small illustration is below).
- Thanks for fixing that. I agree we should release it to PyPI - lots of our other repos do that, and we can do an upgrade to that soon if we like.
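As a small illustration of that workflow (assuming the package has been installed into the active environment, e.g. with pip install -e . from the repo root), a standalone script can then import the library absolutely:

```python
# Assumption: the pvoutput package has been installed into the active environment
# (e.g. with `pip install -e .` from the repo root). A standalone script can then
# import the library absolutely:
from pvoutput import PVOutput

# ...rather than relying on a relative import such as
# `from ..pvoutput import PVOutput`, which breaks when the file is run directly
# as a script (e.g. `python scripts/download_pv.py`).
```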
I'll leave it in here for now, to get my ducks in a row before I start making organisational changes! And will bear the above in mind during any further exploration or development that ends up needing doing.
First pass at a script is done on branch - I guess it's up to Jacob as to how useful that is vs the notebook it's derived from! See #118
That draft looks great! I cobbled together a script from what was here before, so it's far from optimal and very hacky. So yeah, anything is an improvement over that!
This is still a first pass I would say - it can definitely be improved upon if its use case changes from the occasional non-critical manual import, and once I have a more holistic view of the problem! It is pretty much a like-for-like of your notebook, but built for CLI usage.
Let's try pulling data from 2016 onwards, @jacobbieker, unless you think we should do more?
I would go back to 2014, as that's how far back our satellite data goes at the moment. NWPs only go back to 2016 for the Met Office, but we can get GFS or ERA5 for further back if we want.
Began the script import this morning, pulling data from 2014 onwards.
@jacobbieker I would be grateful at some point if I could get some more info on what happens to this .hdf data - so far my understanding from Peter is that it ends up as .netcdf somehow, but it would be good to know where/why/how that process occurs so I can get a view of the whole data journey here.
@devsjc So I don't have the best idea of the whole thing. I hacked that script together to get data in the RSS field, but hadn't gotten around to understanding how to get the output into the netCDF files. I was looking at a script @peterdudfield made for converting the Italy PV data into the right format, so I can try to find that, but it didn't necessarily work with the Netherlands or French data I downloaded before. The Netherlands was/is an odd one because it gave essentially 0 systems when I tried downloading everything, even though it has ~11000 on PVOutput.org's site.
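To give a rough idea of the shape of that HDF-to-netCDF conversion step, here is a minimal sketch. It assumes the HDF file holds a datetime-indexed table with one power column per system; it is not Peter's actual conversion script, and the filenames and key are made up:

```python
"""Sketch only: convert a downloaded per-country HDF timeseries into netCDF.

Assumptions: the HDF key "timeseries" holds a DataFrame indexed by datetime with
one column per system id. The real conversion script may differ.
"""
import pandas as pd
import xarray as xr

pv_power = pd.read_hdf("UK.hdf", key="timeseries")

# Build a 2D DataArray of power with (datetime, system_id) dimensions.
da = xr.DataArray(
    pv_power.values,
    dims=("datetime", "system_id"),
    coords={"datetime": pv_power.index, "system_id": pv_power.columns},
    name="power",
)

# One netCDF file per country keeps the datasets independently selectable.
da.to_netcdf("UK.nc")
```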
Okay, thanks! So netCDF is the format you're importing in the end to train with? And I guess that's via mounting the volume and reading in the files individually in Python?
Yeah, so here is the code we use for loading the netCDF files for training: https://github.com/openclimatefix/ocf_datapipes/blob/a687eee51545411a0e80ff13ef49d1c67b1dca5b/ocf_datapipes/load/pv/pv.py#L26-L65
And yeah, each file is read individually; we pass in the metadata and the netCDF file at the same time though:
```python
load_everything_into_ram(
    pv_power_filename,
    pv_metadata_filename,
    ...
)
```
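Roughly speaking, that amounts to reading the per-country netCDF of PV power alongside the metadata CSV and holding both fully in memory. A stand-alone sketch (not the actual ocf_datapipes code, and with made-up filenames):

```python
# Sketch only, not the ocf_datapipes implementation: load one country's PV power
# netCDF plus its metadata CSV fully into memory, roughly what the call above does.
import pandas as pd
import xarray as xr

pv_power = xr.open_dataset("PVOutput.org/UK.nc")           # generation timeseries
pv_metadata = pd.read_csv("PVOutput.org/UK_metadata.csv")  # static per-system info

# Everything is materialised in RAM up front; training then streams from memory.
pv_power.load()
```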
pv_power_filename is referring to the netCDF file, is it?
Yes
So once loaded in, all the data from each individual country's .netcdf file, and the system metadata for all the systems in that country, are joined together and fully loaded into RAM? The country and file split becomes moot?
Apologies for being slow!
I would keep things split up into countries. The reasons are:
- at the moment, models normally just take one country in
- the ML config files have the option of taking in several .netcdf files, so we can select multiple datasets
Don't apologise!
That makes sense.
And in the end these .netcdf files are read in using xarray to a ~~pandas dataframe~~ xarray DataArray, which is the final form in which the data is passed to the ML using the PyTorch datapipe?
Also - and take this with the knowledge that I have very limited PyTorch experience, so I could absolutely be reading it wrong - it seems like the streaming datapipe is loading everything into memory and streaming from that, as opposed to streaming over the files themselves?
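To make that question concrete, here is a toy illustration of the pattern I think I'm seeing, written with a plain PyTorch IterableDataset rather than the actual datapipes, and with an assumed "datetime" dimension name: the file is read into memory once, and iteration then yields slices of the in-memory array rather than re-reading from disk.

```python
# Toy illustration of the "load into RAM, then stream" pattern, NOT the actual
# ocf_datapipes implementation; the "datetime" dimension name is an assumption.
import xarray as xr
from torch.utils.data import IterableDataset


class InMemoryPVDataset(IterableDataset):
    def __init__(self, netcdf_path: str, window: int = 12):
        # The whole file is read into memory once, up front.
        self.da = xr.open_dataset(netcdf_path).to_array().load()
        self.window = window

    def __iter__(self):
        # Samples are then streamed from the in-memory DataArray,
        # not from the files on disk.
        n_steps = self.da.sizes["datetime"]
        for start in range(n_steps - self.window):
            yield self.da.isel(datetime=slice(start, start + self.window)).values
```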
All data has been loaded onto Leonardo, at /mnt/storage_ssd_4tb/PVOutput.