Collect PVOutput.org worldwide data
Detailed Description
We need to download PV data for each country from PVOutput.org.
Context
Metadata has already been downloaded to /mnt/storage_ssd_4tb/PVOutput/ on Leonardo.
Possible Implementation
- Update the notebook download_pv_timeseries.ipynb (https://github.com/openclimatefix/pvoutput/blob/main/examples/download_pv_timeseries.ipynb)
- Need to get credentials for the website
- Make a script with inputs: metadata file (CSV) and output filename (HDF) - see the sketch after this list
- Save the data to /mnt/storage_b/data/ocf/solar_pv_nowcasting/nowcasting_dataset_pipeline/PV/PVOutput.org, maybe in a directory per country
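A rough sketch of what that script could look like. The download_country_timeseries helper below is a hypothetical placeholder for the download logic in download_pv_timeseries.ipynb, and the argument and column names are illustrative rather than settled:

```python
"""Sketch of a per-country CLI for downloading PVOutput.org timeseries.

Hypothetical placeholder code: `download_country_timeseries` stands in for the
download logic in download_pv_timeseries.ipynb, and the `system_id` column name
is an assumption about the metadata CSV.
"""
import argparse

import pandas as pd


def download_country_timeseries(system_ids: pd.Series) -> pd.DataFrame:
    """Placeholder for the pvoutput-library download logic (see the notebook)."""
    raise NotImplementedError


def main() -> None:
    parser = argparse.ArgumentParser(description="Download PVOutput.org PV timeseries")
    parser.add_argument("--metadata", required=True, help="path to system metadata CSV")
    parser.add_argument("--output", required=True, help="path to output HDF file")
    args = parser.parse_args()

    # One row per PV system in the metadata file (assumed keyed by system_id).
    metadata = pd.read_csv(args.metadata)
    timeseries = download_country_timeseries(metadata["system_id"])

    # Write the combined timeseries for this country to a single HDF file.
    timeseries.to_hdf(args.output, key="timeseries", mode="w")


if __name__ == "__main__":
    main()
```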
@jacobbieker do you know where the metadata is on Leonardo?
The script from download_pv_timeseries.ipynb might be useful.
/mnt/storage_ssd_4tb/PVOutput/
A couple of things I've run into are leaning me in favour of separating the base library code from the import scripts that use it:
- I feel uncomfortable adding package requirements to the environment file for the scripts when the library itself doesn't need them
- Likewise, I don't want to put the scripts' function tests alongside the library's tests
- Python relative imports might make it a pain to run the script locally / on Leonardo
Have you any thoughts on these?
Also, I noticed the documented command to install the library doesn't actually work - I've modified it on my branch. On a related note, is there a reason we aren't building the library as a wheel and distributing it via PyPI?
Not saying any of this needs doing imminently or at all, just some thoughts!
Hey @devsjc,
- Yeah, if you want to do a separate repo with scripts then that's all good. A slight advantage of putting them here is that other people could then see how to use the library, but it's totally up to you.
- Yeah, scripts should be very high level, so their tests should not be here - perhaps another reason to move them to a different repo.
- Relative imports are tricky; I've always installed the library and then run scripts against that (a small illustration is below).
- Thanks for fixing that. I agree we should release it to PyPI - lots of our other repos do that, and we can do an upgrade to that soon if we like.
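As a small illustration of that workflow (assuming the package has been installed into the active environment, e.g. with pip install -e . from the repo root), a standalone script can then import the library absolutely:

```python
# Assumption: the pvoutput package has been installed into the active environment
# (e.g. with `pip install -e .` from the repo root). A standalone script can then
# import the library absolutely:
from pvoutput import PVOutput

# ...rather than relying on a relative import such as
# `from ..pvoutput import PVOutput`, which breaks when the file is run directly
# as a script (e.g. `python scripts/download_pv.py`).
```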
I'll leave it in here for now, to get my ducks in a row before I start making organisational changes! And will bear the above in mind during any further exploration or development that ends up needing doing.
First pass at a script is done on branch - I guess it's up to Jacob as to how useful that is vs the notebook it's derived from! See #118
That draft looks great! I cobbled together a script from what was here before, so it's far from optimal and very hacky. So yeah, anything is an improvement over that!
This is still a first pass I would say - it can definitely be improved upon if its use case changes from the occasional non-critical manual import, and once I have a more holistic view of the problem! It is pretty much a like-for-like of your notebook, but built for CLI usage.
Let's try pulling data from 2016 onwards, @jacobbieker, unless you think we should do more?
I would go back to 2014, as that's how far back our satellite data goes at the moment. NWPs only go back to 2016 for the Met Office, but we can get GFS or ERA5 for further back if we want.
Began the script import this morning, pulling data from 2014 onwards.
@jacobbieker I would be grateful at some point if I could get some more info on what happens to this .hdf data - so far my understanding from Peter is that it ends up as .netcdf somehow, but it would be good to know where/why/how that process occurs so I can get a view of the whole data journey here.
@devsjc So I don't have the best idea of the whole thing. I hacked that script together to get data in the RSS field, but hadn't gotten around to understanding how to get the output into the netCDF files. I was looking at a script @peterdudfield made for converting the Italy PV data into the right format, so I can try to find that, but it didn't necessarily work with the Netherlands or French data I downloaded before. The Netherlands was/is an odd one because it gave essentially 0 systems when I tried downloading everything, even though it has ~11000 on PVOutput.org's site.
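To give a rough idea of the shape of that HDF-to-netCDF conversion step, here is a minimal sketch. It assumes the HDF file holds a datetime-indexed table with one power column per system; it is not Peter's actual conversion script, and the filenames and key are made up:

```python
"""Sketch only: convert a downloaded per-country HDF timeseries into netCDF.

Assumptions: the HDF key "timeseries" holds a DataFrame indexed by datetime with
one column per system id. The real conversion script may differ.
"""
import pandas as pd
import xarray as xr

pv_power = pd.read_hdf("UK.hdf", key="timeseries")

# Build a 2D DataArray of power with (datetime, system_id) dimensions.
da = xr.DataArray(
    pv_power.values,
    dims=("datetime", "system_id"),
    coords={"datetime": pv_power.index, "system_id": pv_power.columns},
    name="power",
)

# One netCDF file per country keeps the datasets independently selectable.
da.to_netcdf("UK.nc")
```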
Okay, thanks! So netCDF is the format you're importing in the end to train with? And I guess that's via mounting the volume and reading in the files individually in Python?
Yeah, so here is the code we use for loading the netCDF files for training: https://github.com/openclimatefix/ocf_datapipes/blob/a687eee51545411a0e80ff13ef49d1c67b1dca5b/ocf_datapipes/load/pv/pv.py#L26-L65
And yeah, each file is read individually; we pass in the metadata and the netCDF file at the same time though:
```python
load_everything_into_ram(
    pv_power_filename,
    pv_metadata_filename,
    ...
)
```
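Roughly speaking, that amounts to reading the per-country netCDF of PV power alongside the metadata CSV and holding both fully in memory. A stand-alone sketch (not the actual ocf_datapipes code, and with made-up filenames):

```python
# Sketch only, not the ocf_datapipes implementation: load one country's PV power
# netCDF plus its metadata CSV fully into memory, roughly what the call above does.
import pandas as pd
import xarray as xr

pv_power = xr.open_dataset("PVOutput.org/UK.nc")           # generation timeseries
pv_metadata = pd.read_csv("PVOutput.org/UK_metadata.csv")  # static per-system info

# Everything is materialised in RAM up front; training then streams from memory.
pv_power.load()
```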
pv_power_filename is referring to the netCDF file, is it?
Yes
So once loaded in, all the data from each individual country's .netcdf file, and the system metadata for all the systems in that country, are joined together and fully loaded into RAM? The country and file split becomes moot?
Apologies for being slow!
I would keep things split up into countries. The reasons are:
- at the moment, models normally just take one country in
- the ML config files have the option of taking in several .netcdf files, so we can select multiple datasets
Don't apologise!
That makes sense.
And in the end these .netcdf files are read in using xarray to a ~~pandas dataframe~~ xarray DataArray, which is the final form in which the data is passed to the ML using the PyTorch datapipe?
Also - and take this with the knowledge that I have very limited PyTorch experience, so I could absolutely be reading it wrong - it seems like the streaming datapipe is loading everything into memory and streaming from that, as opposed to streaming over the files themselves?
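To make that question concrete, here is a toy illustration of the pattern I think I'm seeing, written with a plain PyTorch IterableDataset rather than the actual datapipes, and with an assumed "datetime" dimension name: the file is read into memory once, and iteration then yields slices of the in-memory array rather than re-reading from disk.

```python
# Toy illustration of the "load into RAM, then stream" pattern, NOT the actual
# ocf_datapipes implementation; the "datetime" dimension name is an assumption.
import xarray as xr
from torch.utils.data import IterableDataset


class InMemoryPVDataset(IterableDataset):
    def __init__(self, netcdf_path: str, window: int = 12):
        # The whole file is read into memory once, up front.
        self.da = xr.open_dataset(netcdf_path).to_array().load()
        self.window = window

    def __iter__(self):
        # Samples are then streamed from the in-memory DataArray,
        # not from the files on disk.
        n_steps = self.da.sizes["datetime"]
        for start in range(n_steps - self.window):
            yield self.da.isel(datetime=slice(start, start + self.window)).values
```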
All data has been loaded onto Leonardo, at /mnt/storage_ssd_4tb/PVOutput.