
SURFRAD site & date-range download

Open mikofski opened this issue 3 years ago • 9 comments

Is your feature request related to a problem? Please describe. The current SURFRAD iotools only reads a single day's .dat file from either a URL or the filesystem, e.g.:

# read from url
pvlib.iotools.read_surfrad('ftp://aftp.cmdl.noaa.gov/data/radiation/surfrad/Bondville_IL/2021/bon21001.dat')
# read from file
pvlib.iotools.read_surfrad('bon21001.dat')

Unfortunately, I can't quickly read an arbitrarily large date range. I can use pvlib.iotools.read_surfrad in a loop, but it takes a long time to serially read in an entire year. Maybe it would be faster if I already had the files downloaded. It takes about 1 second to read a single 111 kB file, so for 10,000 files that would be about 3 hours, which is too long if I have to read 7 sites.

%%timeit
bon95 = [
    pvl.iotools.read_surfrad(r'ftp://aftp.cmdl.noaa.gov/data/radiation/surfrad/Bondville_IL/1995/bon95%03d.dat' % (x+1))
    for x in range(16)]  # read in 16 files

14.4 s ± 295 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

That's 14.4[s] / 16[files] = 0.9[s] per file. I tried to use threading, but then I get connection errors; I think there's a limit of 5 connections to the NOAA FTP server from a single computer. Threading should still bring it down to about 30 minutes, so maybe I didn't try hard enough? Anyway, I went a different way.

Describe the solution you'd like The current read_surfrad uses Python's urllib.request.urlopen for each connection. I have found that opening a long-lived FTP connection using Python's ftplib allows downloading many more files by reusing the same connection. However, this download is still serial, so I have also found that using Python threading lets me open up to 5 simultaneous connections; any more and I get a 421 FTP error, too many connections.
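The approach described above can be sketched roughly as follows. This is a hedged sketch, not pvlib code: the host and directory layout are taken from the FTP URL earlier in the thread, and chunked, download_chunk, and download_year are hypothetical helper names.

```python
# Sketch of the described approach: each worker thread keeps one long-lived
# FTP connection and downloads its chunk of files serially; the pool is
# capped at 5 workers because a 6th connection is refused with a 421 error.
# The station prefix and directory layout follow the Bondville example.
import ftplib
import os
from concurrent.futures import ThreadPoolExecutor

HOST = "aftp.cmdl.noaa.gov"
MAX_CONNECTIONS = 5  # empirical per-host limit on the NOAA FTP server


def chunked(items, n):
    """Split items into n roughly equal chunks, one per worker."""
    k, m = divmod(len(items), n)
    return [items[i * k + min(i, m):(i + 1) * k + min(i + 1, m)]
            for i in range(n)]


def download_chunk(remote_dir, filenames, dest="."):
    """Fetch a batch of files by reusing a single FTP connection."""
    with ftplib.FTP(HOST) as ftp:
        ftp.login()  # anonymous login
        ftp.cwd(remote_dir)
        for name in filenames:
            with open(os.path.join(dest, name), "wb") as f:
                ftp.retrbinary("RETR " + name, f.write)


def download_year(station, site_dir, year, ndays=365, dest="."):
    """Download one station-year, e.g. download_year('bon', 'Bondville_IL', 1995)."""
    names = ["%s%02d%03d.dat" % (station, year % 100, d + 1)
             for d in range(ndays)]
    remote = "data/radiation/surfrad/%s/%d" % (site_dir, year)
    with ThreadPoolExecutor(max_workers=MAX_CONNECTIONS) as pool:
        for batch in chunked(names, MAX_CONNECTIONS):
            pool.submit(download_chunk, remote, batch, dest)
```

One design note: splitting the file list into exactly MAX_CONNECTIONS chunks keeps each connection busy for its whole chunk, which matches the observation that a single FTP connection downloads many files quickly once opened.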

Describe alternatives you've considered I was able to open the FTP site directly in Windows, but it was also a serial connection, so for about 10,000 files (about 1 GB) it would have taken 4 hours. By contrast, using ftplib and threading I can download all of the data from a single site in about 25 minutes.

Additional context #590 #595 gist of my working script: https://gist.github.com/mikofski/30455056b88a5d161598856cc4eedb2c

mikofski avatar Jan 30 '21 07:01 mikofski

Maybe I should've posted this in the group first? Is there any appetite for this? Better as a script or as a module? Logging okay or less?

mikofski avatar Jan 30 '21 07:01 mikofski

I think the existing function is OK as is because SURFRAD publishes daily files. It's not the intent of that function to read a year of data.

I have downloaded years of SURFRAD data using wget. It's not fast, but it's a single command-line statement. Reading into memory isn't too bad using the pvlib function.

But I can see the utility of having both steps in python. What about adding a script to the example gallery as a first step? I'm cautious about adding a get_surfrad function since these functions are the most troublesome to maintain.

cwhanse avatar Feb 01 '21 16:02 cwhanse

Following the patterns in some other iotools modules, I'll suggest (1) refactoring the I/O components out of read_surfrad so that it only performs the parsing, and (2) making a new function read_surfrad_from_noaa_ftp(site, start, end) that manages a thread pool. It seems to me that work should be dispatched by the day, not by the year.

I'm leery of adding an example that uses a thread pool, since any non-trivial I/O in the docs seems to eventually cause problems.

wholmgren avatar Feb 01 '21 19:02 wholmgren

Thanks all!

I think the existing function is OK

I totally agree! After iterating a bit I decided the existing parsing function is fine; I just wanted a faster way to download the raw SURFRAD .dat files. For 7 sites and 25 years of data, and my last-minute work ethic, waiting 28 hours just wasn't feasible 🤣

It seems to me that work should be dispatched by the day, not by the year.

This might work. It's not the way I started, but it could be more convenient for folks who want a date range, especially within a single year. There's a limit to how many FTP connections NOAA will accept; it seems to be exactly 5. Also, an existing FTP connection can download many files serially, quite quickly. The FTP connection also acts like a file system: I think I can use a full path, but I've been changing directories. So in theory we could open 5 connections, break the date range into 5 chunks, and read them until they're done. That makes a lot of sense, and is probably more straightforward than my approach.

Thanks!

mikofski avatar Feb 02 '21 21:02 mikofski

@mikofski Pull request #1254 adds a retrieval function for the monthly data files on the BSRN server. As SURFRAD is part of BSRN this should offer a much quicker way of getting SURFRAD data and perhaps this issue can be closed?

It's worth mentioning that the SURFRAD files include some additional data that the BSRN files do not, such as wind speed and direction and a corresponding flag column for each variable.

AdamRJensen avatar Jul 21 '21 19:07 AdamRJensen

Let me mull it over. I don't know how much overlap there is, but my gut tells me folks will still want to use the raw SURFRAD iotools.

mikofski avatar Jul 27 '21 06:07 mikofski

It may be worth benchmarking the retrieval speeds for each data source before trying to improve the raw surfrad fetch. But removing the existing surfrad fetch/read functions is not on the table.

wholmgren avatar Jul 27 '21 15:07 wholmgren

@mikofski As discussed in #1459, SURFRAD files are available via both FTP and, more recently, HTTPS. It seems there is a significant performance gain (at least a factor of two) to be had by using the HTTPS links (see test below). I figured this might be relevant information for this issue.

[image: timing comparison of retrieving SURFRAD files via FTP vs. HTTPS]

AdamRJensen avatar Aug 05 '22 15:08 AdamRJensen

Wow, that's 3 times faster, but still over a day for 25 years of data. @AdamRJensen can you ask your contact how many HTTPS connections are allowed from the same host? I still think threading this request is the way to go, but maybe we leave that to the user?
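For users rolling their own in the meantime, a threaded HTTPS fetch might look like the sketch below. The gml.noaa.gov base URL is an assumption inferred from the HTTPS discussion in #1459 (mirroring the FTP tree), so verify the path before relying on it; fetch_files and surfrad_url are hypothetical names.

```python
# Sketch of a user-side threaded HTTPS download. The base URL is an
# assumption, and the worker count is kept at 5 pending an answer on
# NOAA's HTTPS connection limit.
import os
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlretrieve

BASE = "https://gml.noaa.gov/aftp/data/radiation/surfrad"


def surfrad_url(site_dir, year, filename):
    """HTTPS path for one daily file, assuming it mirrors the FTP layout."""
    return "%s/%s/%d/%s" % (BASE, site_dir, year, filename)


def fetch_files(site_dir, year, filenames, dest=".", workers=5):
    """Download the named daily files concurrently over HTTPS."""
    def fetch_one(name):
        urlretrieve(surfrad_url(site_dir, year, name),
                    os.path.join(dest, name))
        return name

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch_one, filenames))
```

Unlike FTP, each HTTPS request is independent, so there is no per-worker connection to keep alive; the thread pool alone provides the concurrency.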

Any complaints if I close this issue now? I don't think I'll work on it, and the funny thing is you only need to download the SURFRAD data once. Maybe this is better as a gallery example?

mikofski avatar Aug 05 '22 21:08 mikofski