Workflow for downloading input forcing files without GPU node internet access
Hi all,
I am trying to run ClimaOcean on the Gadi supercomputer in Australia, and only the login CPU node has internet access on the HPC (for security reasons).
This means that I can't run examples that require downloaded files without first manually downloading the input files, placing them in the necessary folders, and then submitting a job to the GPU or CPU nodes on the HPC. This is an OK workaround, but I realised that as others run this model, they may also be using HPC environments that don't have internet access outside of the login node.
I just wanted to flag this as a potential issue, and to discuss whether it may be worth developing a workflow which avoids this need to manually download input files prior to running the model. This may be something that is unavoidable, but I figured I would flag it! Thanks
The files go to JULIA_DEPOT_PATH. Can you set your JULIA_DEPOT_PATH to point to a place that is accessible from the GPU nodes?
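For example, something like the following on the login node before submitting the job (the depot path is a placeholder; adapt it to a filesystem, such as your scratch or project area, that the compute nodes can read):

```shell
# Point the Julia depot at storage visible from both login and compute
# nodes, before instantiating packages or submitting the job.
# The path below is a placeholder for your own scratch/project area.
export JULIA_DEPOT_PATH=/scratch/project/username/.julia

# Sanity check that the variable is set as intended
echo "$JULIA_DEPOT_PATH"
```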
As you say, I think the way to run these cases is to initiate the simulation on CPU, but at a very coarse resolution and only running for a short amount of time.
Do you think that is an acceptable workflow? Perhaps, rather than using a simulation we could develop a utility that's something like download_data(model)? That would avoid the step of having to use run!(simulation).
The issue isn't file access - the GPU nodes are able to access the depot path. The issue is that any attempt to download data (using wget, curl, etc.) fails because the GPU nodes don't have internet access.
Yes, the ideal workflow would be to create a standalone function that can be run on the login node (for example, I have written a bash script with wget that downloads the JRA, ECCO and Bathymetry files into the necessary folders directly from the login node). I may be overblowing the issue, but I think CPU/GPU nodes in many HPCs don't have internet access. So integrating a simple download script that can be run from the login node prior to run!(simulation) would help those running on such HPC environments.
Hope that makes sense!
Why do you have to change the folders that the data is downloaded into?
I don't change the folders the data is downloaded into. I just manually download the data into the folders that code would ordinarily download the data into (if it had internet access).
If I understand correctly, @taimoorsohail wrote a bash script to do what the proposed download_data(model) method would be doing, right?
Actually, a minor note: the CPU nodes on HPCs often don't have internet access either. So the issue is not GPU-specific.
If you run the same script on the login node, using CPU architecture (and coarse resolution and say changing the stop_iteration=1), does it achieve the desired effect?
Also it'd be great to see the bash script!! I am confused why bash is better than julia, but I might be missing something. Possibly, if there is a bash script then we can simply translate the same commands into julia.
It may also be possible to hand a function all of the metadata / other objects that may be associated with data.
The challenge I think is that the data is not explicitly tied to the model. For example, we provide functionality for users to force their model with restoring to ECCO. But they need not set it up the same way every time. They could use a callback, or a forcing function. It is not rigid. So it may be hard to serve a function that is guaranteed to work. We can serve a function that makes many assumptions about a typical setup, looks for data in the typical place, etc. But at that point I am not sure we have made much progress.
A more robust strategy is to run the script that we want to run on the login node, perhaps at low resolution and for a short time. I think that should trigger all the downloads that would be needed for the simulation. It's robust because we directly use the same script that would be used for the simulation itself. It may not require much more manual intervention from the user either, since calling a utility function takes about as much effort as changing the architecture / problem size.
Curious to hear thoughts.
> Also it'd be great to see the bash script!! I am confused why bash is better than julia, but I might be missing something. Possibly, if there is a bash script then we can simply translate the same commands into julia.
Nobody claimed that the bash script is better than Julia (yet, right?).
But sure, the bash script could be translated to Julia easily (think so) and this should be a good starting point for the download_data(model) method ;)
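As a starting point for that translation, here is a sketch of the wget loop in pure Julia, using only the Downloads standard library. Only two of the JRA55 RYF files are listed for brevity (the full URL list is the same as in the bash script below), and the helper name download_forcing_files is made up:

```julia
# Sketch of the bash wget loop in pure Julia, using only the stdlib
# Downloads module. File list abbreviated; extend with the remaining URLs.
using Downloads

files = Dict(
    "RYF.tas.1990_1991.nc" => "https://www.dropbox.com/scl/fi/fpl0npwi476w635g6lke9/RYF.tas.1990_1991.nc?rlkey=0skb9pe6lgbfbiaoybe7m945s&dl=1",
    "RYF.uas.1990_1991.nc" => "https://www.dropbox.com/scl/fi/86wetpqla2x97isp8092g/RYF.uas.1990_1991.nc?rlkey=rcaf18sh1yz0v9g4hjm1249j0&dl=1",
)

function download_forcing_files(files; dir = pwd())
    for (name, url) in files
        filepath = joinpath(dir, name)
        if !isfile(filepath)  # skip files already downloaded
            @info "Downloading $name..."
            Downloads.download(url, filepath)
        end
    end
    return nothing
end

# download_forcing_files(files)  # run this on the login node
```

Skipping files that already exist also makes the script safe to re-run after a walltime kill.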
Hm... I see @glwagner's point. For the bathymetry, it should be straightforward to download the raw data before any regridding etc. happens to it. But yes, it's not until the user constructs a coupled model with an atmosphere that the simulation has all the available information, right?
Perhaps providing a keyword to the appropriate methods to use data from /local/directory/this/and/that/ is more robust. But could we also have specific methods like download_raw_bathymetry_ETOPO(), download_raw_JRA55_RYF(), and download_raw_ECCO()? Does this make sense?
> If you run the same script on the login node, using CPU architecture (and coarse resolution and say changing the stop_iteration=1), does it achieve the desired effect?
I agree with @glwagner that it makes sense to just run the same script with a coarse grid to download the necessary data. The issue, however, is that the login node has additional storage and walltime constraints. So, if I run run!(simulation) on the login node to download the necessary files, it kills the job after 15 minutes (the default walltime limit on this HPC) - as a result, I download the ETOPO and ECCO data but not the raw JRA55 data. The bash script doesn't seem to have this constraint when using wget. Not sure what the reason is...
Github isn't allowing uploading bash scripts so I'll just paste it below.
```bash
#!/bin/bash
# Associative array mapping local file names to their download URLs
declare -A files=(
    ["RYF.friver.1990_1991.nc"]="https://www.dropbox.com/scl/fi/21ggl4p74k4zvbf04nb67/RYF.friver.1990_1991.nc?rlkey=ny2qcjkk1cfijmwyqxsfm68fz&dl=1"
    ["RYF.prra.1990_1991.nc"]="https://www.dropbox.com/scl/fi/5icl1gbd7f5hvyn656kjq/RYF.prra.1990_1991.nc?rlkey=iifyjm4ppwyd8ztcek4dtx0k8&dl=1"
    ["RYF.prsn.1990_1991.nc"]="https://www.dropbox.com/scl/fi/1r4ajjzb3643z93ads4x4/RYF.prsn.1990_1991.nc?rlkey=auyqpwn060cvy4w01a2yskfah&dl=1"
    ["RYF.licalvf.1990_1991.nc"]="https://www.dropbox.com/scl/fi/44nc5y27ohvif7lkvpyv0/RYF.licalvf.1990_1991.nc?rlkey=w7rqu48y2baw1efmgrnmym0jk&dl=1"
    ["RYF.huss.1990_1991.nc"]="https://www.dropbox.com/scl/fi/66z6ymfr4ghkynizydc29/RYF.huss.1990_1991.nc?rlkey=107yq04aew8lrmfyorj68v4td&dl=1"
    ["RYF.psl.1990_1991.nc"]="https://www.dropbox.com/scl/fi/0fk332027oru1iiseykgp/RYF.psl.1990_1991.nc?rlkey=4xpr9uah741483aukok6d7ctt&dl=1"
    ["RYF.rhuss.1990_1991.nc"]="https://www.dropbox.com/scl/fi/1agwsp0lzvntuyf8bm9la/RYF.rhuss.1990_1991.nc?rlkey=8cd0vs7iy1rw58b9pc9t68gtz&dl=1"
    ["RYF.rlds.1990_1991.nc"]="https://www.dropbox.com/scl/fi/y6r62szkirrivua5nqq61/RYF.rlds.1990_1991.nc?rlkey=wt9yq3cyrvs2rbowoirf4nkum&dl=1"
    ["RYF.rsds.1990_1991.nc"]="https://www.dropbox.com/scl/fi/z6fkvmd9oe3ycmaxta131/RYF.rsds.1990_1991.nc?rlkey=r7q6zcbj6a4fxsq0f8th7c4tc&dl=1"
    ["RYF.tas.1990_1991.nc"]="https://www.dropbox.com/scl/fi/fpl0npwi476w635g6lke9/RYF.tas.1990_1991.nc?rlkey=0skb9pe6lgbfbiaoybe7m945s&dl=1"
    ["RYF.uas.1990_1991.nc"]="https://www.dropbox.com/scl/fi/86wetpqla2x97isp8092g/RYF.uas.1990_1991.nc?rlkey=rcaf18sh1yz0v9g4hjm1249j0&dl=1"
    ["RYF.vas.1990_1991.nc"]="https://www.dropbox.com/scl/fi/d38sflo9ddljstd5jwgml/RYF.vas.1990_1991.nc?rlkey=f9y3e57kx8xrb40gbstarf0x6&dl=1"
)

for file in "${!files[@]}"; do
    echo "Downloading $file..."
    wget -O "$file" "${files[$file]}"
done

echo "All files downloaded!"
```

(Note: the original paste listed RYF.rsds.1990_1991.nc twice with the same URL; associative array keys are unique, so the duplicate entry is dropped here.)
Always prefer pasting scripts rather than links
I think ETOPO and ECCO are downloaded by regrid_bathymetry! and set! respectively?
PS: I think this issue should be brought up in parallel with the admins of the HPC center. Internet access on the compute nodes is likely not changeable, but login-node constraints probably are, right? 15 minutes is too short to download large files. From the info provided, it sounds like the HPC somehow requires one to use bash; I don't think using different Julia functions will solve the 15-minute issue.
OK, so it looks to me like one does not have to get to run! to download the JRA55 data. We only have to build the prescribed atmosphere: since
https://github.com/CliMA/ClimaOcean.jl/blob/459d76da9ebe6e24d0fc6381b134d140c96bc018/src/DataWrangling/JRA55.jl#L416
and
https://github.com/CliMA/ClimaOcean.jl/blob/459d76da9ebe6e24d0fc6381b134d140c96bc018/src/DataWrangling/JRA55.jl#L674-L682
are all within JRA55PrescribedAtmosphere, I think we just need

```julia
JRA55PrescribedAtmosphere(arch, time_indices; kw...)
```

somewhere.
As far as I can tell, it does not matter what the time_indices are. The dataset cannot be divided prior to downloading, so the entire year from 1993-1994 is downloaded no matter what.
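So the login-node pre-download for JRA55 could be as small as the following sketch. The import path and call signature here are assumptions based on the linked source and may differ between versions:

```julia
# Hedged sketch: building the prescribed atmosphere on the login node
# should trigger the JRA55 downloads, with no run!(simulation) needed.
# Import path and signature are assumptions from the linked code.
using Oceananigans: CPU
using ClimaOcean.JRA55: JRA55PrescribedAtmosphere

atmosphere = JRA55PrescribedAtmosphere(CPU())
```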
As for ETOPO1 data, we currently call this within regrid_bathymetry:
https://github.com/CliMA/ClimaOcean.jl/blob/459d76da9ebe6e24d0fc6381b134d140c96bc018/src/Bathymetry.jl#L104-L106
so we just need to isolate this in a function like
```julia
function download_bathymetry_data(; dir = ".",
        url = "https://www.ngdc.noaa.gov/thredds/fileServer/global/ETOPO2022/60s/60s_surface_elev_netcdf",
        filename = "ETOPO_2022_v1_60s_N90W180_surface.nc",
        progress = download_progress)

    filepath = joinpath(dir, filename)
    fileurl = url * "/" * filename # joinpath on windows creates the wrong url
    Downloads.download(fileurl, filepath; progress)
    return nothing
end
```

(Compared to the snippet sketched earlier in the thread, this adds the missing dir keyword, makes the arguments keywords with a leading semicolon, and passes the progress keyword through instead of hard-coding download_progress.)
then users can call

```julia
using ClimaOcean.Bathymetry: download_bathymetry_data
download_bathymetry_data()
```

to download it.
Finally for ECCOMetadata we have this function:
https://github.com/CliMA/ClimaOcean.jl/blob/459d76da9ebe6e24d0fc6381b134d140c96bc018/src/DataWrangling/ECCO/ECCO_metadata.jl#L234
Exposing the download_dataset functions has resolved this issue - closing