Workflow for downloading input forcing files without GPU node internet access
Hi all,
I am trying to run ClimaOcean on the Gadi supercomputer in Australia, and only the login CPU node has internet access on the HPC (for security reasons).
This means that I can't run examples that require downloaded files without first manually downloading the input files, placing them in the necessary folders, and then submitting a job to the GPU or CPU nodes on the HPC. This is an OK workaround, but I realised that as others run this model, they may also be using HPC environments that don't have internet access outside of the login node.
I just wanted to flag this as a potential issue, and to discuss whether it may be worth developing a workflow which avoids this need to manually download input files prior to running the model. This may be something that is unavoidable, but I figured I would flag it! Thanks
The files go to JULIA_DEPOT_PATH. Can you set your JULIA_DEPOT_PATH to point to a place that is accessible from the GPU nodes?
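For example, something like the following on the login node before submitting the job (the depot path is a placeholder; adapt it to a filesystem, such as your scratch or project area, that the compute nodes can read):

```shell
# Point the Julia depot at storage visible from both login and compute
# nodes, before instantiating packages or submitting the job.
# The path below is a placeholder for your own scratch/project area.
export JULIA_DEPOT_PATH=/scratch/project/username/.julia

# Sanity check that the variable is set as intended
echo "$JULIA_DEPOT_PATH"
```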
As you say, I think the way to run these cases is to initiate the simulation on CPU, but at a very coarse resolution and only running for a short amount of time.
Do you think that is an acceptable workflow? Perhaps, rather than using a simulation we could develop a utility that's something like download_data(model)? That would avoid the step of having to use run!(simulation).
The issue isn't file access - the GPU nodes are able to access the depot path. The issue is that any attempt to download data (using wget, curl, etc.) fails because the GPU nodes don't have internet access.
Yes, the ideal workflow would be to create a standalone function that can be run on the login node (for example, I have written a bash script with wget that downloads the JRA, ECCO and Bathymetry files into the necessary folders directly from the login node). I may be overblowing the issue, but I think CPU/GPU nodes in many HPCs don't have internet access. So integrating a simple download script that can be run from the login node prior to run!(simulation) would help those running on such HPC environments.
Hope that makes sense!
Why do you have to change the folders that the data is downloaded into?
I don't change the folders the data is downloaded into. I just manually download the data into the folders that code would ordinarily download the data into (if it had internet access).
If I understand correctly, @taimoorsohail wrote a bash script to do what the proposed download_data(model) method would be doing, right?
Actually, a minor note: the CPU nodes on HPCs often don't have internet access either. So the issue is not GPU-specific.
If you run the same script on the login node, using CPU architecture (and coarse resolution and say changing the stop_iteration=1), does it achieve the desired effect?
Also it'd be great to see the bash script!! I am confused why bash is better than julia, but I might be missing something. Possibly, if there is a bash script then we can simply translate the same commands into julia.
It may also be possible to hand a function all of the metadata / other objects that may be associated with data.
The challenge I think is that the data is not explicitly tied to the model. For example, we provide functionality for users to force their model with restoring to ECCO. But they need not set it up the same way every time. They could use a callback, or a forcing function. It is not rigid. So it may be hard to serve a function that is guaranteed to work. We can serve a function that makes many assumptions about a typical setup, looks for data in the typical place, etc. But at that point I am not sure we have made much progress.
A more robust strategy is to run the script that we want to run on the login node, perhaps at low resolution and for a short time. I think that should trigger all the downloads that would be needed for the simulation. It's robust because we directly use the same script that would be used for the simulation itself. It may not require much more manual intervention from the user either, since calling a utility function takes about as much effort as changing the architecture / problem size.
Curious to hear thoughts.
> Also it'd be great to see the bash script!! I am confused why bash is better than julia, but I might be missing something. Possibly, if there is a bash script then we can simply translate the same commands into julia.
Nobody claimed that the bash script is better than Julia (yet, right?).
But sure, the bash script could be translated to Julia easily (think so) and this should be a good starting point for the download_data(model) method ;)
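As a starting point for that translation, here is a sketch of the wget loop in pure Julia, using only the Downloads standard library. Only two of the JRA55 RYF files are listed for brevity (the full URL list is the same as in the bash script below), and the helper name download_forcing_files is made up:

```julia
# Sketch of the bash wget loop in pure Julia, using only the stdlib
# Downloads module. File list abbreviated; extend with the remaining URLs.
using Downloads

files = Dict(
    "RYF.tas.1990_1991.nc" => "https://www.dropbox.com/scl/fi/fpl0npwi476w635g6lke9/RYF.tas.1990_1991.nc?rlkey=0skb9pe6lgbfbiaoybe7m945s&dl=1",
    "RYF.uas.1990_1991.nc" => "https://www.dropbox.com/scl/fi/86wetpqla2x97isp8092g/RYF.uas.1990_1991.nc?rlkey=rcaf18sh1yz0v9g4hjm1249j0&dl=1",
)

function download_forcing_files(files; dir = pwd())
    for (name, url) in files
        filepath = joinpath(dir, name)
        if !isfile(filepath)  # skip files already downloaded
            @info "Downloading $name..."
            Downloads.download(url, filepath)
        end
    end
    return nothing
end

# download_forcing_files(files)  # run this on the login node
```

Skipping files that already exist also makes the script safe to re-run after a walltime kill.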
Hm... I see @glwagner's point. For the bathymetry, it should be straightforward to download the raw data before any regridding etc. happens to it. But yes, it's not until the user constructs a coupled model with an atmosphere that the simulation has all the available information, right?
Perhaps providing a keyword to the appropriate methods to use data from /local/directory/this/and/that/ is more robust. But could we also have specific methods like download_raw_bathymetry_ETOPO(), download_raw_JRA55_RYF(), and download_raw_ECCO()? Does this make sense?
> If you run the same script on the login node, using CPU architecture (and coarse resolution and say changing the stop_iteration=1), does it achieve the desired effect?
I agree with @glwagner that it makes sense to just run the same script with a coarse grid to download the necessary data. The issue, however, is that the login node has additional storage and walltime constraints. So, if I run run!(simulation) on the login node to download the necessary files, it kills the job after 15 minutes (the default walltime limit on this HPC) - as a result, I download the ETOPO and ECCO data but not the raw JRA55 data. The bash script doesn't seem to have this constraint when using wget. Not sure what the reason is...
Github isn't allowing uploading bash scripts so I'll just paste it below.
```bash
#!/bin/bash
# Associative array mapping local file names to their download URLs
declare -A files=(
    ["RYF.friver.1990_1991.nc"]="https://www.dropbox.com/scl/fi/21ggl4p74k4zvbf04nb67/RYF.friver.1990_1991.nc?rlkey=ny2qcjkk1cfijmwyqxsfm68fz&dl=1"
    ["RYF.prra.1990_1991.nc"]="https://www.dropbox.com/scl/fi/5icl1gbd7f5hvyn656kjq/RYF.prra.1990_1991.nc?rlkey=iifyjm4ppwyd8ztcek4dtx0k8&dl=1"
    ["RYF.prsn.1990_1991.nc"]="https://www.dropbox.com/scl/fi/1r4ajjzb3643z93ads4x4/RYF.prsn.1990_1991.nc?rlkey=auyqpwn060cvy4w01a2yskfah&dl=1"
    ["RYF.licalvf.1990_1991.nc"]="https://www.dropbox.com/scl/fi/44nc5y27ohvif7lkvpyv0/RYF.licalvf.1990_1991.nc?rlkey=w7rqu48y2baw1efmgrnmym0jk&dl=1"
    ["RYF.huss.1990_1991.nc"]="https://www.dropbox.com/scl/fi/66z6ymfr4ghkynizydc29/RYF.huss.1990_1991.nc?rlkey=107yq04aew8lrmfyorj68v4td&dl=1"
    ["RYF.psl.1990_1991.nc"]="https://www.dropbox.com/scl/fi/0fk332027oru1iiseykgp/RYF.psl.1990_1991.nc?rlkey=4xpr9uah741483aukok6d7ctt&dl=1"
    ["RYF.rhuss.1990_1991.nc"]="https://www.dropbox.com/scl/fi/1agwsp0lzvntuyf8bm9la/RYF.rhuss.1990_1991.nc?rlkey=8cd0vs7iy1rw58b9pc9t68gtz&dl=1"
    ["RYF.rlds.1990_1991.nc"]="https://www.dropbox.com/scl/fi/y6r62szkirrivua5nqq61/RYF.rlds.1990_1991.nc?rlkey=wt9yq3cyrvs2rbowoirf4nkum&dl=1"
    ["RYF.rsds.1990_1991.nc"]="https://www.dropbox.com/scl/fi/z6fkvmd9oe3ycmaxta131/RYF.rsds.1990_1991.nc?rlkey=r7q6zcbj6a4fxsq0f8th7c4tc&dl=1"
    ["RYF.tas.1990_1991.nc"]="https://www.dropbox.com/scl/fi/fpl0npwi476w635g6lke9/RYF.tas.1990_1991.nc?rlkey=0skb9pe6lgbfbiaoybe7m945s&dl=1"
    ["RYF.uas.1990_1991.nc"]="https://www.dropbox.com/scl/fi/86wetpqla2x97isp8092g/RYF.uas.1990_1991.nc?rlkey=rcaf18sh1yz0v9g4hjm1249j0&dl=1"
    ["RYF.vas.1990_1991.nc"]="https://www.dropbox.com/scl/fi/d38sflo9ddljstd5jwgml/RYF.vas.1990_1991.nc?rlkey=f9y3e57kx8xrb40gbstarf0x6&dl=1"
)

for file in "${!files[@]}"; do
    echo "Downloading $file..."
    wget -O "$file" "${files[$file]}"
done

echo "All files downloaded!"
```

(Note: the original paste listed RYF.rsds.1990_1991.nc twice with the same URL; associative array keys are unique, so the duplicate entry is dropped here.)
Always prefer pasting scripts rather than links
I think ETOPO and ECCO are downloaded by regrid_bathymetry! and set! respectively?
PS: I think this issue should be brought up in parallel with the admins of the HPC center. Internet access on the compute nodes is likely not changeable, but login-node constraints probably are, right? 15 minutes is too short to download large files. From the info provided, it sounds like the HPC somehow requires one to use bash; I don't think using different Julia functions will solve the 15-minute issue.
OK, so it looks to me like one does not have to get to run! to download the JRA55 data. We only have to build the prescribed atmosphere: since
https://github.com/CliMA/ClimaOcean.jl/blob/459d76da9ebe6e24d0fc6381b134d140c96bc018/src/DataWrangling/JRA55.jl#L416
and
https://github.com/CliMA/ClimaOcean.jl/blob/459d76da9ebe6e24d0fc6381b134d140c96bc018/src/DataWrangling/JRA55.jl#L674-L682
are all within JRA55PrescribedAtmosphere, I think we just need

```julia
JRA55PrescribedAtmosphere(arch, time_indices; kw...)
```

somewhere.
As far as I can tell, it does not matter what the time_indices are. The dataset cannot be divided prior to downloading, so the entire year from 1993-1994 is downloaded no matter what.
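So the login-node pre-download for JRA55 could be as small as the following sketch. The import path and call signature here are assumptions based on the linked source and may differ between versions:

```julia
# Hedged sketch: building the prescribed atmosphere on the login node
# should trigger the JRA55 downloads, with no run!(simulation) needed.
# Import path and signature are assumptions from the linked code.
using Oceananigans: CPU
using ClimaOcean.JRA55: JRA55PrescribedAtmosphere

atmosphere = JRA55PrescribedAtmosphere(CPU())
```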
As for ETOPO1 data, we currently call this within regrid_bathymetry:
https://github.com/CliMA/ClimaOcean.jl/blob/459d76da9ebe6e24d0fc6381b134d140c96bc018/src/Bathymetry.jl#L104-L106
so we just need to isolate this in a function like
```julia
function download_bathymetry_data(; dir = ".",
        url = "https://www.ngdc.noaa.gov/thredds/fileServer/global/ETOPO2022/60s/60s_surface_elev_netcdf",
        filename = "ETOPO_2022_v1_60s_N90W180_surface.nc",
        progress = download_progress)

    filepath = joinpath(dir, filename)
    fileurl = url * "/" * filename # joinpath on windows creates the wrong url
    Downloads.download(fileurl, filepath; progress)
    return nothing
end
```

(Compared to the snippet sketched earlier in the thread, this adds the missing dir keyword, makes the arguments keywords with a leading semicolon, and passes the progress keyword through instead of hard-coding download_progress.)
then users can call

```julia
using ClimaOcean.Bathymetry: download_bathymetry_data
download_bathymetry_data()
```

to download it.
Finally for ECCOMetadata we have this function:
https://github.com/CliMA/ClimaOcean.jl/blob/459d76da9ebe6e24d0fc6381b134d140c96bc018/src/DataWrangling/ECCO/ECCO_metadata.jl#L234
Exposing the download_dataset functions has resolved this issue - closing