Oceananigans.jl Doing better than `overwriting_existing=true` with output?

Both the JLD2 and NetCDF output writers currently support a keyword argument, overwrite_existing, which indicates whether a file with the same name as the currently requested output, if one is found, should be deleted (that is, "overwritten"). For example,

https://github.com/CliMA/Oceananigans.jl/blob/1775f2ba9cc2e00bf53f1f36d93f72541d868287/examples/two_dimensional_turbulence.jl#L110-L113

Basically, users probably almost always want to set this to true for convenience (even though the default is false), and that's why we do it in the example too. Although it seems like a pointless always-required line, we've nevertheless reasoned that users should be aware that data might be lost... sort of like signing a waiver...

In a recent conversation @Sbozzolo suggested that this is a bit silly, and I have to agree it is pointless red tape in some ways, rather like "accepting cookies" every time we visit a website. We don't really think about that anymore, we just click as fast as we can so we can move on with our lives...

I think @Sbozzolo might have some idea for how to do this better. More or less I think the gist is that, rather than having a system where an output writer might have to delete a file, we instead create a directory system where new output is always saved in a unique directory. In other words, rather than saving user output at

filepath = joinpath(dir, filename)

we would save output at the path

filepath = joinpath(dir, unique_simulation_id, filename)

The upside of this system is that the output writers are relieved of any potential need to delete data. That onus is passed to the user instead, where the responsibility belongs.

The downside is that we have to generate the directory name unique_simulation_id. No matter what we choose, it's going to require effort from users to interpret and learn. It also has the major downside of "hiding" information from users: they'll run a script, and then hunt around for the data that was saved. No matter what naming system we choose for unique_simulation_id, I think it makes it harder for users to find their data.

Finally, we should note of course that there's no reason why users can't do this themselves in their own scripts. We don't have to make directories for them; they can simply generate IDs themselves and mkdir. If the user is sophisticated enough to be running lots of experiments with highly valuable data, they can probably figure out how to create directories...

We're also entering dangerous territory, I think: trying to manage users' workflows. Workflow management tools are good in general, of course, but I just think it's hard to do well and in a way general enough to be useful to everyone. So by wading into this area, we risk doing a crap job and hindering at least some people rather than helping.

Anyways, after writing this out I'm a little wary of introducing anything now (maybe actually showing how to integrate workflow management tools into Oceananigans scripts is a better solution). But I thought it would be useful to open this up for discussion.

Apr 06 '24 18:04 glwagner

Here's an example of how a user might generate unique IDs on their own without requiring our meddling:

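using Dates

# Assumes `simulation`, `model`, and `outputs` are already defined in the script.
id = string("run_starting_at_", now()) # e.g. "run_starting_at_2024-04-06T14:16:16.083"

# Each run writes into its own timestamped directory, so nothing is overwritten.
simulation.output_writers[:fields] = JLD2OutputWriter(model, outputs;
                                                      schedule = TimeInterval(0.6),
                                                      filename = "pretty_cool_data.jld2",
                                                      dir = id)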

This savvy user then never has to write overwrite_existing=true as long as they don't run two simulations within a millisecond of one another (I guess if that's possible, a bit more work is needed from the user to make a unique id).
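For the case where two runs might start within the same millisecond, here is a minimal sketch of a collision-safe variant. The helper name `unique_run_dir` and the timestamp format are assumptions made for illustration, not part of Oceananigans. It relies on `mkdir` throwing when the directory already exists, and it avoids `:` in the name because that character is not allowed in Windows paths.

```julia
using Dates

# Hypothetical helper (not part of Oceananigans): create and return a fresh
# run directory inside `base_dir`, appending a counter if the timestamped
# name is already taken.
function unique_run_dir(base_dir=pwd(); prefix="run_starting_at_")
    mkpath(base_dir)
    stamp = Dates.format(now(), "yyyy-mm-dd_HH-MM-SS-sss")
    n = 0
    while true
        name = n == 0 ? prefix * stamp : string(prefix, stamp, "_", n)
        candidate = joinpath(base_dir, name)
        try
            mkdir(candidate)  # throws if `candidate` already exists
            return candidate
        catch err
            # Retry only when the failure is "already exists"; rethrow otherwise.
            (err isa Base.IOError && ispath(candidate)) || rethrow()
            n += 1
        end
    end
end
```

The returned path could then be passed as `dir` to the output writer in place of `id` above.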

Apr 06 '24 18:04 glwagner

So depending on how we perceive the importance of this issue we could add docs and an example illustrating this workflow to users, as an alternative to changing the source code.

Apr 06 '24 18:04 glwagner

I started tackling this issue in ClimaAtmos last week. I wrote a module, OutputPathGenerator, in a separate utilities package (documentation).

This module defines an object, OutputPathGenerator, that can be extended with different OutputPathGeneratorStyles. The OutputPathGenerator is used in a generate_output_path function that takes the base output dir and the style. The simplest of these styles is "overwrite".

The style that is currently being used in Atmos is ActiveLinkStyle. Citing from the docs:

This style provides a more convenient and non-destructive approach. It manages a sequence of subfolders within the base directory specified by output_path. It also creates a symbolic link named output_active that points to the current active subfolder. This allows you to easily access the latest simulation results with a predictable path.

Example: Let's assume your output_path is set to data. If data doesn't exist, the module creates it and returns data/output_active. This link points to the newly created subfolder data/output_0000. If data exists and contains an output_active link pointing to data/output_0005, the module creates a new subfolder data/output_0006 and updates output_active to point to it. If data exists with or without an output_active link, the module checks for existing subfolders named data/output_XXXX (with XXXX a number). If none are found, it creates data/output_0000 and a link data/output_active pointing to it.
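To make the quoted behavior concrete, here is a rough sketch using only Julia's standard filesystem functions. It is not the ClimaUtilities implementation: the function name `next_active_output` is hypothetical, and as a simplification it picks the next number from the existing `output_XXXX` subfolder names rather than by reading the link target.

```julia
# Hypothetical sketch of the "active link" behavior quoted above; this is
# NOT the ClimaUtilities implementation.
function next_active_output(base::AbstractString)
    mkpath(base)

    # Collect the numbers of existing subfolders named output_XXXX.
    numbers = Int[]
    for name in readdir(base)
        m = match(r"^output_(\d{4})$", name)
        m === nothing || push!(numbers, parse(Int, m.captures[1]))
    end

    # Start at 0000, or continue one past the largest existing number.
    next = isempty(numbers) ? 0 : maximum(numbers) + 1
    new_dir = joinpath(base, "output_" * lpad(string(next), 4, '0'))
    mkdir(new_dir)

    # Re-point the output_active link at the new subfolder (relative target).
    link = joinpath(base, "output_active")
    islink(link) && rm(link)
    symlink(basename(new_dir), link)
    return link
end
```

Calling `next_active_output("data")` twice would leave `data/output_0000`, `data/output_0001`, and a `data/output_active` link pointing at the latter. Creating symbolic links on Windows may require extra privileges.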

Atmos uses OutputPathGenerator internally. My vision is that end users would be providing the base path and possibly choosing a Style if they don't want the default behavior (which is the ActiveLinkStyle). Styles are Julia objects and new ones can be defined in scripts by implementing a method for the function generate_output_path.

Apr 06 '24 19:04 Sbozzolo

More notes:

A third possibility is to add a feature that generates unique IDs, but make it an optional property of Simulation, something like unique_output_dir = true or unique_output_dir = NowString(prefix). That sits between "always" doing it and merely showing users how to do it themselves.
To implement this change we have to hold off initializing the output writer files until run!(simulation). We want to do this anyways for more convenient checkpointing so that's a good change...
We might have to think a bit about whether the dir for output writers should become a relative path vs absolute path (as it is now) if we do something like this. Or maybe we just want to remove dir completely from the output writers, and put directory management in Simulation instead.
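
One way the optional property in these notes might look is sketched below. Everything here is hypothetical: `NowString` is only a name floated above, and `resolve_output_dir` is an invented helper. Neither exists in Oceananigans. The idea is that the directory is resolved late, for instance when the simulation is run, rather than when the output writer is constructed.

```julia
using Dates

# Hypothetical: `NowString` is only a name suggested in the notes above.
struct NowString
    prefix::String
end

# Calling the object produces a directory name stamped with the current time.
(ns::NowString)() = ns.prefix * Dates.format(now(), "yyyy-mm-dd_HH-MM-SS-sss")

# Hypothetical helper: turn a `unique_output_dir` setting into a directory.
resolve_output_dir(dir, ::Nothing) = dir
resolve_output_dir(dir, flag::Bool) = flag ? joinpath(dir, NowString("run_")()) : dir
resolve_output_dir(dir, namer::NowString) = joinpath(dir, namer())
```

With this shape, `resolve_output_dir("out", NowString("exp_"))` returns a path under `out` whose last component starts with `exp_`, while `nothing` or `false` leaves `dir` unchanged.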

Apr 06 '24 19:04 glwagner

Not sure this is related, but I usually use other languages to load and plot the outputs, and I always get a permission-denied error when the file is open in another kernel. I was wondering if there is a way to bypass that and force the overwrite.

May 06 '24 22:05 iuryt
