Oceananigans.jl icon indicating copy to clipboard operation
Oceananigans.jl copied to clipboard

A new interface for output

Open glwagner opened this issue 1 year ago • 18 comments

I've long been unsatisfied with how we build output. It requires a lot of typing --- that is, boilerplate. I often feel a sense of dread when I have to go beyond "visualizing the final iteration" of a prototype to defining an output writer. So much typing.

This PR is an attempt to make output easier and more fun. I use JLD2 as an example but if there is some consensus then I think this PR should extend the same to NetCDF.

The main thrust of this PR is a new function called output!. It works like this:

output!(simulation, outputs; schedule=TimeInterval(1), filename="low_hanging_fruit")

The default is JLD2Format(). For NetCDF users would write

output!(simulation, outputs, NetCDFFormat(); kw...)

The function adds an output writer to simulation, choosing a "generic name" for the simulation.output_writers dictionary. Does this enable one line output writing?

I'd love to hear feedback about this design. I implemented it in the two_dimensional_turbulence.jl example for illustration:

https://github.com/CliMA/Oceananigans.jl/blob/6a6853ddf53c53c321c16393682b786e82ef5a8f/examples/two_dimensional_turbulence.jl#L109

There are two more things. First, we need to default with_halos=true for JLD2OutputWriter. The time has come because FieldTimeSeries is mature. We do, in fact, want halos.

The other conundrum is overwrite_existing which is discussed on #3543. In short I am wondering whether the best course of action is simply to make default overwrite_existing=true and solve so much boilerplate. Simulations are cheap, but life is short!

PS I also want to change add_callback! to just callback!.

glwagner avatar Sep 26 '24 02:09 glwagner

There's an intriguing side benefit of this "unified" interface for output. It means that it is possible (though we don't have it now) for users to specify an "output preference" in a Preferences.toml, which would then determine the default behavior of output!. I think it also legitimately makes it easier to switching between formats.

glwagner avatar Sep 26 '24 02:09 glwagner

Another abstraction I think would be useful is a utility for building multiple outputs. Imagine this:

indices = (xy=(:, :, k), xz=(:, 1, :), yz=(1, :, :))
sliced_outputs!(simulation, outputs, indices; schedule=TimeInterval(1), filename="sliced")

this would append filename with the keys of indices, eg we would get 3 outputs titled "sliced_xy", "sliced_xz", and "sliced_yz".

I personally find myself using indices the most for this kind of thing but others might have additional input.

glwagner avatar Sep 26 '24 03:09 glwagner

I'd love to hear feedback about this design.

Maybe I'm personally quite picky about specifying outputs and output file names, so I might always end up with verbose boilerplate for output writing (and I'm personally fine with that). But I'd support reducing boilerplate and maybe just a bit of flexibility would work even for picky people! I remember having this conversation in #1171 actually!

One-line flexible output writing would be especially great for examples, new user friendliness, and quick iteration.

Some thoughts:

  1. I think the name output! is a bit vague in what it does. Does it just output the current state of the simulation? Would add_output_writers! be clearer and align more closely with existing Oceananigans nomenclature? For the same reason, I'd suggest keeping add_callback! over renaming to callback!.
  2. I frequently output both JLD2 and NetCDF versions of the exact same data. So ideally subsequent calls to add_output_writers! with JLD2Format() and NetCDFFormat() would add both.
  3. Would definitely support extending this to NetCDF as well!

First, we need to default with_halos=true for JLD2OutputWriter. The time has come because FieldTimeSeries is mature. We do, in fact, want halos.

Sounds good! Will be nice for derivatives to work by default. Although FieldTimeSeries could do with a bit more maturing, e.g. issues #3144 and #3750.

The other conundrum is overwrite_existing which is discussed on #3543. In short I am wondering whether the best course of action is simply to make default overwrite_existing=true and solve so much boilerplate.

Is the "so much boilerplate" just the extra one line overwrite_existing = true for each output writer? Maybe I'm too conservative here but I think the default should be overwrite_existing = false just because the cost of overwriting and losing data can be very high.

But I really like the suggestions in #3543 of having the option to save output in unique directories be easily specifiable.

If we want an easy default, then maybe it could do some version of the unique directories? Or maybe have overwrite_existing = false (the default) just rename existing files?

I'm not sure of the best approach but as someone who's conservative about overwriting by default I'm tempted to err on the side of caution.

Another abstraction I think would be useful is a utility for building multiple outputs. Imagine this:

indices = (xy=(:, :, k), xz=(:, 1, :), yz=(1, :, :))
sliced_outputs!(simulation, outputs, indices; schedule=TimeInterval(1), filename="sliced")

this would append filename with the keys of indices, eg we would get 3 outputs titled "sliced_xy", "sliced_xz", and "sliced_yz".

Love this idea! Hoping that you can also pass e.g. (surface=(:, :, k), zonal=(:, 1, :), meridional=(1, :, :)) to get sliced_surface, sliced_zonal, and sliced_meridional.

ali-ramadhan avatar Sep 26 '24 04:09 ali-ramadhan

Maybe I'm personally quite picky about specifying outputs and output file names, so I might always end up with verbose boilerplate for output writing

Totally and to be clear, when we think about the economy of an interface, we are thinking about prototyping, illustrating, testing, not necessarily "production". I think "production" places fewer demands on the user interface and what we have now is ok for production. This PR mainly improves the small stuff. Also arguably it's more helpful for experienced than new users.

I think the name output! is a bit vague in what it does. Does it just output the current state of the simulation? Would add_output_writers! be clearer and align more closely with existing Oceananigans nomenclature? For the same reason, I'd suggest keeping add_callback! over renaming to callback!.

I agree that with "add" and "writer" the meaning is cemented. I think it's important to recognize trade-offs though, because there is a limit to the benefit of being explicit (when things become hard to read or understand). I think in this case I accept that output_writer! is probably better than ouput!. I think prepending add_ has a more marginal benefit (and is a little ugly) and that context is really what drives understanding of callback! / add_callback! (eg a schedule, etc). But this is certainly open for discussion.

Love this idea! Hoping that you can also pass e.g. (surface=(:, :, k), zonal=(:, 1, :), meridional=(1, :, :)) to get sliced_surface, sliced_zonal, and sliced_meridional.

Yes for sure! In that example the keys "xy", "xz", etc would be names appended to the filename prefix.

I think the default should be overwrite_existing = false just because the cost of overwriting and losing data can be very high.

Do you run with this option? Curious because I never use it. I think the cost of losing data is actually usually very small, it's only in a small 1% of cases that the data is valuable. I think that's actually the key insight behind the default, that expensive simulations are rare so it doesn't make sense to default it.

glwagner avatar Sep 26 '24 13:09 glwagner

Is the "so much boilerplate" just the extra one line overwrite_existing = true for each output writer?

It's one line --- but it's in every script, sometimes many times! Add all that up and you get to a huge amount...

glwagner avatar Sep 26 '24 13:09 glwagner

@navidcy any thoughts?

I think the main feedback is to keep add_callback! and use add_output_writer! instead. I think it's not very pretty but I am ok with it. It's utilitarian 🤖 .

glwagner avatar Oct 01 '24 12:10 glwagner

I agree with @ali-ramadhan, keeping a overwrite_existing seems quite important to me too, potentially with a default value of overwrite_existing = false.

Another issue to add into this discussion is the fact of how to handle output (e.g. file.nc) with checkpoints and file_splitting. Currently, when a simulation is pickup from a checkpoint having the flag overwrite_existing = true rewrites to empty all the pre-existing files (i.e. file_part1.nc, file_partN.nc). With overwrite_existing = false, the simulation crashes because it doesn't find the file file.nc. I think it will be useful to handle automatic concatenation to splitted files, to allow a more flexible output particularly in the context of HPC computing with manageable file sizes, chunks, and wall times.

josuemtzmo avatar Oct 03 '24 15:10 josuemtzmo

potentially with a default value of overwrite_existing = false.

Do you run with overwrite_existing=false? (Outside the context of restoring from a checkpoint.)

glwagner avatar Oct 03 '24 15:10 glwagner

Another issue to add into this discussion is the fact of how to handle output (e.g. file.nc) with checkpoints and file_splitting.

I propose handling this by initializing output files within run! rather than during instantiation of the output writer. This is separate from the interface discussion here though.

glwagner avatar Oct 03 '24 15:10 glwagner

I propose handling this by initializing output files within run! rather than during instantiation of the output writer. This is separate from the interface discussion here though.

Ok, I see. I agree that that discussion can be left to the other PR.

josuemtzmo avatar Oct 03 '24 16:10 josuemtzmo

potentially with a default value of overwrite_existing = false.

Do you run with overwrite_existing=false? (Outside the context of restoring from a checkpoint.)

I agree, that is not a common use case scenario. I have only used overwrite_existing=false without a checkpoint for short tests within the same Julia instance to extend the model output.

josuemtzmo avatar Oct 03 '24 16:10 josuemtzmo

I have only used overwrite_existing=false without a checkpoint for short tests within the same Julia instance to extend the model output.

Thank you for pointing out this use case. I think this is another situation that could be solved by waiting until run! for initialization. We can analyze an existing file and determine whether or not any data within the file will be overwritten based on the simulation parameters (current time, stop time).

Another idea by the way would be to move the concept of "overwriting" to run!, as well. Then the single keyword can apply to all output, or not, which presumably more aligned with what a user would want (rather than toggling overwrite_existing for each writer individually).

glwagner avatar Oct 03 '24 19:10 glwagner

Another idea after talking with @josuemtzmo: add another function called checkpoint!(simulation, schedule) that adds a checkpointer with an automatically generated name. (More to come on this idea)

glwagner avatar Oct 04 '24 15:10 glwagner

Another utility that I believe is needed is a function that displays the information in an output file. For example something like

julia> outputinfo(filename)

which displays things like

  • the names of the fields
  • information about the fields? location, indices, size
  • the grid
  • the number of output snapshots
  • the output times

anything else?

glwagner avatar Oct 05 '24 15:10 glwagner

I'm not too picky about add_output_writer! vs. output_writer! although output! is a bit too vague. I might lean too utilitarian though :robot: Curious what other people think.

I think the default should be overwrite_existing = false just because the cost of overwriting and losing data can be very high.

Do you run with this option? Curious because I never use it. I think the cost of losing data is actually usually very small, it's only in a small 1% of cases that the data is valuable. I think that's actually the key insight behind the default, that expensive simulations are rare so it doesn't make sense to default it.

I do actually haha. But what I also do is set different output directories for each run so I can always go back and compare different runs as I'm iteratively modifying stuff. So I guess setting different directories is guaranteeing that I never overwrite existing files, but then overwrite_existing = false is only a guardrail for me.

Another utility that I believe is needed is a function that displays the information in an output file.

We might almost already have this with the show function of FieldDataset if it's called with backend=OnDisk(). Maybe it could be shared/re-purposed?

ali-ramadhan avatar Oct 07 '24 14:10 ali-ramadhan

I'm not too picky about add_output_writer! vs. output_writer! although output! is a bit too vague. I might lean too utilitarian though 🤖 Curious what other people think.

I'm starting to feel that it's better with "add"... following the precept that functions are verbs and objects are nouns.

I do actually haha. But what I also do is set different output directories for each run so I can always go back and compare different runs as I'm iteratively modifying stuff. So I guess setting different directories is guaranteeing that I never overwrite existing files, but then overwrite_existing = false is only a guardrail for me.

Ah nice. That sounds like something one would do for an expensive simulation, ie if they take more than a few minutes, right? I think it makes sense to set overwrite_existing = false as a guardrail when working with expensive simulations.

The main thing for me is, if we always start by writing a script with overwrite_existing=true.

We might almost already have this with the show function of FieldDataset if it's called with backend=OnDisk(). Maybe it could be shared/re-purposed?

For sure that makes a lot of sense.

@navidcy @simone-silvestri it'd be nice to hear your thoughts.

glwagner avatar Oct 11 '24 21:10 glwagner

I think overwrite_existing = false could be the default. Seems to me that users could easily add the overwrite_existing = true in their script if they know they are experimenting with something and don't wanna be deleting the output every time. Or they can also use dir = "exp1", etc to separate their experimentations.

navidcy avatar Oct 15 '24 19:10 navidcy

I'm starting to feel that it's better with "add"... following the precept that functions are verbs and objects are nouns.

I find add_output!(simulation) to be more intuitive in terms of creating the output and output!(simulation) actually outputs. (Thus, in the latter "output" is the verb, not the noun.)

navidcy avatar Oct 15 '24 19:10 navidcy

I think overwrite_existing = false could be the default. Seems to me that users could easily add the overwrite_existing = true in their script if they know they are experimenting with something and don't wanna be deleting the output every time.

So, overwrite_existing=false is the current default.

I don't dispute that it's "easy" to add overwrite_existing=true, but I don't think we should design the interface only up to the point where "easy" changes close the gap between the default and what users want to do, most of the time. I think the default should legitimately be useful.

I think that overwrite_existing=false is rarely desirable -- users only want this when they run big, expensive simulations. I would say that it's almost impossible to set up a big, expensive simulation without going through some prototyping phase first. Moreover, prototyping consumes much more human time than big, expensive simulations. Therefore overwrite_existing=true is the correct default.

glwagner avatar Oct 23 '24 18:10 glwagner

Apart from overwrite_existing default, it seems like we have consensus that we should use add_output_writer! for this PR. I'll make that change and switch over the examples as well.

I'll open an issue to discuss the outputinfo utility.

I also think that an output! function would be useful, which simply writes the current state. Recently I have been found myself wanting only the final state of the simulation. It's a little convoluted to have to set up an output writer with a schedule for that task, it'd be easier to write

run!(simulation)
output!("cool_stuff.jld2", simulation)

in the above, the filename goes first because that's the thing being "modified" (similar to how save works)

glwagner avatar Oct 23 '24 18:10 glwagner

I vote to for an add_ prefix in the function name. I don't think it's ugly and it clarifies things a fair bit.

Also, rather than creating new structs for the format, why not use the actual writer constructors in the user interface? i.e. instead of

add_output!(simulation, outputs, NetCDFFormat(); kw...)

use

add_output!(simulation, outputs; writer=NetCDFOutputWriter, kw...)

tomchor avatar Nov 18 '24 18:11 tomchor

Also, rather than creating new structs for the format, why not use the actual writer constructors in the user interface?

In a future where the "writer" is subsurface and users only interact with output through this interface, which do you think would be preferred? Why would we want to keep the "outputwriter" syntax around?

glwagner avatar Nov 19 '24 01:11 glwagner

Also, rather than creating new structs for the format, why not use the actual writer constructors in the user interface?

In a future where the "writer" is subsurface and users only interact with output through this interface, which do you think would be preferred? Why would we want to keep the "outputwriter" syntax around?

I don't think it makes much difference for the user as long as the argument makes it clear what kind of file will be written. So in that sense I think NetCDFWriter does the job as well as NetCDFFormat without creating another struct. I also don't feel strongly about it so feel free to ignore.

tomchor avatar Nov 19 '24 14:11 tomchor

Also, rather than creating new structs for the format, why not use the actual writer constructors in the user interface?

In a future where the "writer" is subsurface and users only interact with output through this interface, which do you think would be preferred? Why would we want to keep the "outputwriter" syntax around?

I don't think it makes much difference for the user as long as the argument makes it clear what kind of file will be written. So in that sense I think NetCDFWriter does the job as well as NetCDFFormat without creating another struct. I also don't feel strongly about it so feel free to ignore.

Hmm yes, the difference mainly then is that the new NetCDFFormat() is instantiated whereas we would not do the same with NetCDFWriter.

If we use symbols then we can be potentially even more clear, eg

add_writer!(simulation, outputs, :netcdf; kw...)
add_writer!(simulation, outputs, :jld2; kw...)
add_writer!(simulation, outputs, :zarr; kw...)

But that does not lend it self to dispatch-based extension.

glwagner avatar Nov 19 '24 16:11 glwagner

I think using a new type is more future-proof. We can change the type NetCDFOutputWriter without changing the UI.

glwagner avatar Nov 19 '24 16:11 glwagner

Update on this: given we changed JLD2OutputWriter to JLD2Writer, I think a good API for this is add_writer!. That's concise enough, and similar to add_callback! so easy to remember I think.

glwagner avatar Jun 26 '25 17:06 glwagner