cosima-cookbook

Should the cookbook default to stricter requirements when merging/concatenating data?

Open dougiesquire opened this issue 2 years ago • 8 comments

Motivating example here: https://github.com/COSIMA/cosima-recipes/issues/229. Files with the same naming in a single experiment are on different domains depending on the output* directory. Should the cookbook check whether indexes are the same for data being merged/concatenated?

E.g. passing join="exact" to the open_mfdataset() call within the cosima-cookbook will then return an error when indexes to be aligned are not equal. This could currently be passed through kwargs, but should it be the default?
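A minimal sketch of the behaviour described above, using small in-memory datasets and `xr.concat` rather than the cookbook or real model output. The helper `make_dataset`, the variable name `u` and the coordinate values are made up for illustration. `join="outer"` is passed explicitly to show the lenient behaviour.

```python
import numpy as np
import xarray as xr


def make_dataset(time, x):
    """Build a toy dataset with one variable `u` on (time, x). Hypothetical helper."""
    data = np.zeros((len(time), len(x)))
    return xr.Dataset(
        {"u": (("time", "x"), data)},
        coords={"time": time, "x": x},
    )


# Same variable name, but the second dataset covers a smaller x domain.
full = make_dataset([0, 1], [0, 1, 2, 3])
subset = make_dataset([2, 3], [0, 1])

# Lenient join: x becomes the union of both indexes and gaps are filled with NaN.
lenient = xr.concat([full, subset], dim="time", join="outer")
n_missing = int(lenient["u"].isnull().sum())

# Strict join: mismatched x indexes raise instead of being silently padded.
try:
    xr.concat([full, subset], dim="time", join="exact")
    raised = False
except ValueError as err:
    raised = True
    message = str(err)
```

With the lenient join the mismatch only shows up as NaN-padded values in the result. With `join="exact"` it surfaces immediately as an error.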

dougiesquire avatar Jan 25 '23 00:01 dougiesquire

Thanks for catching this @dougiesquire.

I'm not sure this is a cookbook issue. I think it's more an issue with the data itself. I don't think it's a good idea to have output defined on different regions using the same file name. I'd suggest that a good way to deal with this issue is to rename the ocean_daily_3d_u_%.nc files in output196-output279 to something like ocean_daily_3d_u_southern_ocean_%.nc. They can then be separated using the nc_file argument to cc.querying.getvar.

But I guess even then, it would still be useful to flag it so that the user knows they have to use nc_file.

rmholmes avatar Jan 25 '23 00:01 rmholmes

Thanks @rmholmes. I wasn't meaning to suggest that the issue is with the cookbook, but having join="exact" as default would've saved me a bunch of time yesterday. I.e. it could be useful for helping to find/flag issues with the data.

My guess is that most uses of the cookbook are to query/load datasets that should have consistent indexes. So having join="exact" as default could make sense - users could always override the kwarg if they want to merge inconsistent data. But, I'm probably just not across the full range of cookbook use cases.

dougiesquire avatar Jan 25 '23 00:01 dougiesquire

But yes, for fixing the specific issue with 01deg_jra55v13_ryf9091, changing the name of the nc files sounds sensible to me. Who would be in charge of doing that?

dougiesquire avatar Jan 25 '23 00:01 dougiesquire

Sorry @dougiesquire, I didn't completely take in your comment here as I'd copied my response across from the cosima-recipes issue you'd put up. I'd support a move to the stricter requirements.

I think @AndyHoggANU ran that simulation.

rmholmes avatar Jan 25 '23 00:01 rmholmes

I agree that join="exact" seems like a sensible default. It's a bit crazy that xarray tries to concatenate datasets like that in the first place!

angus-g avatar Jan 25 '23 01:01 angus-g

Is there any time penalty with join="exact"? I seem to recall folks complaining about xarray doing time-consuming checks on coordinates under some circumstances, but I may be misremembering, or the issue may no longer be a problem.

aidanheerdegen avatar Jan 25 '23 01:01 aidanheerdegen

I think the checks by compat are more expensive than those for join (which just tries to align dimension sizes)? Probably one of those things where it's best to just benchmark it.

angus-g avatar Jan 25 '23 01:01 angus-g

Agreed. I wouldn't expect any difference in speed for data that can be joined.

As @angus-g mentioned, there are other kwargs that can be changed to improve performance, but they require making some assumptions about the data being loaded. I don't know whether these are justified for the COSIMA data?

EDIT: see the Note here: https://docs.xarray.dev/en/stable/user-guide/io.html#reading-multi-file-datasets
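As a rough illustration of the kind of kwargs that note discusses, here is a sketch using `xr.concat` on in-memory data instead of `open_mfdataset` on files. The dataset layout (`u`, `area`, `x`) and the helper `make_part` are invented for the example. It assumes that variables without a time dimension are identical in every file, which is exactly the sort of assumption that would need checking for the COSIMA data.

```python
import numpy as np
import xarray as xr


def make_part(t0):
    """Toy dataset: `u` varies in time, `area` is static. Hypothetical helper."""
    return xr.Dataset(
        {
            "u": (("time", "x"), np.random.rand(2, 3)),
            "area": (("x",), np.ones(3)),
        },
        coords={"time": [t0, t0 + 1], "x": [0, 1, 2]},
    )


parts = [make_part(0), make_part(2)]

# Concatenate every data variable, including static ones, along time.
everything = xr.concat(parts, dim="time", data_vars="all")

# Only concatenate variables that already have a time dimension, take the
# static ones from the first dataset without comparing them, and still
# insist that the non-concatenated indexes match exactly.
minimal = xr.concat(
    parts,
    dim="time",
    data_vars="minimal",
    coords="minimal",
    compat="override",
    join="exact",
)
```

In this sketch `area` gains a time dimension in `everything` but stays on `x` alone in `minimal`. Whether skipping the comparison of static variables is safe depends on the data.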

dougiesquire avatar Jan 25 '23 01:01 dougiesquire