Relax Proj/Model/Run required hierarchy to arbitrary parent-children relations?
Though the Proj-Model-Run hierarchy (where each Run has a parent Model, and each Model has a parent Proj) is useful, I'm wondering if we (eventually) want to relax any assumptions regarding this model.
Where this would be immediately useful: enabling users to start playing around with aospy with less boilerplate. Currently, they have to specify objects at all three levels. It would be nice if they could just create one Run object and then be able to start submitting calculations. Even with only a single object, they can see the power of our automation via time ranges, annual cycle sub-ranges, etc. This would make the Jupyter Notebook tutorials more easily digestible also: the runs could be introduced first, and then the upper levels introduced incrementally.
Beyond improving the new-user experience, this might also be useful down the road if users come up with use cases that don't adhere to our proj-model-run picture (e.g. for observational data).
This would be a pretty fundamental break, though, so it's worth discussing thoroughly. I/O would be affected, in addition to several other things.
This relates to #34, which will be addressed in a PR in the not-so-distant future, once I finish #97.
👍 Yes, I'm in full support of experimenting with this.
@spencerkclark brought up some good points about this offline a couple weeks ago that I'll paraphrase here (corrections welcome).
The notion that each Run belongs to a single Model makes intuitive sense, in that any simulation necessarily was executed by a particular numerical model. But from this premise two (largely independent) issues arise:
- The next level of the hierarchy, i.e. that each Model belongs to a single Proj, is not as intuitive: the same model could be used to generate simulations of relevance to multiple projects. Strictly speaking, aospy does not prevent this, in that there is no check that the `Model` objects passed to a particular `Proj` weren't previously passed to another one. But that's at least how we (mostly me) have conceptualized it at times. (The next rung up is less problematic: it's hard to imagine a Proj pertaining to more than one Object Library.)
- For contexts like MIPs (Model Intercomparison Projects), where some number of simulations are replicated exactly across many models, it's a huge burden to create separate Run objects for each simulation in each model. A single Run should be able to describe the matching simulations across all the models in that MIP. So in that sense it's not actually the case that, in terms of the aospy objects, a `Run` necessarily has a single parent `Model`. Prior to our introduction of `DataLoader`s, I had hacked together a protocol for doing exactly this, for the purpose of looking at CMIP5 results. But that is no longer functional.
<end paraphrasing/begin me thinking out loud>
Both of these ultimately relate to permitting a single object to be the child of multiple parents. Currently the parent-child relationship is hard-coded to be two-way (at least in places): when a Proj is given its child Models, `Proj.__init__` then goes into each `Model` and modifies its `parent` attribute. This doesn't really jibe with having that Model be the child of multiple Projs.
Maybe the solution here is to implement `__copy__` and/or `__deepcopy__` methods for the aospy core objects, and have any new parent/child relationship lead to a new copy that encodes that relationship, thereby leaving the original object as the "template" that others can still grab from.
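A minimal sketch of that copy-on-assignment idea (simplified stand-in classes; the `parent` attribute and the deep-copy behavior are assumptions for illustration, not aospy's current implementation):

```python
import copy


class Model:
    def __init__(self, name, runs=None):
        self.name = name
        self.runs = list(runs or [])
        self.parent = None  # set when a Proj adopts a copy of this Model


class Proj:
    def __init__(self, name, models=None):
        self.name = name
        # Deep-copy each child so the original Model object remains an
        # unmodified "template"; only the copy gets its parent set here.
        self.models = [copy.deepcopy(model) for model in (models or [])]
        for model in self.models:
            model.parent = self
```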
@spencerahill thanks for updating this issue; good summary overall! One other option I wanted to get across in our conversation was actually to break the current three-level hierarchy (Proj -> Model -> Run), and move to a flatter hierarchy (Proj -> Run <- Model) (here the arrow defines the direction of the relationship, e.g. A -> B means A is a parent of B).
In this manner a Proj would consist of a collection of Run objects (with each Run pointing to its specific Model object). In this way one could still determine which Models were associated with a particular project (it would just be `set([run.model for run in proj.runs])`¹), if desired or needed, but you would not encounter any problems if you wanted to use Runs from a particular Model in more than one Proj.
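Roughly, the flatter arrangement might look like the following (these attribute and argument names are the proposal, not existing aospy API):

```python
class Run:
    def __init__(self, name, model):
        self.name = name
        self.model = model  # each Run points to the Model that produced it


class Proj:
    def __init__(self, name, runs):
        self.name = name
        self.runs = list(runs)  # a Proj is just a collection of Runs

    def models(self):
        # Recover the Models associated with this Proj from its Runs.  Using
        # set() assumes Model is hashable -- see the footnote below.
        return set(run.model for run in self.runs)
```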
> the same model could be used to generate simulations of relevance to multiple projects
Along these lines, I think we should also consider the possibility of using the same Run object in multiple projects (probably less common than the same Model, but still possible); I think if coded appropriately, the flatter hierarchy (at this point I might not even refer to it as a hierarchy) could support that as well.
What are your thoughts on this proposal? Would this break anything we do currently or have designs on doing in the future?
¹ Note that converting a list of Model objects to a `set` relies on aospy core objects being hashable, which they currently are not (possibly a separate issue to consider).
> For contexts like MIPs (Model Intercomparison Projects), where some number of simulations are replicated exactly across many models, it's a huge burden to create separate Run objects for each simulation in each model. A single Run should be able to describe the matching simulations across all the models in that MIP. So in that sense it's not actually the case that, in terms of the aospy objects, a Run necessarily has a single parent Model.
Do you think it could be worth creating a separate type of object here? It's a little confusing (to me at least) to have a Run mean something different depending on the context (for me it's most intuitive for it to refer to a single simulation).
> In this manner a Proj would consist of a collection of Run objects (with each Run pointing to its specific Model object). In this way one could still determine which Models were associated with a particular project (it would just be `set([run.model for run in proj.runs])`¹), if desired or needed, but you would not encounter any problems if you wanted to use Runs from a particular Model in more than one Proj.
This is intriguing. So, in essence, you get `Proj.models` for free by inferring all of the models from `Proj.runs`. And `Proj.runs` (which does not currently exist) becomes the more "fundamental" attribute. Right?
The existing behavior is that effectively all of a Model's Runs become part of the Proj (although there's no Proj-level attr that stores this). What if we retain this behavior: when a Model is added to a Proj, each of its child Runs gets added to `Proj.runs`? In other words, adding a Model becomes effectively a convenience function for adding all of its child Runs to `Proj.runs` (in addition to the less important modification of `Proj.models`).
This still requires resolving the issue of using a single Run for multiple Models, though.
> What are your thoughts on this proposal? Would this break anything we do currently or have designs on doing in the future?
I think we could implement this without users feeling much change, although I'm not totally confident about that. In terms of the future, I'm feeling like there are several tangled high-level things we're considering (i.e. #88, #156, #34, #154), all of which require progress on (or will at least benefit from) some of the others, so this discussion seems only positive.
> Do you think it could be worth creating a separate type of object here? It's a little confusing (to me at least) to have a Run mean something different depending on the context (for me it's most intuitive for it to refer to a single simulation).
This will certainly require a new DataLoader, but I hadn’t thought about a new type. My initial reaction is to avoid adding a new type unless we really need to (or at least hiding that implementation detail from the user, i.e. have Run be subclassed by something like `SingleRun` and `RunTemplate` types).
> Note that converting a list of Model objects to a `set` relies on aospy core objects being hashable, which they currently are not (possibly a separate issue to consider).
This is a good point. IIRC you dealt with hashes some when experimenting with a DB implementation. What’s your take: is it reasonable to consider Proj/Model/Run objects immutable & hashable (and therefore fair game to implement `__hash__`)? My intuition is yes: I don’t see how the core attributes that differentiate them from others of their same type would change during their lifetimes.
But, now that I think about it, does this require preventing the setting of those core attributes after instantiation (i.e. by making them properties and using `@property.setter`)? Consider `model1 = Model('model', runs=None, ...)` followed later by `model1.runs = [run1, run2]`. Shouldn’t that change the hash? As a concrete example, the Examples page does this with `proj.regions`.
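To make the hazard concrete, here is a toy (non-aospy) example of why hashability and after-the-fact mutation of core attributes don't mix: if the hash incorporates `runs`, reassigning `runs` after the object has been placed in a set silently breaks membership tests.

```python
class Model:
    def __init__(self, name, runs=None):
        self.name = name
        self.runs = tuple(runs or ())

    def __hash__(self):
        # Hash over the "core" attributes, including runs.
        return hash((self.name, self.runs))


model1 = Model('model', runs=None)
models = {model1}
model1.runs = ('run1', 'run2')  # mutating runs changes the hash...
print(model1 in models)         # ...so the set can no longer find it: False
```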
If you think this is worth pursuing further, let's open a separate issue for it.
> And `Proj.runs` (which does not currently exist) becomes the more "fundamental" attribute. Right?
Yes, this is exactly what I had in mind.
> What if we retain this behavior: when a Model is added to a Proj, each of its child Runs gets added to `Proj.runs`? In other words, adding a Model becomes effectively a convenience function for adding all of its child Runs to `Proj.runs` (in addition to the less important modification of `Proj.models`).
In this case would it be too much trouble to simply do something like:
```python
from runs import run_A, run_B
from models import example_model

example_proj = Proj(
    name='example',
    runs=[run_A, run_B] + example_model.runs,
    ...
)
```
I feel like something like that would be more flexible, and easier to code and document (we'd basically get it for free). Addressing #88 could also help here, in that one could select subsets of Runs based on tags from a particular Model object and append those to the list of Runs associated with a Proj.
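For example, if Runs eventually carry tag metadata via #88 (the `tags` attribute and the tag value here are purely hypothetical), the selection could be an ordinary comprehension appended to the explicit list:

```python
from runs import run_A, run_B
from models import example_model

# Hypothetical: keep only the Runs of example_model carrying a given tag.
tagged_runs = [run for run in example_model.runs if 'perturbation' in run.tags]

example_proj = Proj(
    name='example',
    runs=[run_A, run_B] + tagged_runs,
)
```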
> This will certainly require a new DataLoader, but I hadn’t thought about a new type. My initial reaction is to avoid adding a new type unless we really need to (or at least hiding that implementation detail from the user, i.e. have Run be subclassed by something like `SingleRun` and `RunTemplate` types).
Forgive me, I'm not super-familiar with this use case and how it was set up before; in the past, would you create separate Model objects for each Model that was included in a MIP, and then use a single Run object to interface with all of them? Would it be possible instead to create the many similar Run objects (differing only in the Model used) using a basic list comprehension in your object library? Then you could just pass that list to your main script to do calculations on all of those Runs (this way you would not need to break the notion that a Run is always a single simulation carried out by a single Model).
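Something along these lines, sketched under the proposed flat layout where a Run points at its Model (the constructor arguments and names here are illustrative, not aospy's current signatures):

```python
# One Model object per GCM participating in the MIP (names illustrative).
cmip5_models = [Model(name=name)
                for name in ('GFDL-CM3', 'CCSM4', 'MPI-ESM-LR')]

# One Run per (model, experiment) pair, generated by a list comprehension
# rather than a single "template" Run shared across all the models.
historical_runs = [Run(name='{}_historical'.format(model.name), model=model)
                   for model in cmip5_models]
```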
> But, now that I think about it, does this require preventing the setting of those core attributes after instantiation (i.e. by making them properties and using `@property.setter`)? Consider `model1 = Model('model', runs=None, ...)` followed later by `model1.runs = [run1, run2]`. Shouldn’t that change the hash? As a concrete example, the Examples page does this with `proj.regions`.
Indeed, I think we need to think more carefully about this. Typically mutable collections are not hashable; therefore, in the strict sense, I think we'd perhaps have to decide against making Proj and Model objects hashable. I don't think this precludes pursuing the modified hierarchy path, however.
> Indeed, I think we need to think more carefully about this. Typically mutable collections are not hashable; therefore, in the strict sense, I think we'd perhaps have to decide against making Proj and Model objects hashable. I don't think this precludes pursuing the modified hierarchy path, however.
I agree, this is largely orthogonal. I opened #170 to discuss this.
> in the past, would you create separate Model objects for each Model that was included in a MIP, and then use a single Run object to interface with all of them?
Basically yes: one Model per GCM, and one Run per simulation specification in that MIP. And then I had logic that, in a Calc, combined the Run and Model attributes in order to find the data for that run in that model.
> Would it be possible instead to create the many similar Run objects (differing only in the Model used) using a basic list comprehension in your object library?
This isn't a bad idea. Definitely seems like the simplest solution. Let me play around with this and see how it feels (I actually need to do some new calculations on CMIP data for a project anyways).
> I feel like something like that would be more flexible, and easier to code and document (we'd basically get it for free).
Yes, this would be totally acceptable. I still would like a `Proj.models` attribute, even if it's not used by aospy and is purely for the user's information. It just seems to me that somebody should be able to get a list of all the Models used in a Proj.
My other concern is limiting API breakage, i.e. for those who have existing object libraries (which is potentially n > 2 at this point!), I'd prefer that `Proj('name', models=...)` not raise. Or at least do a deprecation cycle?
> This isn't a bad idea. Definitely seems like the simplest solution. Let me play around with this and see how it feels (I actually need to do some new calculations on CMIP data for a project anyways).
Cool, let me know how it goes.
> I still would like a `Proj.models` attribute, even if it's not used by aospy and is purely for the user's information. It just seems to me that somebody should be able to get a list of all the Models used in a Proj.
Yes, if we go this route this shouldn't be too hard to add (even if we decide against making Model objects hashable).
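One way to provide that without requiring hashability is to derive the attribute from the Runs and de-duplicate by name (a sketch; `Proj.runs` and `run.model` are the proposed attributes, not current aospy API):

```python
class Proj:
    def __init__(self, name, runs):
        self.name = name
        self.runs = list(runs)

    @property
    def models(self):
        # De-duplicate by Model.name rather than by hash, so that Model
        # objects can remain mutable and unhashable.
        seen = {}
        for run in self.runs:
            seen.setdefault(run.model.name, run.model)
        return list(seen.values())
```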
> My other concern is limiting API breakage, i.e. for those who have existing object libraries (which is potentially n > 2 at this point!), I'd prefer that `Proj('name', models=...)` not raise. Or at least do a deprecation cycle?
Understood :). Nevertheless, do you think this change is worth pursuing?
Sorry for the delay on this. Just went through this whole thread again.
> Understood :). Nevertheless, do you think this change is worth pursuing?
Yes. To summarize, I'd say we have converged on adding `Proj.runs`. However, I realize that my desired `models` add-on still has some unaddressed implementation issues:
- If `Proj.runs` is set directly, e.g. `Proj('name', runs=[myrun1, myrun2, ...], ...)`, how would we populate the `models` attribute without the objects being hashable? Do we go by `model.name`?
- Conversely, if we enable my suggested `Proj.models` convenience function, `Proj('name', models=[mymodel1, mymodel2, ...], ...)`, `Proj.runs` gets set as the union of the elements in the `runs` attributes of all the Models. In this case, hashing isn't an issue, since each Model will have distinct child Runs.
- How do we handle it if both `Proj.models` and `Proj.runs` are given values? The concern here is duplicate values, i.e. a Model is included in `models` and one of its Runs is also included in `runs`. Do we raise? That seems excessive. Warn? (A sketch of one possible merging behavior follows this list.)
- Just to reiterate, in a break from the current setup, it would no longer be the case that every `Run` that is a child of one of the Models in `Proj.models` would "belong" to the `Proj`. `Proj` would not internally use `models` for essentially anything; it's purely for the user's information.
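One possible merging behavior, sketched here for discussion (the constructor signature and warning text are assumptions, not current aospy): treat `models=` as sugar for its child Runs, de-duplicate by object identity, and warn rather than raise on duplicates.

```python
import warnings


class Proj:
    def __init__(self, name, runs=None, models=None):
        self.name = name
        self.models = list(models or [])  # purely informational
        self.runs = list(runs or [])
        # models= is a convenience: pull in each listed Model's child Runs,
        # skipping (with a warning) any Run already passed in via runs=.
        for model in self.models:
            for run in model.runs:
                if any(run is existing for existing in self.runs):
                    warnings.warn('Run {!r} was passed via both runs= and '
                                  'models=; keeping a single copy.'
                                  .format(run.name))
                else:
                    self.runs.append(run)
```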
> If `Proj.runs` is set directly, e.g. `Proj('name', runs=[myrun1, myrun2, ...], ...)`, how would we populate the `models` attribute without the objects being hashable? Do we go by `model.name`?
Yes, this was my thinking here. I think we are already implicitly assuming that Model names have to be unique given the way we name output files (i.e. if one tried to do the same calculation in the same Proj with two distinct Model objects with the same name attribute, one saved file would overwrite the other).
> Conversely, if we enable my suggested `Proj.models` convenience function, `Proj('name', models=[mymodel1, mymodel2, ...], ...)`, `Proj.runs` gets set as the union of the elements in the `runs` attributes of all the Models. In this case, hashing isn't an issue, since each Model will have distinct child Runs.
> How do we handle it if both `Proj.models` and `Proj.runs` are given values? The concern here is duplicate values, i.e. a Model is included in `models` and one of its Runs is also included in `runs`. Do we raise? That seems excessive. Warn?
Above I suggested:
```python
from runs import run_A, run_B
from models import example_model

example_proj = Proj(
    name='example',
    runs=[run_A, run_B] + example_model.runs,
    ...
)
```
as an alternative to this convenience function that would avoid having to deal with this issue. My thinking was that it's more explicit and would make it clearer to the user that the `models` attribute of a `Proj` does not need to be specified explicitly. Though I do agree that a convenience method could make Proj-creation code a little cleaner. What do you think about the trade-offs here?
> I think we are already implicitly assuming that Model names have to be unique given the way we name output files (i.e. if one tried to do the same calculation in the same Proj with two distinct Model objects with the same name attribute, one saved file would overwrite the other).
Yes. Maybe this is a separate issue, but I think we should eventually be checking more explicitly for this. I think we could use `id`, e.g.
```python
for run1 in proj.runs:
    for run2 in proj.runs:
        if id(run1.model) != id(run2.model) and run1.model.name == run2.model.name:
            raise ValueError("Two models have the same name: {0} and {1}".format(
                run1.model, run2.model))
```
(I suspect there's a more elegant/idiomatic way of doing this.) This is also an issue at the Run (and even Proj) levels. This could be a major headache if somebody inadvertently ended up overwriting a bunch of their results.
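For what it's worth, a slightly more idiomatic way to express the same check might be to count the distinct Model objects behind each name (this assumes only the `proj.runs` and `run.model.name` attributes used above):

```python
from collections import Counter

# Map each distinct Model object (by identity) to its name, then flag any
# name that is backed by more than one object.
names_by_object = {id(run.model): run.model.name for run in proj.runs}
duplicated = [name for name, count in Counter(names_by_object.values()).items()
              if count > 1]
if duplicated:
    raise ValueError('Multiple distinct Model objects share a name: '
                     '{}'.format(duplicated))
```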
> What do you think about the trade-offs here?
Actually, in the spirit of The Zen of Python ("There should be one-- and preferably only one --obvious way to do it."), let's just hold off on the `models` route for now. However, in its place, we'll need to include in the docs an example of how to do this.
Thinking out loud, maybe we could add a "Recipes" or "Templates" section to the docs where we put commonly needed things like this.