pangeo-forge-recipes Consolidate `-recipes`, `-runner`, and `meta-yaml-schema` into monorepo

Currently the following are developed in separate repos:

https://github.com/pangeo-forge/pangeo-forge-recipes
https://github.com/pangeo-forge/pangeo-forge-runner
https://github.com/pangeo-forge/meta-yaml-spec

As discussed at today's Coordination Meeting, this makes integration testing of these inter-connected components challenging, leading to frequent scenarios in which the latest release of -runner is broken when used alongside the latest release of -recipes. It also means that, in order to be deployed to Dataflow (a common development need), a given -recipes PR frequently needs a complementary -runner PR. See https://github.com/pangeo-forge/pangeo-forge-runner/pull/64 for an example of this type of challenge.

In order to iterate more quickly and effectively on both -recipes and -runner, more closely tying the development and testing cycles of these two repos together is essential. The attendees of today's Coordination Meeting expressed broad support for the use of a monorepo to achieve this goal. It was also noted that the question of where to manage a centralized Pangeo Forge documentation directory would be largely solved by combining these repos.

Furthermore, the goal (which we have previously discussed) of making -runner the recommended/default entrypoint for executing -recipes (which would result in less deviation between local/testing and cloud deployments) would be served by a tightly integrated monorepo.

The meta-yaml-spec repo is not currently used by -runner, but my proposal is that it should be included in this monorepo, in the interest of formalizing what -runner should expect to find in meta.yaml files.

@yuvipanda, you have previously expressed reluctance to use a monorepo, given (IIUC) the risks of added complexity (of git, e.g.) for developers, and also reduced modularity. You were not able to attend today's Coordination Meeting, so I'm tagging you specifically here in the event you'd like to weigh in with a dissenting view, which I of course welcome. I will confess that while I did initially resist this idea as well, it increasingly seems like the best way of addressing the (other complexity hurdle) of synchronizing the development of these closely-linked packages.

Based on today's discussion, the group seemed to be in favor of still releasing -recipes and -runner as separate packages from a monorepo, which makes sense to me, as this salvages some of the benefits of modularity while also keeping testing/development tightly knit.

Apr 24 '23 22:04 cisaacstern

My reluctance is primarily against bundling the orchestrator into the same monorepo as the other ones. I personally don't think moving these to a monorepo will actually do much to help with the integration problems - as long as we must support recipes that are using older versions of pangeo-forge-recipes with newer versions of runner, these problems will exist. I think using a monorepo will cause more confusion there, as in general when using a monorepo there isn't really cross-version compatibility expected. But given recipes don't exist in this same repo, we need to provide some compatibility there for them. So I'm not sure a monorepo will help much here.

I also personally think that the meta-yaml-spec should just be part of recipes or runner, rather than its own repo.

From a prioritization perspective, do you think moving to a monorepo or making a decision about that is a blocker for getting beam-refactor merged?

Apr 25 '23 15:04 yuvipanda

against bundling the orchestrator into the same monorepo

Thanks for the reminder. 💯 agree!

as long as we must support recipes that are using older versions of pangeo-forge-recipes with newer versions of runner, these problems will exist

Great point, which nobody raised during the Coordination meeting discussion.

What do you think about dropping this design requirement, and instead bundling recipes + runner together as a single package? In this configuration, runner could be an optional install, via something like pip install pangeo-forge[cli].

This would introduce the need to dynamically install runner (or, pangeo-forge[cli]) for every deploy, though I've actually been doing this commonly anyway (since I frequently find myself installing from development branches of runner as it is).

IMO this seems like a good idea, if it provides us the opportunity for more rigorous integration testing. As one further example, the discussion in https://github.com/pangeo-forge/pangeo-forge-recipes/pull/486#issuecomment-1504025885 highlights how any additional "writer" transforms potentially introduce the need for corresponding runner PRs, insofar as they affect the ast re-writing logic there. In #486 I've made some assumptions about how that will work in runner which I think will pan out, but don't have any way to easily test those assumptions as part of #486.

I also personally think that the meta-yaml-spec should just be part of recipes or runner, rather than its own repo.

Agreed, runner seems like a natural fit (if runner and recipes remain separate). Or, as mentioned above, all three could be in one repo (assuming we drop the "runner can deploy multiple versions of recipes" requirement, which is appealing to me).

do you think moving to a monorepo or making a decision about that is a blocker for getting beam-refactor merged?

Nope, just thinking ahead here.

Apr 25 '23 20:04 cisaacstern

I think the fundamental question to answer really is 'how will we support recipes that are using versions of pangeo_forge_recipes different from latest?'. Can you expand a little more on what you mean by:

What do you think about dropping this design requirement

Which one? Supporting older versions of pangeo_forge_recipes with newer versions of runner? How would this work for orchestrator?

This is an important problem to solve and think through, and fairly orthogonal to monorepos IMO. This is why I think the amount of work we'll have to put into moving things into a monorepo won't actually pay off here - this complexity of multiple versions needing to be supported is the essential problem of our integration tests, the fact that they're in different repos is merely accidental.

Apr 25 '23 20:04 yuvipanda

What do you think about dropping this design requirement

Which one? Supporting older versions of pangeo_forge_recipes with newer versions of runner?

Yes, this one.

I am proposing that recipes is packaged together with the runner cli, so deploying a given repo requires having the version of runner specified by that repo installed. (In this scenario, runner would not actually be it's own package, it would just be part of a monolithic pangeo-forge package.)

How would this work for orchestrator?

Any process (manual/human, programmatic/orchestrator) would need to first introspect the repo to make sure they have the right version of the runner cli installed. This is actually probably easier for orchestrator (or a GitHub Action, etc.), as it's just a matter of checking out the requirements.txt. For a human, it could be a bit cumbersome.

this complexity of multiple versions needing to be supported is the essential problem of our integration tests, the fact that they're in different repos is merely accidental.

Yes, this is is incredibly complex. Verging on unmanageably so, IMO. Which is why I'm wondering if we should simplify it by just packaging runner together with recipes into a single pangeo-forge package.

Of course what we lose here is the ability to have a single runner installation able to deploy floating versions of recipes, and we give the operator (human, programmatic/orchestrator, etc.) the additional overhead of needing to dynamically install the cli itself (not just recipes) at deploy time.

But in exchange for these losses, we gain the ability to actually test recipes + runner together.

If recipes were relatively narrowly focused on pipelines that terminate in StoreToZarr, this would be less of a concern to me. But as #486 highlights, we do not actually currently know the names of all of the objects into which we want to inject runtime arguments. It feels very difficult to me to develop alternative outputs/targets (which by definition require runtime-injected arguments) without tightly coupling that development to the tooling which will actually be doing that injection.

Does that make sense?

Apr 25 '23 21:04 cisaacstern

What do you think about dropping this design requirement

Which one? Supporting older versions of pangeo_forge_recipes with newer versions of runner?

Yes, this one.

I am proposing that recipes is packaged together with the runner cli, so deploying a given repo requires having the version of runner specified by that repo installed. (In this scenario, runner would not actually be it's own package, it would just be part of a monolithic pangeo-forge package.)

From the recipe author perspective, a single pangeo-forge package makes sense, as it's a single package to install (or repo to clone) for recipe development. Ensuring at least one run of a recipe with the runner cli prior to recipe contribution should uncover issues earlier than has previously been the case (e.g. sandbox-based development) where an author may be unaware of runner.

How would this work for orchestrator?

Any process (manual/human, programmatic/orchestrator) would need to first introspect the repo to make sure they have the right version of the runner cli installed. This is actually probably easier for orchestrator (or a GitHub Action, etc.), as it's just a matter of checking out the requirements.txt. For a human, it could be a bit cumbersome.

If we used a single pangeo-forge package, I assume this introspection wouldn't be required during recipe development? I.e. a clone of the repo/pangeo-forge package installation would ensure the right cli version was installed?

this complexity of multiple versions needing to be supported is the essential problem of our integration tests, the fact that they're in different repos is merely accidental.

Yes, this is is incredibly complex. Verging on unmanageably so, IMO. Which is why I'm wondering if we should simplify it by just packaging runner together with recipes into a single pangeo-forge package.

Of course what we lose here is the ability to have a single runner installation able to deploy floating versions of recipes, and we give the operator (human, programmatic/orchestrator, etc.) the additional overhead of needing to dynamically install the cli itself (not just recipes) at deploy time.

Do we need to have a single runner able to deploy floating versions? If a recipe required a particular pangeo-forge version, then the correct cli would be automatically available and used if included in the package?

But in exchange for these losses, we gain the ability to actually test recipes + runner together.

The latter is a significant advantage.

Apr 26 '23 09:04 derekocallaghan

From the recipe author perspective, a single pangeo-forge package makes sense, as it's a single package to install (or repo to clone) for recipe development. Ensuring at least one run of a recipe with the runner cli prior to recipe contribution should uncover issues earlier than has previously been the case (e.g. sandbox-based development) where an author may be unaware of runner.

💯 agreed. @yuvipanda has long advocated to make the pangeo-forge-runner CLI (or its successor) the default for local execution as well, precisely to mitigate some local-to-cloud migration issues. While not required to make this a reality, having the CLI in the same monorepo as recipes makes this easier in terms of testing and documentation, IMO. (Of course, assuming the CLI cannot deploy floating versions of recipes.)

If we used a single pangeo-forge package, I assume this introspection wouldn't be required during recipe development? I.e. a clone of the repo/pangeo-forge package installation would ensure the right cli version was installed?

Correct. This would primarily come as an issue for deployment, not development.

Do we need to have a single runner able to deploy floating versions? If a recipe required a particular pangeo-forge version, then the correct cli would be automatically available and used if included in the package?

Yes, from the recipe developer's perspective, this is if anything a simplification.

Where a bit of complexity is added is for the recipe deployer. If we envsion some institution/company/operator managing deployment of a large collection of Pangeo Forge feedstock repo's, then presumably having a single CLI which can interact with all of them (or a large portion of them) could be an advantage.

To take an example from web development, in the scenario where recipes and runner remain decoupled, recipes plays the role of FastAPI/Flask, etc. (the application layer) and runner plays the role of Uvicorn/Gunicorn etc. (the server/deployment layer). There is some loose version compatibility there, but for the most part recent-ish versions of these servers can deploy most/all recent-ish versions of those applications.

The decoupled design is undeniably elegant in its modularization, but specifically I would say the fact that runner must rewrite the recipe module's abstract syntax tree (ast) to inject runtime parameters, makes our case distinct, and suggests the need/value for tighter coupling.

A key constraint of the ast-rewriting is that pangeo-forge-runner needs to know:

Names of the recipes objects (e.g., PTransforms) which will require runtime parameter injection
Names of the parameters on those objects to inject values for

If these names were going to be relatively stable, then this might be less of a concern, but #486 demonstrates that we probably do want the flexibility to easily iterate on the variety of transforms which will require parameter injection. Even though #486 is merged, recipes that leverage its transforms are not currently deployable with pangeo-forge-runner, because a complementary PR will be required there to make the CLI aware of the newly-added PTransform names in which to inject runtime params. Had we been in a monorepo, this fix could have been part of #486.

But in exchange for these losses, we gain the ability to actually test recipes + runner together.

The latter is a significant advantage.

Agreed.

May 05 '23 19:05 cisaacstern

I believe an alternative to the monorepo (i.e. tighter coupling) might be even looser coupling, which might mean exchanging the use of ast-rewriting for runtime parameter injection (which requires knowledge of recipes object names) for some more primitive method, such as environment variables, jinja templating, or similar. There probably are reasons why these approaches were not used initially, but @yuvipanda may be able to speak more to that.

May 05 '23 19:05 cisaacstern

The even looser coupling is accomplished via https://github.com/pangeo-forge/pangeo-forge-runner/pull/86 :)

Aug 25 '23 18:08 yuvipanda

pangeo-forge-recipes pangeo-forge-recipes copied to clipboard

Consolidate `-recipes`, `-runner`, and `meta-yaml-schema` into monorepo

pangeo-forge-recipes
pangeo-forge-recipes copied to clipboard