cli icon indicating copy to clipboard operation
cli copied to clipboard

`nextstrain run`: unmanaged pathogen support?

Open tsibley opened this issue 1 year ago • 9 comments

Support nextstrain run for unmanaged pathogen sources, i.e. working copies of our repos, to aid development.

Two avenues to consider for such support.

  1. "Editable" setups: nextstrain setup pointed at local directory (like Python's "editable installs")
  2. No setup required: nextstrain run pointed at a local directory (like nextstrain build)

"Editable" setups

Accept a local file path/URL when setting up a version.

nextstrain setup avian-flu@dev=file://$PWD/avian-flu
nextstrain setup avian-flu@dev=$PWD/avian-flu
nextstrain setup avian-flu@dev=./avian-flu

cd avian-flu
nextstrain setup avian-flu@dev=.

Invoked the same as any other version.

nextstrain run avian-flu@dev segment-focused /tmp/analysis

Can be approximated currently by:

$ ln -sv ~/nextstrain/measles ~/.nextstrain/pathogens/measles/local="$(echo -n local | base32)"
$ nextstrain run measles@local …

No setup required

Distinguish managed pathogens vs. non-managed pathogens by absence/presence of a path separator.

# Managed (current)
nextstrain run avian-flu segment-focused /tmp/analysis
nextstrain run avian-flu@v2 segment-focused /tmp/analysis

# Unmanaged (new)
nextstrain run ./avian-flu segment-focused /tmp/analyis

cd avian-flu
nextstrain run ./ segment-focused /tmp/analysis

and maybe special case . too for convenience?

nextstrain run . segment-focused /tmp/analysis

Note that we'd still keep pathogen and workflow as separate arguments (i.e. rather than combining them the way nextstrain build does) so that workflows are still names and can be resolved via pathogen registration metadata in the future.

History

I'd long anticipated some sort of "editable install" feature for nextstrain setup to support pathogen development, but several review comments spurred me to think about supporting directory paths directly in nextstrain run.

tsibley avatar Mar 12 '25 21:03 tsibley

@jameshadfield @joverlee521 You're implicated in previous discussion/commentary here and so I'd definitely appreciate any thoughts you have.

tsibley avatar Mar 12 '25 21:03 tsibley

Huh, I was not expecting to use nextstrain run for development purposes, I thought that will remain as nextstrain build...the unmanaged pathogen support blurs the lines between the two commands that confuses me.

Am I understanding correctly that the main feature from nextstrain run that is helpful for development is the ability to use an external analysis directory? Could we just add this functionality to nextstrain build and keep nextstrain run simple?

Or should we port all of nextstrain build functionality to nextstrain run and deprecate the nextstrain build command?

joverlee521 avatar Mar 12 '25 22:03 joverlee521

It definitely blurs the lines a bit. There's still a line though: nextstrain build gives you direct access to Snakemake options and conceptually is oriented as a Snakemake wrapper. nextstrain run does not and conceptually is more removed from Snakemake.

I think external analysis directory is the main thing at the moment, yeah. This was driven by @jameshadfield's desired use cases. Adding support for external analysis directories to nextstrain build means adding an option, e.g. --analysis-dir or --workdir, and either a) doing some more assuming than previously about Snakemake setups¹ (e.g. fixed Snakefile instead of Snakemake's default behaviour of searching for the alternates snakefile, workflow/Snakefile, and workflow/snakefile) or b) doing a fair bit of reworking of assumptions on the AWS Batch entrypoint. All doable, but it seemed maybe better (and certainly easier) to bring unmanaged pathogen support to nextstrain run instead of bringing analysis dirs to nextstrain build. Maybe not though! (Or maybe we should do both?)

I don't think we should port all of nextstrain build to nextstrain run and deprecate build. Different interfaces for different purposes are ok, in my mind, and I don't see a reason to force anyone to change away from nextstrain build if it's working for them.

tsibley avatar Mar 13 '25 19:03 tsibley

As long as the two commands are separate, another reason this support seems useful is that it means nextstrain run can be tested during development (alongside nextstrain build use) without the slow, frustrating iteration loop of pushing stuff to GitHub and running nextstrain setup or nextstrain update again.

tsibley avatar Mar 13 '25 20:03 tsibley

Different interfaces for different purposes are ok, in my mind, and I don't see a reason to force anyone to change away from nextstrain build if it's working for them.

True! After thinking on this more, I see the need for supporting development with nextstrain run. If we eventually want all pathogen repos to work with nextstrain run, it makes more sense to test with it during development!

Personally, I do reach for the various Snakemake options when testing workflow changes, so I will have to figure out what's the best development cycle here...

it means nextstrain run can be tested during development (alongside nextstrain build use) without the slow, frustrating iteration loop of pushing stuff to GitHub and running nextstrain setup or nextstrain update again.

That sounds like good reason to go with your proposed "No setup required" version.

joverlee521 avatar Mar 13 '25 21:03 joverlee521

doing some more assuming than previously about Snakemake setups¹

@tsibley what does the ¹ refer to?

victorlin avatar Mar 13 '25 23:03 victorlin

If nextstrain build is intended to be "dev mode" for all scenarios, including external analysis directories, it seems necessary for that command to support external analysis directories. Otherwise, users would need to copy/link config files back to the pathogen repo to run something like nextstrain build . segment-focused --forceall, which would be cumbersome.

victorlin avatar Mar 13 '25 23:03 victorlin

@joverlee521 wrote:

without the slow, frustrating iteration loop of pushing stuff to GitHub and running nextstrain setup or nextstrain update again.

That sounds like good reason to go with your proposed "No setup required" version.

To make sure we have a shared understanding, an "editable" setup doesn't have that slow, frustrating iteration loop either: it's got a one-time initial setup for a local repo but doesn't need re-setup after changes. Only the status quo of #407 without any of the support discussed here has the slow loop.

@victorlin wrote:

what does the ¹ refer to?

orz

The immediately following parenthetical, which I'd initially thought to make a footnote but then decided was more relevant than that.

If nextstrain build is intended to be "dev mode" for all scenarios,

I don't see the nextstrain build vs. nextstrain run distinction as "dev mode" vs. "non-dev mode", although maybe in practice that'll end up being the primary distinction? Should it be the main distinction we make?

I do see nextstrain run as "higher-level" than nextstrain build, but not just along a development/non-development axis.

But point taken that we'll probably find it limiting to not have nextstrain build support analysis directories outside of the pathogen repo, and so will probably need to add support there. I did some additional work in https://github.com/nextstrain/docker-base/pull/245 that moves towards my "b) doing a fair bit of reworking of assumptions on the AWS Batch entrypoint" scenario above, so maybe not so bad to do the rest too.

tsibley avatar Mar 14 '25 19:03 tsibley

Very brief comments from me (although I haven't thoroughly read everything in this issue yet)

I think external analysis directory is the main thing at the moment, yeah. This was driven by @jameshadfield's desired use cases.

Locally I'll probably still use snakemake directly, but being able to use AWS for dev work will be really nice. nextstrain run ~/.../avian-flu seems a good direction to aim for, but I have no problem symlinking for the short/medium/long/forever term.

Being able to pass args to the underlying runner (snakemake) would be useful, but not a pressing priority IMO. I don't have commentary on whether that's build or run.

jameshadfield avatar Mar 15 '25 03:03 jameshadfield

Reviewing this again today, I'm picking option (2) no setup required: nextstrain run can be pointed at a local pathogen directory (like nextstrain build).

It is simpler to explain and use and simpler to implement (e.g. no need to exclude local "editable" setups in nextstrain update).

Relatedly, for third-party pathogen repo support, I don't want to single-source GitHub as org/repo, so we won't be using path separators as the marker for a third-party repo.

tsibley avatar Sep 17 '25 20:09 tsibley