WIP: modprobe: efficient and extensible flux startup and shutdown

Open grondo opened this issue 7 months ago • 8 comments

This experimental PR is a prototype of a replacement for the Flux rc startup/shutdown system. It started with a few goals:

  • parallelize module loading and other rc "tasks" to speed up startup
  • improve upon the method of overriding modules that have alternatives, like sched and feasibility
  • easily restrict modules/tasks to certain ranks

I'm somewhat satisfied with the interface in this prototype, so I'm posting it early for feedback on that aspect of the design, before going on to do documentation and tests. I'm looking for feedback on the overall scheme here, and on whether it would be an acceptable path forward.

I apologize, the description below is lengthy:

The proposed design here consists of three components: a TOML configuration specification for expressing modules and their relationships and requirements (modules.toml and etc/modules.d/*.toml); a new Python interface to define tasks that run during rc1 and rc3; and a flux modprobe command that processes the previous two and runs the tasks defined for the runlevel as efficiently as possible. Each of these components is described in more detail below.

modules.toml

The modules.toml file defines modules in Flux, whether they require other modules, any broker attrs/config they need, and the ranks on which they should be loaded. The flux-core modules are all defined in /etc/flux/modules.toml, and extra modules can be defined in modules.d/*.toml. The file currently contains one [[modules]] array, each entry of which defines a module and supports the following keys (this is taken from the top of the existing modules.toml):

#   name     (required) The module name. This will be the target of module load
#            and remove requests. If there is a name collision between entries,
#            then the last one loaded will be used.
#
#   module   (optional) The module to load if different from name.
#
#   provides (optional) List of services this module provides, e.g. "sched".
#             If multiple modules provide the same service name, the last one
#             loaded takes precedence by default, though this can be influenced
#             by configuration or broker attributes.
#
#   args     (optional) An array of module arguments.
#
#   ranks    (optional) The set of ranks on which this module should be loaded.
#            May either be an RFC 22 idset string, or a string beginning with
#            `>` or `<` followed by a single integer (e.g. ">0" to load a
#            module on all ranks but rank 0).
#
#   requires  (optional) An array of services this module requires, i.e. if
#             this module is loaded then services/modules in requires will also
#             be loaded. If this module also has to be loaded after any
#             required modules, add them to the after array as well.
#
#   after     (optional) An array of modules for which this module must be
#             loaded after. If this module also requires the service or module
#             it is loaded after, then the module must also be added to
#             `requires`.
#
#   needs     (optional) An array of modules which are required for this module
#             to be loaded.
#
#   needs-config (optional) An array of configuration keys in dotted key
#                form which are required for this module to be loaded.
#                If the key is not set, then the module is skipped.
#
#   needs-attrs  (optional) Same as with needs-config, but for broker
#                attributes.

Here's an example module:

[[modules]]
name = "cron"
ranks = "0"
requires = ["heartbeat"]
args = ["sync=heartbeat.pulse"]

This entry defines the cron module, only loaded on rank 0. The cron module requires the heartbeat module, and it should be loaded by default with the args sync=heartbeat.pulse.
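To make the ranks expressions concrete, here is a minimal sketch of how such an expression could be evaluated against a broker rank. The helper names parse_idset and rank_matches are hypothetical and are not taken from this PR, the idset parsing covers only a simplified subset of RFC 22 (comma-separated integers and ranges, no brackets), and the "all" value comes from the @task documentation rather than from modules.toml.

```python
def parse_idset(text):
    """Expand a simple idset such as "0", "0-3" or "0,2,5-7" into a set of ints.

    Hypothetical helper: handles only a subset of RFC 22 idset syntax.
    """
    ranks = set()
    for part in text.split(","):
        if "-" in part:
            low, high = part.split("-", 1)
            ranks.update(range(int(low), int(high) + 1))
        else:
            ranks.add(int(part))
    return ranks


def rank_matches(expr, rank):
    """Return True if `rank` is selected by the rank expression `expr`."""
    expr = expr.strip()
    if expr == "all":
        return True
    if expr.startswith(">"):
        return rank > int(expr[1:])
    if expr.startswith("<"):
        return rank < int(expr[1:])
    return rank in parse_idset(expr)


# The cron entry above uses ranks = "0"; ">0" selects every rank except 0.
cron_ranks = [r for r in range(4) if rank_matches("0", r)]
follower_ranks = [r for r in range(4) if rank_matches(">0", r)]
print(cron_ranks, follower_ranks)
```

With four ranks, the first list contains only rank 0 and the second contains ranks 1 through 3.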

One caveat: the provides documentation claims to support a way to override the default alternative, but that is not yet implemented.

modprobe rcX.py files

flux modprobe replaces the rcX scripts with a Python file that defines a set of tasks to run and the relationships between those tasks so that interdependent tasks are run in the correct order. Non-module tasks are defined in a Python file via the @task decorator:

def task(name, **kwargs):
    """
    Decorator for modprobe "rc" task functions.

    This decorator is applied to functions in an rc1 or rc3 python
    source file to turn them into valid flux-modprobe(1) tasks.

    Args:
    name (required, str): The name of this task.
    ranks (optional, str): A rank expression that indicates on which
        ranks this task should be invoked. ``ranks`` may be a valid
        RFC 22 Idset string, a single integer prefixed with ``<`` or
        ``>`` to indicate matching ranks less than or greater than a
        given rank, or the string ``all`` (the default if ``ranks``
        is not specified). Examples: ``0``, ``>0``, ``0-3``.
    requires (optional, list): An optional list of task or module names
        this task requires. This is used to ensure required tasks are
        active when activating another task. It does not indicate that
        this task will necessarily be run before the tasks it requires.
        (See ``before`` for that feature)
    needs (optional, list): Disable this task if any task in ``needs`` is
        not active.
    provides (optional, list): An optional list of string service names
        that this task provides. This can be used to set up alternatives
        for a given service. (Mostly useful with modules)
    before (optional, list): A list of tasks or modules for which this task
        must be run before.
    after (optional, list): A list of tasks or modules for which this task
        must be run after.
    needs_attrs (optional, list): A list of broker attributes on which
        this task depends. If any of the attributes are not set then the
        task will not be run.
    needs_config (optional, list): A list of config keys on which this
        task depends. If any of the specified config keys are not set,
        then this task will not be run.

    Example:
    ::
        # Declare a task that will be run after the kvs module is loaded
        # only on rank 0
        @task("test", ranks="0", needs=["kvs"], after=["kvs"])
        def test_kvs_task(context):
            # do something with kvs
            pass
    """

The context here is a flux.modprobe.Context object which is shared between all tasks. It contains some convenience attributes and methods to get a shared Flux handle, send an rpc, or run something under bash, and it also offers a way to get broker attributes and config and to share arbitrary data between tasks. Here's an example of a task from rc1.py:

@task(
    "config-reload",
    ranks=">0",
    needs_attrs=["config.path"],
    before=["*"],
)
def config_reload(context):
    context.rpc("config.reload").get()

This task runs only on ranks != 0, only if the config.path broker attribute is set and runs before all other tasks. It sends the config.reload rpc and waits for the result.
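As an illustration of how before/after relationships (including before=["*"]) could be turned into groups of tasks that are safe to run in parallel, here is a minimal sketch built on the standard library's graphlib. It is not the scheduler implemented in this PR: the task table, its "after" entries, and the function names build_predecessors and parallel_batches are assumptions made for the example.

```python
from graphlib import TopologicalSorter

# Hypothetical task table: names map to their ordering constraints.
tasks = {
    "config-reload": {"before": ["*"]},
    "kvs": {},
    "heartbeat": {},
    "cron": {"after": ["heartbeat"]},
    "test": {"after": ["kvs"]},
}


def build_predecessors(tasks):
    """Map each task name to the set of tasks that must finish before it."""
    preds = {name: set() for name in tasks}
    for name, spec in tasks.items():
        for other in spec.get("after", []):
            preds[name].add(other)
        for other in spec.get("before", []):
            # "*" means this task runs before every other task.
            targets = [t for t in tasks if t != name] if other == "*" else [other]
            for target in targets:
                preds.setdefault(target, set()).add(name)
    return preds


def parallel_batches(tasks):
    """Return lists of tasks; every task in a batch can run concurrently."""
    sorter = TopologicalSorter(build_predecessors(tasks))
    sorter.prepare()
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


batches = parallel_batches(tasks)
for batch in batches:
    print(batch)
```

In this sketch config-reload forms the first batch on its own, kvs and heartbeat follow together, and cron and test come last. A real implementation would more likely start each task as soon as its own predecessors finish instead of waiting for a whole batch.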

When modprobe loads a *.py file, it will always first run the setup(context) function, if one is defined. This is where the rc file can define modules to load or remove, set up context data, etc. This is currently how an rc.d/*.py could set an alternative or extend module args, or a replacement rc1.py could load a subset of modules (though a more lightweight method could be implemented later).

Check out the modules.toml and rc1.py and rc3.py in this PR for full examples.

Transition

For now, the majority of rc1 and rc3 are replaced with flux modprobe rcX. The run through of FLUX_RC_EXTRA is still maintained for backwards compatibility, but some kind of transition plan for flux-sched will need to be implemented. If the overall design here is acceptable, we can work on that next.

Timing

This implementation reduces the rc1 runtime in Flux from ~2.3s to ~0.4s for a single rank flux start. To evaluate the prototype, the current version supports a --timing option which dumps the start/end times of all tasks into the KVS. Here are the results for a system instance startup as an example:

[Screenshot: 2025-04-19-073436_871x504_scrot]

grondo avatar Apr 19 '25 14:04 grondo

A really dumb initial comment. Seeing a file called modules.toml would make me think this is how I load modules into flux. Granted documentation is to be written, but perhaps rename to uhhh modules-definitions.toml? (and likewise modules-definitions.d/)?

chu11 avatar Apr 22 '25 23:04 chu11

it was originally modprobe.toml. Would that be preferable, since it is how you configure the modprobe tool?

grondo avatar Apr 22 '25 23:04 grondo

> it was originally modprobe.toml. Would that be preferable, since it is how you configure the modprobe tool?

Yeah, I think that would be better.

chu11 avatar Apr 23 '25 05:04 chu11

This is such a huge improvement and seems very well designed! IMHO we should press forward with this!

Apologies if this is getting ahead of this PR, but for selecting a scheduler or content back end, I wonder if an integer priority like in DebianAlternatives would be slightly more robust than using solely the position in the modules array? Then a framework project that implements a non-production grade alternative for something can declare a low priority in their toml fragment and avoid inadvertently becoming the default due to fragment sorting order. To further control alternative selections, there could be:

  • an optional site-provided alternatives.toml or something to override the default selection
  • per-instance override via the regular TOML config

Yeah this comment belongs in the "next steps discussion" category not in a critique of this PR, which improves upon the status quo in so many other ways.

garlick avatar May 20 '25 15:05 garlick

Oh yeah! I didn't think of that and alternatives style priority is a great idea.

grondo avatar May 20 '25 15:05 grondo

Ok, this is still a WIP, but I've made the use of flux modprobe in rc1 and rc3 opt-in (set FLUX_RC_USE_MODPROBE in the environment) to allow framework projects like flux-sched to transition to the new scheme without coordination.

So for now, modprobe will be used in the flux-core testsuite -- flux-sched tests should pass because rc1 and rc3 fall back to the existing scheme when FLUX_RC_USE_MODPROBE is not set.

Meanwhile, I can work up a PR for flux-sched that adds a modprobe.d/fluxion.toml file, and tweak the sharness environment so the toml file is found during testing, then set FLUX_RC_USE_MODPROBE and ensure everything works.

We can then merge this PR when ready, merge the flux-sched PR adopting modprobe there, and if everything works, finally remove the backwards compat from etc/rc1 and etc/rc3 in flux-core.

This should go slow with lots of testing.

grondo avatar Jun 26 '25 03:06 grondo

Also forgot to mention that this now has support for a module priority value. Module alternatives are sorted first by priority then by load order, so the last module configured gets preference if priorities are the same. The default priority is 100, and sched-simple has a configured priority of 50.

The priority of an existing module can be updated via a new TOML fragment, e.g.:

sched-simple.priority=1000

or via the [modules] table in the broker config:

flux batch --conf=modules.sched-simple.priority=1000 ...

Since the priority of other modules may not be known, an alternative can still be selected using the [modules.alternatives] table:

flux batch --conf=modules.alternatives.sched=sched-simple ...
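The selection rules described in this comment can be sketched as follows. This is only an illustration under stated assumptions, not the code in this PR: the function select_alternative, the module dict layout, and the second scheduler named sched-other are hypothetical, while the default priority of 100 and the sched-simple priority of 50 are taken from the comment.

```python
DEFAULT_PRIORITY = 100


def select_alternative(service, modules, alternatives=None, disabled=()):
    """Pick the module that provides `service`.

    `modules` is a list of dicts in load order. An explicit entry in
    `alternatives` wins; otherwise the highest priority wins, and the
    last module loaded wins when priorities are equal.
    """
    alternatives = alternatives or {}
    candidates = [
        (module.get("priority", DEFAULT_PRIORITY), index, module["name"])
        for index, module in enumerate(modules)
        if service in module.get("provides", []) and module["name"] not in disabled
    ]
    if not candidates:
        return None
    forced = alternatives.get(service)
    if forced is not None and any(name == forced for _, _, name in candidates):
        return forced
    # Tuples compare by priority first, then by load order.
    return max(candidates)[2]


modules = [
    {"name": "sched-simple", "provides": ["sched"], "priority": 50},
    {"name": "sched-other", "provides": ["sched"]},  # hypothetical module
]

default_choice = select_alternative("sched", modules)
forced_choice = select_alternative(
    "sched", modules, alternatives={"sched": "sched-simple"}
)
print(default_choice, forced_choice)
```

Raising the sched-simple priority to 1000, as in the TOML fragment above, would make it the default in this sketch without needing an explicit alternatives entry.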

grondo avatar Jun 27 '25 00:06 grondo

I've updated this PR with a few other modifications to support better integration with other framework projects like fluxion:

  • new env var FLUX_MODPROBE_DISABLE supports a comma-separated list of modules to disable.
  • disabled modules disable modules/tasks for which the disabled module is in the needs list
  • for backwards compatibility, FLUX_SCHED_MODULE=none disables the sched and feasibility services (and corresponding modules)
  • when a module is disabled by name, the next highest priority alternative will be used (if one is available). This lets you load sched-simple with FLUX_MODPROBE_DISABLE=sched-fluxion-resource for example.
  • module arguments can be overridden in context.setopt() with the new parameter overwrite=True.
  • context.setopt() splits on whitespace by default
  • added a --dry-run option to flux modprobe run|rc1|rc3 (useful for testing)
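A rough sketch of the disable behavior listed above: parse the comma-separated FLUX_MODPROBE_DISABLE value, then disable anything whose needs list contains a disabled entry. The helper names and the needs table are hypothetical, and applying the rule transitively is an assumption of this sketch, not something the list above states.

```python
def parse_disable_list(value):
    """Split a comma-separated list of module names, ignoring empty items."""
    return {name.strip() for name in value.split(",") if name.strip()}


def propagate_disabled(needs, disabled):
    """Return `disabled` plus every entry that needs a disabled entry.

    `needs` maps a module or task name to the list of names it needs.
    The loop repeats until no new entries are disabled (assumed transitive).
    """
    disabled = set(disabled)
    changed = True
    while changed:
        changed = False
        for name, required in needs.items():
            if name not in disabled and any(r in disabled for r in required):
                disabled.add(name)
                changed = True
    return disabled


# Hypothetical needs table for illustration only.
needs = {
    "sched-fluxion-qmanager": ["sched-fluxion-resource"],
    "qmanager-check": ["sched-fluxion-qmanager"],
    "sched-simple": [],
}

# In a real run this string would come from the FLUX_MODPROBE_DISABLE variable.
requested = parse_disable_list("sched-fluxion-resource, ")
effective = propagate_disabled(needs, requested)
print(sorted(effective))
```

Here sched-simple stays enabled, which matches the case described above where disabling sched-fluxion-resource lets the next alternative be loaded.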

grondo avatar Jul 16 '25 00:07 grondo

Just a thought, but it's the 17th, about 2 weeks from our next release. What if we merge this ASAP (after a quick review) and just plow straight ahead and work out any issues before the release? It could always be reverted if that proves to be too hard? I think this is going to be great.

garlick avatar Jul 17 '25 21:07 garlick

Unfortunately, there still isn't any documentation and I'm working through one last small feature: sysadmins may want to enable something like the feasibility service on login nodes on a case-by-case basis, so we may want a way to allow services to be selectively enabled in setup(), regardless of the default ranks = conditional.

That being said, in the current version FLUX_RC_USE_MODPROBE is required to be set to even use the new startup mechanism, so maybe we could go with that and produce some quick documentation, with the idea that we'd remove the older rc1 in a future release (giving us time to switch Fluxion over to the new thing as well).

grondo avatar Jul 17 '25 21:07 grondo

Oh, one other small issue here that could use someone's opinion:

When using flux modprobe, the current approach just sticks flux modprobe rc1/3 into rc1 and rc3, and keeps the existing loop to run files from rc1.d or rc3.d (with a fix to only run regular, executable files). However, modprobe also allows Python extensions to add extra setup() or tasks in the same rc1.d and rc3.d directories. I've just been keeping these .py files non-executable to avoid running them in the rc1/rc3 loops, but this seems fragile -- maybe they should go somewhere else? Got any opinions, or is the current approach ok?
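For reference, the "regular, executable files only" filter described above can be expressed in Python as below. The actual rc1 and rc3 loops are shell scripts, so this is just a sketch of the same check; the function name runnable_rc_scripts and the file names are made up for the example.

```python
import os
import tempfile


def runnable_rc_scripts(directory):
    """Return sorted paths of regular, executable files in `directory`."""
    scripts = []
    for entry in sorted(os.listdir(directory)):
        path = os.path.join(directory, entry)
        # Skip directories and anything without execute permission,
        # which is how non-executable .py extensions are kept out of the loop.
        if os.path.isfile(path) and os.access(path, os.X_OK):
            scripts.append(path)
    return scripts


with tempfile.TemporaryDirectory() as rcdir:
    for name, mode in (("10-run.sh", 0o755), ("20-extra.py", 0o644)):
        path = os.path.join(rcdir, name)
        with open(path, "w") as handle:
            handle.write("# placeholder\n")
        os.chmod(path, mode)
    found = [os.path.basename(p) for p in runnable_rc_scripts(rcdir)]

print(found)
```

The sketch also shows why the arrangement is fragile: a single chmod +x on a .py extension would make the shell loop try to run it.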

grondo avatar Jul 17 '25 22:07 grondo

Regarding the .py extensions, agreed, it does seem like those should be in a different, modprobe-specific directory, but I'm drawing a blank on naming :-(

garlick avatar Jul 18 '25 00:07 garlick

I'll just note that the expectation is that there won't be a need for the rc1.d and rc3.d shell scripts after moving to the modprobe implementation (extra modules are defined in TOML config and any extra tasks should probably be defined in Python...)

grondo avatar Jul 18 '25 01:07 grondo

Well a simple fix would be to just put everything under /etc/flux/modprobe, i.e. we'd have:

/etc/flux/modprobe
/etc/flux/modprobe/modprobe.toml
/etc/flux/modprobe/modprobe.d/
/etc/flux/modprobe/rc1.py
/etc/flux/modprobe/rc1.d/
/etc/flux/modprobe/rc3.py
/etc/flux/modprobe/rc3.d/

grondo avatar Jul 18 '25 01:07 grondo

Ok, this PR now places everything modprobe related under /etc/flux/modprobe.

I've also added a rough draft flux-modprobe(1) manpage, which should help reviewers or people that want to review the interface at least get a high level view.

I've removed WIP to get some real feedback hopefully.

grondo avatar Jul 28 '25 22:07 grondo

:egg:cellent! I'll try to get some review comments in while you are on vaca.

garlick avatar Jul 29 '25 00:07 garlick

Ok, I guess I'll set MWP here. This seems fairly safe since it doesn't do anything by default yet...

grondo avatar Aug 05 '25 01:08 grondo

Codecov Report

:x: Patch coverage is 96.23188% with 26 lines in your changes missing coverage. Please review. :white_check_mark: Project coverage is 84.01%. Comparing base (dafb921) to head (cb6b45f). :warning: Report is 610 commits behind head on master.

Files with missing lines                Patch %   Lines
src/bindings/python/flux/modprobe.py    96.16%    21 Missing :warning:
src/cmd/flux-modprobe.py                97.16%    4 Missing :warning:
src/bindings/python/flux/cli/base.py    50.00%    1 Missing :warning:
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6774      +/-   ##
==========================================
+ Coverage   83.90%   84.01%   +0.11%     
==========================================
  Files         543      545       +2     
  Lines       91115    91805     +690     
==========================================
+ Hits        76447    77134     +687     
- Misses      14668    14671       +3     
Files with missing lines                Coverage Δ
src/bindings/python/flux/cli/base.py    95.65% <50.00%> (+0.14%) :arrow_up:
src/cmd/flux-modprobe.py                97.16% <97.16%> (ø)
src/bindings/python/flux/modprobe.py    96.16% <96.16%> (ø)

... and 27 files with indirect coverage changes


codecov[bot] avatar Nov 03 '25 02:11 codecov[bot]