flux-core
WIP: modprobe: efficient and extensible flux startup and shutdown
This experimental PR is a prototype of a replacement for the Flux rc startup/shutdown system. It started with a few goals:
- parallelize module loading and other rc "tasks" to speed up startup
- improve upon the method of overriding modules that have alternatives, like sched and feasibility
- easily restrict modules/tasks to certain ranks
I'm somewhat satisfied with the interface in this prototype, so I'm posting it early for feedback on that aspect of the design, before going on to do documentation and tests. I'm looking for feedback on the overall scheme here, and whether it would be an acceptable path forward.
I apologize, the description below is lengthy:
The proposed design here consists of 3 components: a TOML configuration specification for expressing modules and their relationships and requirements (modules.toml and etc/modules.d/*.toml), a new Python interface to define tasks that run during rc1 and rc3, and finally a flux modprobe command that processes the TOML and Python files and runs the tasks defined for the runlevel as efficiently as possible. Each of these components is described in more detail below:
modules.toml
The modules.toml file defines modules in Flux, whether they require other modules, any broker attrs/config they need, and the ranks on which they should be loaded. The flux-core modules are all defined in /etc/flux/modules.toml, and extra modules can be defined in modules.d/*.toml. The file currently contains one [[modules]] array, each entry of which defines a module and supports the following keys (this is taken from the top of the existing modules.toml)
# name (required) The module name. This will be the target of module load
# and remove requests. If there is a name collision between entries,
# then the last one loaded will be used.
#
# module (optional) The module to load if different from name.
#
# provides (optional) List of services this module provides, e.g. "sched".
# If multiple modules provide the same service name, the last one
# loaded takes precedence by default, though this can be influenced
# by configuration or broker attributes.
#
# args (optional) An array of module arguments.
#
# ranks (optional) The set of ranks on which this module should be loaded.
# May either be an RFC 22 idset string, or a string beginning with
# `>` or `<` followed by a single integer (e.g. ">0" to load a
# module on all ranks but rank 0).
#
# requires (optional) An array of services this module requires, i.e. if
# this module is loaded then services/modules in requires will also
# be loaded. If this module also has to be loaded after any
# required modules, add them to the after array as well.
#
# after (optional) An array of modules for which this module must be
# loaded after. If this module also requires the service or module
# it is loaded after, then the module must also be added to
# `requires`.
#
# needs (optional) An array of modules which are required for this module
# to be loaded.
#
# needs-config (optional) An array of configuration keys in dotted key
# form which are required for this module to be loaded.
# If the key is not set, then the module is skipped.
#
# needs-attrs (optional) Same as with needs-config, but for broker
# attributes.
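To make the needs-config rule concrete, here is a minimal Python sketch of a dotted-key check against a nested config dictionary. It is not taken from the PR: the helper names `config_has_key` and `should_load`, the entry, and the `example.path` key are all hypothetical, and the real implementation may treat empty or null values differently.

```python
def config_has_key(config, dotted_key):
    """Return True if dotted_key (e.g. "a.b.c") resolves to a value in config."""
    node = config
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return False
        node = node[part]
    return True


def should_load(entry, config):
    """Apply the needs-config rule: skip the module if any listed key is unset."""
    return all(config_has_key(config, key) for key in entry.get("needs-config", []))


# Hypothetical module entry and broker config, for illustration only
entry = {"name": "example-module", "needs-config": ["example.path"]}
print(should_load(entry, {"example": {"path": "/tmp/x"}}))  # True
print(should_load(entry, {"example": {}}))                  # False
```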
Here's an example module:
[[modules]]
name = "cron"
ranks = "0"
requires = ["heartbeat"]
args = ["sync=heartbeat.pulse"]
This entry defines the cron module, only loaded on rank 0. The cron module requires the heartbeat module, and it should be loaded by default with the args sync=heartbeat.pulse.
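As an illustration of how a ranks value such as "0" or ">0" could be evaluated, here is a small Python sketch. The function `rank_matches` is hypothetical and only handles a simplified subset of RFC 22 idsets (comma-separated ids and inclusive ranges); it is not the parser used by flux modprobe.

```python
def rank_matches(ranks, rank):
    """Return True if broker `rank` is selected by the `ranks` expression."""
    ranks = ranks.strip()
    if ranks.startswith(">"):
        return rank > int(ranks[1:])
    if ranks.startswith("<"):
        return rank < int(ranks[1:])
    # Simplified idset: comma-separated ids and inclusive ranges like "0,2-4"
    for item in ranks.split(","):
        low, _, high = item.partition("-")
        if int(low) <= rank <= int(high or low):
            return True
    return False


# The cron example above uses ranks = "0"
print(rank_matches("0", 0))   # True
print(rank_matches("0", 1))   # False
print(rank_matches(">0", 3))  # True
```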
One caveat: the provides documentation claims to support a way to override the default alternative, but that is not yet implemented.
modprobe rcX.py files
flux modprobe replaces the rcX scripts with a Python file that defines a set of tasks to run and the relationships between those tasks so that interdependent tasks are run in the correct order. Non-module tasks are defined in a Python file via the @task decorator:
def task(name, **kwargs):
    """
    Decorator for modprobe "rc" task functions.
    This decorator is applied to functions in an rc1 or rc3 python
    source file to turn them into valid flux-modprobe(1) tasks.
    Args:
        name (required, str): The name of this task.
        ranks (optional, str): A rank expression that indicates on which
            ranks this task should be invoked. ``ranks`` may be a valid
            RFC 22 Idset string, a single integer prefixed with ``<`` or
            ``>`` to indicate matching ranks less than or greater than a
            given rank, or the string ``all`` (the default if ``ranks``
            is not specified). Examples: ``0``, ``>0``, ``0-3``.
        requires (optional, list): An optional list of task or module names
            this task requires. This is used to ensure required tasks are
            active when activating another task. It does not indicate that
            this task will necessarily be run before the tasks it requires.
            (See ``before`` for that feature)
        needs (optional, list): Disable this task if any task in ``needs`` is
            not active.
        provides (optional, list): An optional list of string service names
            that this task provides. This can be used to set up alternatives
            for a given service. (Mostly useful with modules)
        before (optional, list): A list of tasks or modules for which this task
            must be run before.
        after (optional, list): A list of tasks or modules for which this task
            must be run after.
        needs_attrs (optional, list): A list of broker attributes on which
            this task depends. If any of the attributes are not set then the
            task will not be run.
        needs_config (optional, list): A list of config keys on which this
            task depends. If any of the specified config keys are not set,
            then this task will not be run.
    Example:
    ::
        # Declare a task that will be run after the kvs module is loaded
        # only on rank 0
        @task("test", ranks="0", needs=["kvs"], after=["kvs"])
        def test_kvs_task(context):
            # do something with kvs
    """
The context here is a flux.modprobe.Context object which is shared between all tasks. It contains some convenience attributes and methods to get a shared Flux handle, send an rpc, or run something under bash, and offers a way to get broker attributes and config and to share arbitrary data between tasks. Here's an example of a task from rc1.py:
@task(
    "config-reload",
    ranks=">0",
    needs_attrs=["config.path"],
    before=["*"],
)
def config_reload(context):
    context.rpc("config.reload").get()
This task runs only on ranks != 0, only if the config.path broker attribute is set and runs before all other tasks. It sends the config.reload rpc and waits for the result.
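To show how before/after constraints like these can yield an ordering that still leaves room for parallelism, here is a sketch using Python's standard `graphlib.TopologicalSorter`. This is an assumption about one possible approach, not the scheduler in this PR. The task table and the helpers `build_sorter` and `run_in_batches` are hypothetical, and `"*"` is expanded to "every other task".

```python
from graphlib import TopologicalSorter

# Hypothetical tasks: name -> ordering constraints
tasks = {
    "config-reload": {"before": ["*"]},
    "heartbeat": {},
    "kvs": {},
    "test": {"after": ["kvs"]},
}


def build_sorter(tasks):
    """Translate before/after lists into predecessor edges."""
    sorter = TopologicalSorter()
    for name, spec in tasks.items():
        sorter.add(name)
        for other in spec.get("after", []):
            sorter.add(name, other)  # name runs after other
        for other in spec.get("before", []):
            # "*" expands to every other task
            targets = [t for t in tasks if t != name] if other == "*" else [other]
            for target in targets:
                sorter.add(target, name)  # target runs after name
    return sorter


def run_in_batches(tasks):
    """Return batches of task names; tasks within a batch are unordered."""
    sorter = build_sorter(tasks)
    sorter.prepare()  # raises graphlib.CycleError on circular constraints
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


print(run_in_batches(tasks))
# [['config-reload'], ['heartbeat', 'kvs'], ['test']]
```

Tasks that land in the same batch have no ordering constraint between them, so they are candidates for running concurrently. A real runner would mark each task done as it finishes rather than waiting for the whole batch.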
When modprobe loads a *.py file, it will always first run any defined setup(context) function. This is where the rc file can define modules to load or remove, set up context data, etc. This is currently how an rc.d/*.py could set an alternative or extend module args, or a replacement rc1.py could load a subset of modules (though a more lightweight method could be implemented later).
Check out the modules.toml and rc1.py and rc3.py in this PR for full examples.
Transition
For now, the majority of rc1 and rc3 are replaced with flux modprobe rcX. The run through of FLUX_RC_EXTRA is still maintained for backwards compatibility, but some kind of transition plan for flux-sched will need to be implemented. If the overall design here is acceptable, we can work on that next.
Timing
This implementation reduces the rc1 runtime in Flux from ~2.3s to ~0.4s for a single rank flux start. To evaluate the prototype, the current version supports a --timing option which dumps the start/end times of all tasks into the KVS. Here are the results for a system instance startup as an example:
A really dumb initial comment. Seeing a file called modules.toml would make me think this is how I load modules into flux. Granted documentation is to be written, but perhaps rename to uhhh modules-definitions.toml? (and likewise modules-definitions.d/)?
it was originally modprobe.toml. Would that be preferable, since it is how you configure the modprobe tool?
Yeah, I think that would be better.
This is such a huge improvement and seems very well designed! IMHO we should press forward with this!
Apologies if this is getting ahead of this PR, but for selecting a scheduler or content back end, I wonder if an integer priority like in DebianAlternatives would be slightly more robust than using solely the position in the modules array? Then a framework project that implements a non-production grade alternative for something can declare a low priority in their toml fragment and avoid inadvertently becoming the default due to fragment sorting order. To further control alternative selections, there could be:
- an optional site-provided alternatives.toml or something to override the default selection
- per-instance override via the regular TOML config
Yeah this comment belongs in the "next steps discussion" category not in a critique of this PR, which improves upon the status quo in so many other ways.
Oh yeah! I didn't think of that and alternatives style priority is a great idea.
Ok, this is still a WIP, but I've made the use of flux modprobe in rc1 and rc3 opt-in (set FLUX_RC_USE_MODPROBE in the environment) to allow framework projects like flux-sched to transition to the new scheme without coordination.
So for now, modprobe will be used in the flux-core testsuite -- flux-sched tests should pass because they fall back to the existing rc scripts when FLUX_RC_USE_MODPROBE is not set.
Meanwhile, I can work up a PR for flux-sched that adds a modprobe.d/fluxion.toml file, and tweak the sharness environment so the toml file is found during testing, then set FLUX_RC_USE_MODPROBE and ensure everything works.
We can then merge this PR when ready, merge the flux-sched PR adopting modprobe there, and if everything works, finally remove the backwards compat from etc/rc1 and etc/rc3 in flux-core.
This should go slow with lots of testing.
Also forgot to mention that this now has support for a module priority value. Module alternatives are sorted first by priority then by load order, so the last module configured gets preference if priorities are the same. The default priority is 100, and sched-simple has a configured priority of 50.
The priority of an existing module can be updated via a new TOML fragment, e.g.:
sched-simple.priority=1000
or via the [modules] table in the broker config:
flux batch --conf=modules.sched-simple.priority=1000 ...
Since the priority of other modules may not be known, an alternative can still be selected using the [modules.alternatives] table:
flux batch --conf=modules.alternatives.sched=sched-simple ...
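The selection rule described in this comment (highest priority wins, ties go to the last module configured, and an explicit alternatives entry overrides both) could be sketched in Python as follows. This is illustrative only: `select_alternative` and the `sched-example` module are hypothetical, and the real code in flux.modprobe may be structured differently.

```python
DEFAULT_PRIORITY = 100  # default priority stated in the comment above


def select_alternative(service, modules, alternatives=None):
    """Pick the module that provides `service`.

    modules: list of module entries (dicts) in load order.
    alternatives: optional mapping of service name -> forced module name.
    """
    alternatives = alternatives or {}
    candidates = [m for m in modules if service in m.get("provides", [])]
    if not candidates:
        return None
    forced = alternatives.get(service)
    if forced is not None:
        if forced not in [m["name"] for m in candidates]:
            raise ValueError(f"{forced} does not provide {service}")
        return forced
    # Highest priority wins; on a tie the later (last loaded) entry wins
    _, best = max(
        enumerate(candidates),
        key=lambda pair: (pair[1].get("priority", DEFAULT_PRIORITY), pair[0]),
    )
    return best["name"]


modules = [
    {"name": "sched-simple", "provides": ["sched"], "priority": 50},
    {"name": "sched-example", "provides": ["sched"]},  # hypothetical module
]
print(select_alternative("sched", modules))  # sched-example
print(select_alternative("sched", modules, {"sched": "sched-simple"}))  # sched-simple
```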
I've updated this PR with a few other modifications to support better integration with other framework projects like fluxion:
- new env var FLUX_MODPROBE_DISABLE supports a comma-separated list of modules to disable.
- disabled modules disable modules/tasks for which the disabled module is in the needs list
- for backwards compatibility, FLUX_SCHED_MODULE=none disables the sched and feasibility services (and corresponding modules)
- when a module is disabled by name, the next highest priority alternative will be used (if one is available). This lets you load sched-simple with FLUX_MODPROBE_DISABLE=sched-fluxion-resource for example.
- module arguments can be overridden in context.setopt() with the new parameter overwrite=True. context.setopt() splits on whitespace by default
- added a --dry-run option to flux modprobe run|rc1|rc3 (useful for testing)
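A rough Python sketch of the disable behavior in the list above: parse a comma-separated value and then disable anything whose needs list contains a disabled entry. The helpers `parse_disable` and `apply_disable` and the entries are hypothetical, and applying the rule transitively is an assumption; the PR text does not say whether propagation goes beyond one level.

```python
def parse_disable(value):
    """Split a comma-separated list such as a FLUX_MODPROBE_DISABLE value."""
    return [name.strip() for name in value.split(",") if name.strip()]


def apply_disable(entries, disabled):
    """Return the names of entries that remain enabled.

    entries: mapping of name -> {"needs": [...]} for modules and tasks.
    Anything that needs a disabled entry is disabled too. Repeating the
    pass until nothing changes makes this transitive (an assumption).
    """
    disabled = set(disabled)
    changed = True
    while changed:
        changed = False
        for name, spec in entries.items():
            if name not in disabled and disabled.intersection(spec.get("needs", [])):
                disabled.add(name)
                changed = True
    return [name for name in entries if name not in disabled]


# Hypothetical entries, for illustration only
entries = {
    "a": {},
    "b": {"needs": ["a"]},
    "c": {"needs": ["b"]},
    "d": {},
}
print(apply_disable(entries, parse_disable("a")))  # ['d']
```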
Just a thought, but it's the 17th, about 2 weeks from our next release. What if we merge this ASAP (after a quick review) and just plow straight ahead and work out any issues before the release? It could always be reverted if that proves to be too hard? I think this is going to be great.
Unfortunately, there still isn't any documentation and I'm working through one last small feature: sysadmins may want to enable something like the feasibility service on login nodes on a case-by-case basis, so we may want a way to allow services to be selectively enabled in setup(), regardless of the default rank = conditional.
That being said, in the current version FLUX_RC_USE_MODPROBE is required to be set to even use the new startup mechanism, so maybe we could go with that and produce some quick documentation, with the idea that we'd remove the older rc1 in a future release (giving us time to switch Fluxion over to the new thing as well)
Oh, one other small issue here that could use someone's opinion:
When using flux modprobe, the current approach just sticks flux modprobe rc1/3 into rc1 and rc3, and keeps the existing loop to run files from rc1.d or rc3.d (with a fix to only run regular, executable files). However, modprobe also allows Python extensions to add extra setup() or tasks in the same rc1.d and rc3.d directories. I've just been keeping these .py files non-executable to avoid running them in the rc1/rc3 loops, but this seems fragile -- maybe they should go somewhere else? Got any opinions, or is the current approach ok?
Regarding the .py extensions, agreed, it does seem like those should be in a different, modprobe-specific directory, but I'm drawing a blank on naming :-(
I'll just note that the expectation is that there won't be a need for the rc1.d and rc3.d shell scripts after moving to the modprobe implementation (extra modules are defined in TOML config and any extra tasks should probably be defined in Python...)
Well a simple fix would be to just put everything under /etc/flux/modprobe, i.e. we'd have:
/etc/flux/modprobe
/etc/flux/modprobe/modprobe.toml
/etc/flux/modprobe/modprobe.d/
/etc/flux/modprobe/rc1.py
/etc/flux/modprobe/rc1.d/
/etc/flux/modprobe/rc3.py
/etc/flux/modprobe/rc3.d/
Ok, this PR now places everything modprobe related under /etc/flux/modprobe.
I've also added a rough draft flux-modprobe(1) manpage, which should help reviewers or people that want to review the interface at least get a high level view.
I've removed WIP to get some real feedback hopefully.
:egg:cellent! I'll try to get some review comments in while you are on vaca.
Ok, I guess I'll set MWP here. This seems fairly safe since it doesn't do anything by default yet...
Codecov Report
:x: Patch coverage is 96.23188% with 26 lines in your changes missing coverage. Please review.
:white_check_mark: Project coverage is 84.01%. Comparing base (dafb921) to head (cb6b45f).
:warning: Report is 610 commits behind head on master.
Additional details and impacted files
@@ Coverage Diff @@
## master #6774 +/- ##
==========================================
+ Coverage 83.90% 84.01% +0.11%
==========================================
Files 543 545 +2
Lines 91115 91805 +690
==========================================
+ Hits 76447 77134 +687
- Misses 14668 14671 +3
| Files with missing lines | Coverage Δ | |
|---|---|---|
| src/bindings/python/flux/cli/base.py | 95.65% <50.00%> (+0.14%) | :arrow_up: |
| src/cmd/flux-modprobe.py | 97.16% <97.16%> (ø) | |
| src/bindings/python/flux/modprobe.py | 96.16% <96.16%> (ø) | |