mitogen: Evolution of ModuleFinder and ModuleResponder
mitogen.master.ModuleFinder locates Python modules on the controller that can be sent to children in response to a GET_MODULE request. It also scans the found module M to identify "related modules": a dependency tree of modules imported by M, directly or indirectly.
mitogen.master.ModuleResponder sends the found module and all related modules, proactively.
Current bugs and quirks
- False positives: sending related modules that are undesired, e.g. #1124
- False negatives: missing related modules e.g. #682
Ideas so far, not mutually exclusive
- Use ast to scan for import statements
  - Eliminates false negatives
  - Potentially increases false positives, depending on how we define desired criteria (e.g. import mitogen.master inside mitogen.main())
- Remove related functionality, rely on children sending GET_MODULE requests as needed
  - Eliminates a chunk of fiddly code
  - Reduces CPU and random disk IO on controller
  - Increases number of round trips
  - May increase Time To First Execution (TTFE, a term I just made up) on the child
  - May increase maintenance burden wrt new Ansible releases, manually maintaining a preload list of required Ansible modules
  - May make Mitogen less robust wrt the long tail of Python packages and Ansible collections
- Make related search/delivery configurable, e.g.
  - Something similar to # !mitogen: minify_safe markers
  - Environment variable(s)
Refs #1242 #1324 #1239
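To make the first idea above more concrete, here is a minimal sketch of an ast-based scan. It is not Mitogen's implementation: the function name scan_imports and the split into "module level" versus "nested" are assumptions made for illustration, and how the nested group is treated is exactly the "desired criteria" question raised above.

```python
import ast


def scan_imports(source):
    """Return (module_level, nested) lists of imported module names.

    Hypothetical helper: "nested" means the import statement sits inside
    a function or class body, e.g. a lazy import inside main().
    """
    tree = ast.parse(source)
    module_level, nested = [], []
    scopes = (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)

    def visit(node, in_def):
        for child in ast.iter_child_nodes(node):
            if isinstance(child, ast.Import):
                names = [alias.name for alias in child.names]
            elif isinstance(child, ast.ImportFrom):
                # Relative imports have level > 0; keep the dots so the
                # caller can resolve them against the importing package.
                names = ['.' * child.level + (child.module or '')]
            else:
                names = []
            (nested if in_def else module_level).extend(names)
            visit(child, in_def or isinstance(child, scopes))

    visit(tree, False)
    return module_level, nested
```

Imports inside module-level try/except or if blocks land in the first list, since they run at import time; only function and class bodies count as nested here.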
What caching could Mitogen (optionally) do? E.g.
- Controller - cache minified modules
- Parents - cache in-transit modules
- Targets - cache received modules
Considerations
- Cache invalidation
- Information leaks, e.g.
  - target A requests module top.secret.recipe.variant10, measures response time to tell whether a sibling has requested it or not
- Cache lifetime/eviction, e.g. per process? 1 day? Up to 10 MiB? Configurable?
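As a rough illustration of the lifetime/eviction question, the sketch below bounds a cache by total payload size and by entry age. Everything in it is hypothetical: ModuleCache, its parameters and the 10 MiB / 1 day defaults are taken from the examples above, not from any existing Mitogen API. It requires Python 3 (OrderedDict.move_to_end) and does nothing about invalidation or the timing side channel.

```python
import collections
import time


class ModuleCache(object):
    """Hypothetical size- and age-bounded cache of module payloads (bytes)."""

    def __init__(self, max_bytes=10 * 1024 * 1024, max_age=24 * 60 * 60,
                 clock=time.time):
        self.max_bytes = max_bytes
        self.max_age = max_age
        self.clock = clock
        self.size = 0
        # fullname -> (stored_at, payload), least recently used first.
        self._entries = collections.OrderedDict()

    def put(self, fullname, payload):
        self.discard(fullname)
        self._entries[fullname] = (self.clock(), payload)
        self.size += len(payload)
        # Evict least recently used entries until back under budget.
        while self.size > self.max_bytes and self._entries:
            _, (_, old) = self._entries.popitem(last=False)
            self.size -= len(old)

    def get(self, fullname):
        entry = self._entries.get(fullname)
        if entry is None:
            return None
        stored_at, payload = entry
        if self.clock() - stored_at > self.max_age:
            # Too old: treat as a miss and drop it.
            self.discard(fullname)
            return None
        self._entries.move_to_end(fullname)
        return payload

    def discard(self, fullname):
        entry = self._entries.pop(fullname, None)
        if entry is not None:
            self.size -= len(entry[1])
```

The clock argument exists only so the age limit can be exercised without sleeping.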
Using bytecode may be quicker than ast, if the bytecode can be read from a cached .pyc, rather than parsed each time from .py
issue #19: second attempt at import scanner
This version is based on the modulefinder standard library module, pruned back just to handle modules we know have been loaded already, and to scan module-level imports only, rather than imports occurring in class and function scope (crappy heuristic, but assume they are lazy imports).
The ast and compiler modules were far too slow, whereas this version can bytecode compile and scan all the imports for django.db.models (58 modules) in around 200ms.. 3.4ms per dependency, it's probably not going to get much faster than that. -- https://github.com/mitogen-hq/mitogen/commit/43ccbf045963867d8a94a4c3d526425d1c78c0c8, emphasis mine
Of note, imports inside try/except and with blocks are never imported lazily. Both structures commonly handle import-related errors: for example, when detecting available or compatible modules -- https://www.hudsonrivertrading.com/hrtbeat/inside-hrts-python-fork/
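For comparison with the bytecode approach described in the commit message, here is a small sketch that finds module-level imports by looking for IMPORT_NAME instructions. It is an approximation, not the code from that commit: it uses dis.get_instructions (Python 3.4+), the function name is made up, and compile() stands in for loading a code object from a cached .pyc.

```python
import dis


def scan_module_level_imports(source, filename='<module>'):
    """Return names targeted by IMPORT_NAME in the module's top-level code."""
    code = compile(source, filename, 'exec')
    # Function and class bodies are separate code objects stored in
    # code.co_consts. Not recursing into them is what skips imports
    # in function and class scope.
    return [
        instr.argval
        for instr in dis.get_instructions(code)
        if instr.opname == 'IMPORT_NAME'
    ]
```

Imports inside module-level try/except and with blocks are part of the module's own code object, so they are still reported, which fits the observation that such imports are not lazy.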
Signals/conditions on target host that may be of use
- Python 3, attempt to import a known Python 2 module name (#265)
- Import by a stdlib module, e.g. https://github.com/mitogen-hq/mitogen/blob/db63dd1def114b729d87966f76e013d1478264ab/mitogen/core.py#L1336-L1338
- Import of well-known platform specific module, e.g. msvcrt
- https://docs.python.org/3/library/sys.html#sys.builtin_module_names (at least 2.4+)
- https://docs.python.org/3/library/sys.html#sys.stdlib_module_names (3.10+)
- https://github.com/python/cpython/blob/main/Lib/_compat_pickle.py
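One of the signals listed above, a Python 3 child trying to import a known Python 2 module name, could be checked roughly as follows. This is a sketch under assumptions: py3_name_for is a made-up helper, _compat_pickle is a private CPython module whose IMPORT_MAPPING exists to support pickle compatibility, and it only covers modules that were renamed or merged, not every Python 2 module.

```python
import _compat_pickle


def py3_name_for(fullname):
    """Return the Python 3 name for a known Python 2 module name, else None."""
    # IMPORT_MAPPING maps Python 2 module names to their Python 3
    # equivalents, e.g. 'ConfigParser' -> 'configparser'.
    return _compat_pickle.IMPORT_MAPPING.get(fullname)
```

A child running Python 3 might use a hit here as a hint that the request will fail on the controller, or that the importing code has a Python 2 fallback branch.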
This package includes lists of all of the standard libraries for Python 2.6 through 3.14.
IMPORTANT: If you're on Python 3.10 or newer, you probably don't need this library. See sys.stdlib_module_names and sys.builtin_module_names for similar functionality. -- https://github.com/pypi/stdlib-list
foo being a builtin module does not imply foo.bar will also be builtin, at least since Python 3.12:
>>> import _imp, sys
>>> sys
<module 'sys' (built-in)>
>>> sys.monitoring
<module 'sys.monitoring'>
>>> _imp.is_builtin('sys'), _imp.is_builtin('sys.monitoring')
(-1, 0)
>>> 'sys' in sys.builtin_module_names, 'sys.monitoring' in sys.builtin_module_names
(True, False)
>>> sys.version_info
sys.version_info(major=3, minor=12, micro=12, releaselevel='final', serial=0)
Possible small speedup (benchmarked on CPython 3.14)
$ python -m pyperf timeit -s "import _imp" "_imp.is_builtin('sys')"
.....................
Mean +- std dev: 234 ns +- 2 ns
$ python -m pyperf timeit -s "import sys" "'sys' in sys.builtin_module_names"
.....................
Mean +- std dev: 591 ns +- 3 ns
$ python -m pyperf timeit -s "import sys; bmn=frozenset(sys.builtin_module_names)" "'sys' in bmn"
.....................
WARNING: the benchmark result may be unstable
* Not enough samples to get a stable result (95% certainly of less than 1% variation)
Try to rerun the benchmark with more runs, values and/or loops.
Run 'python -m pyperf system tune' command to reduce the system jitter.
Use pyperf stats, pyperf dump and pyperf hist to analyze results.
Use --quiet option to hide these warnings.
Mean +- std dev: 12.9 ns +- 0.4 ns
Idea: controller sends list of what it considers (top level) stdlib modules, so children can skip requesting them.
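A minimal sketch of that idea, assuming the list is simply a set of top-level names. Both function names are hypothetical, and how the list would actually be transmitted to children is left out. sys.stdlib_module_names only exists on Python 3.10+, so older controllers would fall back to builtin names alone or to something like stdlib-list.

```python
import sys


def controller_stdlib_names():
    """Controller side: top-level names considered stdlib or builtin."""
    names = set(sys.builtin_module_names)
    # Only available on Python 3.10+; contributes nothing on older versions.
    names.update(getattr(sys, 'stdlib_module_names', ()))
    return frozenset(names)


def should_request(fullname, skip_names):
    """Child side: False if the top-level package is on the skip list."""
    return fullname.partition('.')[0] not in skip_names
```

Matching on the top-level name sidesteps the sys versus sys.monitoring difference shown earlier. One caveat: the controller and child may run different Python versions, so the controller's idea of the stdlib will not always match what the child actually has.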
Presence/status of __main__ changed over time
$ ... python2.4 -c "import imp, sys; print('%r, %r' % ('__main__' in sys.builtin_module_names, imp.is_builtin('__main__')))"
WARNING: image platform (linux/amd64) does not match the expected platform (linux/arm64)
True, -1
$ ... python2.7 -c "import imp, sys; print('%r, %r' % ('__main__' in sys.builtin_module_names, imp.is_builtin('__main__')))"
WARNING: image platform (linux/amd64) does not match the expected platform (linux/arm64)
True, -1
$ python3.14 -c "import _imp as imp, sys; print('%r, %r' % ('__main__' in sys.builtin_module_names, imp.is_builtin('__main__')))"
False, 0