mitogen: Evolution of ModuleFinder and ModuleResponder

Open moreati opened this issue 4 months ago • 10 comments

mitogen.master.ModuleFinder locates Python modules on the controller that can be sent to children in response to a GET_MODULE request. It also scans the found module M to identify "related modules" - a dependency tree of modules imported by M, directly or indirectly.

mitogen.master.ModuleResponder sends the found module and all related modules, proactively.

Current bugs and quirks

  • False positives: sending related modules that are undesired, e.g. #1124
  • False negatives: missing related modules e.g. #682

Ideas so far, not mutually exclusive

  • Use ast to scan for import statements
    • Eliminates false negatives
    • Potentially increases false positives, depending on how we define desired criteria (e.g. import mitogen.master inside mitogen.main())
  • Remove related functionality, rely on children sending GET_MODULE requests as needed
    • Eliminates a chunk of fiddly code
    • Reduces CPU and random disk IO on controller
    • Increases number of round trips
    • May increase Time To First Execution (TTFE, a term I just made up) on the child
    • May increase maintenance burden wrt new Ansible releases, manually maintaining a preload list of required Ansible modules
    • May make Mitogen less robust wrt the long tail of Python packages and Ansible collections
  • Make related search/delivery configurable, e.g.
    • Something similar to # !mitogen: minify_safe markers
    • Environment variable(s)
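
A minimal sketch of the ast idea, not Mitogen code: scan_imports() is a hypothetical name. It records every import statement and flags whether it sits at module level or inside a function, class or lambda, following the "nested means lazy" heuristic. Imports inside try/except and with blocks at module level still count as module-level.

```python
import ast

# Scopes whose imports are assumed to be lazy (heuristic, not a guarantee).
_NESTED_SCOPES = (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef, ast.Lambda)


def scan_imports(source):
    """Return a list of (module_name, is_module_level) for each import."""
    tree = ast.parse(source)

    def visit(node, nested):
        for child in ast.iter_child_nodes(node):
            if isinstance(child, ast.Import):
                for alias in child.names:
                    yield alias.name, not nested
            elif isinstance(child, ast.ImportFrom):
                # Relative imports carry a level; module is None for "from . import x".
                yield "." * child.level + (child.module or ""), not nested
            else:
                inner = nested or isinstance(child, _NESTED_SCOPES)
                yield from visit(child, inner)

    return list(visit(tree, False))
```

Filtering on the second element gives the "desired" set under one possible definition; import mitogen.master inside a function would be reported with is_module_level=False rather than dropped, so the policy can be decided elsewhere.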

Refs #1242 #1324 #1239

moreati avatar Aug 08 '25 09:08 moreati

What caching could Mitogen (optionally) do? E.g.

  • Controller - cache minified modules
  • Parents - cache in-transit modules
  • Targets - cache received modules

Considerations

  • Cache invalidation
  • Information leaks, e.g.
    • target A requests module top.secret.recipe.variant10, measures response time to tell whether a sibling has requested it or not
  • Cache lifetime/eviction, e.g. process? 1 day? up to 10 MiB? Configurable?
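
To make the invalidation and eviction questions concrete, here is a rough sketch of a controller-side cache. ModuleCache and its methods are hypothetical names, nothing here exists in Mitogen. It keys entries on a digest of the original source, so an edited file misses the cache, and evicts least recently used entries once a byte budget is exceeded. It does nothing about the timing side channel.

```python
import collections
import hashlib


class ModuleCache:
    """Hypothetical LRU cache of minified sources, bounded by total bytes."""

    def __init__(self, max_bytes=10 * 1024 * 1024):
        self.max_bytes = max_bytes
        self.size = 0
        self._entries = collections.OrderedDict()

    @staticmethod
    def key(fullname, source):
        # source is bytes; a changed file yields a different digest.
        return fullname, hashlib.sha256(source).hexdigest()

    def get(self, key):
        try:
            self._entries.move_to_end(key)  # mark as most recently used
        except KeyError:
            return None
        return self._entries[key]

    def put(self, key, minified):
        old = self._entries.pop(key, None)
        if old is not None:
            self.size -= len(old)
        self._entries[key] = minified
        self.size += len(minified)
        # Evict oldest entries until the cache fits the byte budget again.
        while self.size > self.max_bytes and self._entries:
            _, evicted = self._entries.popitem(last=False)
            self.size -= len(evicted)
```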

moreati avatar Aug 08 '25 18:08 moreati

Using bytecode may be quicker than ast, if the bytecode can be read from a cached .pyc, rather than parsed each time from .py
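
One way to get a code object without always re-parsing, sketched for Python 3 only: importlib's SourceFileLoader.get_code() reuses a valid cached .pyc when one exists and compiles the .py otherwise. load_code() is a hypothetical helper name. Note that get_code() may also write a .pyc as a side effect, unless bytecode writing is disabled.

```python
import importlib.machinery


def load_code(fullname, path):
    """Return the module's code object, using the cached .pyc if it is fresh."""
    loader = importlib.machinery.SourceFileLoader(fullname, path)
    return loader.get_code(fullname)
```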

moreati avatar Aug 08 '25 19:08 moreati

issue #19: second attempt at import scanner

This version is based on the modulefinder standard library module, pruned back just to handle modules we know have been loaded already, and to scan module-level imports only, rather than imports occurring in class and function scope (crappy heuristic, but assume they are lazy imports).

The ast and compiler modules were far too slow, whereas this version can bytecode compile and scan all the imports for django.db.models (58 modules) in around 200ms.. 3.4ms per dependency, it's probably not going to get much faster than that. -- https://github.com/mitogen-hq/mitogen/commit/43ccbf045963867d8a94a4c3d526425d1c78c0c8, emphasis mine
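
For illustration, the module-level-only heuristic can be approximated on Python 3.4+ with dis. This is a sketch, not the code from that commit, and scan_module_level_imports() is a hypothetical name. dis.get_instructions() does not descend into nested code objects, so imports in function and class bodies are skipped, while imports inside module-level try/except blocks are kept.

```python
import dis


def scan_module_level_imports(source, filename="<string>"):
    """Return names imported by module-level code, in bytecode order."""
    code = compile(source, filename, "exec")
    names = []
    # Only the top-level code object is walked; function and class bodies
    # live in separate code objects and are deliberately ignored.
    for instr in dis.get_instructions(code):
        if instr.opname == "IMPORT_NAME":
            names.append(instr.argval)
    return names
```

compile() never executes or resolves the imports, so the scanned modules do not need to be installed.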

moreati avatar Aug 09 '25 08:08 moreati

Of note, imports inside try/except and with blocks are never imported lazily. Both structures commonly handle import-related errors: for example, when detecting available or compatible modules -- https://www.hudsonrivertrading.com/hrtbeat/inside-hrts-python-fork/

moreati avatar Aug 09 '25 08:08 moreati

Signals/conditions on target host that may be of use

  • Python 3, attempt to import a known Python 2 module name (#265)
  • Import by a stdlib module, e.g. https://github.com/mitogen-hq/mitogen/blob/db63dd1def114b729d87966f76e013d1478264ab/mitogen/core.py#L1336-L1338
  • Import of well-known platform specific module, e.g. msvcrt
  • https://docs.python.org/3/library/sys.html#sys.builtin_module_names (At least 2.4+)
  • https://docs.python.org/3/library/sys.html#sys.stdlib_module_names (3.10+)
  • https://github.com/python/cpython/blob/main/Lib/_compat_pickle.py
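
A rough sketch of how a child might combine some of these signals before sending GET_MODULE. classify() and _PY2_ONLY are hypothetical, and the Python 2 name list is a small hand-picked sample rather than a complete mapping (something like _compat_pickle could seed a fuller one). The full dotted name is checked against the builtin names, since a submodule of a builtin module is not necessarily builtin itself.

```python
import sys

_BUILTIN = frozenset(sys.builtin_module_names)
# sys.stdlib_module_names only exists on Python 3.10+; empty set otherwise.
_STDLIB = frozenset(getattr(sys, "stdlib_module_names", ()))
# Illustrative sample of module names that only exist on Python 2.
_PY2_ONLY = frozenset(["__builtin__", "cPickle", "ConfigParser", "Queue", "urllib2"])


def classify(fullname):
    """Return 'builtin', 'stdlib', 'python2-only' or 'unknown' for fullname."""
    top = fullname.partition(".")[0]
    if fullname in _BUILTIN:
        return "builtin"
    if top in _STDLIB:
        return "stdlib"
    if sys.version_info[0] >= 3 and top in _PY2_ONLY:
        return "python2-only"
    return "unknown"
```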

moreati avatar Nov 12 '25 13:11 moreati

This package includes lists of all of the standard libraries for Python 2.6 through 3.14.

IMPORTANT: If you're on Python 3.10 or newer, you probably don't need this library. See sys.stdlib_module_names and sys.builtin_module_names for similar functionality. -- https://github.com/pypi/stdlib-list

moreati avatar Nov 14 '25 11:11 moreati

foo being a builtin module does not imply foo.bar will also be builtin, at least since Python 3.12:

>>> import _imp, sys
>>> sys
<module 'sys' (built-in)>
>>> sys.monitoring
<module 'sys.monitoring'>
>>> _imp.is_builtin('sys'), _imp.is_builtin('sys.monitoring')
(-1, 0)
>>> 'sys' in sys.builtin_module_names, 'sys.monitoring' in sys.builtin_module_names
(True, False)
>>> sys.version_info
sys.version_info(major=3, minor=12, micro=12, releaselevel='final', serial=0)

moreati avatar Nov 14 '25 11:11 moreati

Possible small speedup (benchmarked on CPython 3.14)

$ python -m pyperf timeit -s "import _imp" "_imp.is_builtin('sys')"
.....................
Mean +- std dev: 234 ns +- 2 ns
$ python -m pyperf timeit -s "import sys" "'sys' in sys.builtin_module_names"
.....................
Mean +- std dev: 591 ns +- 3 ns
$ python -m pyperf timeit -s "import sys; bmn=frozenset(sys.builtin_module_names)" "'sys' in bmn"
.....................
WARNING: the benchmark result may be unstable
* Not enough samples to get a stable result (95% certainly of less than 1% variation)

Try to rerun the benchmark with more runs, values and/or loops.
Run 'python -m pyperf system tune' command to reduce the system jitter.
Use pyperf stats, pyperf dump and pyperf hist to analyze results.
Use --quiet option to hide these warnings.

Mean +- std dev: 12.9 ns +- 0.4 ns

moreati avatar Nov 14 '25 11:11 moreati

Idea: controller sends list of what it considers (top level) stdlib modules, so children can skip requesting them.
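
Roughly what the child side of that could look like, assuming the controller's list arrives as an iterable of top-level names. make_request_filter() is a hypothetical helper, and how the list is transmitted is left open.

```python
def make_request_filter(controller_stdlib):
    """Build a predicate telling the child whether to request a module."""
    skip = frozenset(controller_stdlib)

    def should_request(fullname):
        # Only the top-level package name is compared against the list.
        return fullname.partition(".")[0] not in skip

    return should_request
```

On Python 3.10+ the controller could populate the list from sys.stdlib_module_names; older controllers would need another source, such as the stdlib-list package.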

moreati avatar Nov 14 '25 12:11 moreati

The presence/status of __main__ as a builtin module has changed over time

$ ... python2.4 -c "import imp, sys; print('%r, %r' % ('__main__' in sys.builtin_module_names, imp.is_builtin('__main__')))"
WARNING: image platform (linux/amd64) does not match the expected platform (linux/arm64)
True, -1
$ ... python2.7 -c "import imp, sys; print('%r, %r' % ('__main__' in sys.builtin_module_names, imp.is_builtin('__main__')))"
WARNING: image platform (linux/amd64) does not match the expected platform (linux/arm64)
True, -1
$ python3.14 -c "import _imp as imp, sys; print('%r, %r' % ('__main__' in sys.builtin_module_names, imp.is_builtin('__main__')))"
False, 0

moreati avatar Nov 14 '25 12:11 moreati