[Feature] Python agent enforce exception fallback mechanism
Discussed in https://github.com/apache/skywalking/discussions/10774
Some plugins are prone to errors when libraries are frequently updated and when users did not verify them before deployment. A fallback mechanism should be enforced to make sure no matter in which case, user application should not fail due to a incompatible plugin or agent-internal exceptions. As an alternative way of warning, we should output error logs instead of interruption.
Originally posted by Mushano May 8, 2023
Before Adding Python agent
the service can be used normally
after adding Python agent(whatever 0.8 or 1.0)
any interface related to database queries will report an error like this.

View
It seems that the implantation of agent introduces new risks to the service beyond performance.
Wish
I really want to know Is there a way to avoid the risks posed by agents, such as automatically shutting down agent collection and sending data when detecting errors in the code executed by the agent, so as to ensure the normal operation of the original service.
Sincerely
I have come up with two solutions to solve this problem.
The first is to use a wider range of try-exceptions, covering all plugins and even the entire agent. The advantage of doing this is that it cannot interrupt any work and the code changes are minimal. However, when an internal error occurs within the agent, try-exception may be continuously triggered, which in turn significantly affects the user's code performance.
The second is to set up a hot plug mechanism for plugins, which allows agents to uninstall plugins at runtime. For example, when an agent's internal error occurs multiple times, the agent will automatically disable a specific plugin. This solution seems more reasonable because when an error occurs, we should try to ensure that the user's code runs as usual, even if the Python agent is no longer working. However, this requires significant code changes - almost all plugins require changes. And we also need to confirm whether some accidental and independent errors will cause the entire agent to not work due to this change.
But looking solely at the issue, I think there is one more point worth mentioning.
Currently, the testing ensures the availability of each plugin running separately, but we cannot guarantee whether there will be conflicts between them. I think it is unreasonable for Python agents to default that each plugin is enabled at runtime, as it is almost impossible for users to require the functionality of so many plugins at the same time, and we cannot guarantee whether there will be conflicts between plugins.
In fact, we can dynamically enable plugins on demand, such as using Python's built-in libraries sys.modules or modulefinder. We can detect imported libraries in user code and enable related plugins, while other plugins are not used by default. And we still need to retain the configuration options for manually configuring plugins by users to cope with special situations.
Of course, enabling plugins on demand may not solve this issue. If users use both mysqlclient and pymysql in their code, the agent may still need to consider an exception rollback mechanism to completely solve this problem; But considering that such scenarios are rare and users may not actually know how to configure plugins, I believe that enabling Python agent plugins is still worth optimizing
The hot un-plug-in feature is reasonable. It's about replacing the reference of the monkey patched method back into the original, where we already have that information.
I believe almost all unexpected things are introduced due to the incorrect assumption of typing in plugins that changes throughout their releases. The simple and most effective way is what you mentioned as adding try catch blocks to cover the entire plugin flow (whenever we extract info out of any object). Whenever exception happens, output an error log and unplug the plugin (if exception is thrown many times).
About the conflict of two plugins, it's still catchable by the try-catch all mechanism. It's probably due to one lib actually modifies the other libs objects and leading to the type casting error. The basis is that user already know that they use these two libs together and therefore we can assume it's safe, as long as we add in the try catch mechanism, everything would be fine.
In fact, we can dynamically enable plugins on demand, such as using Python's built-in libraries sys.modules or modulefinder. We can detect imported libraries in user code and enable related plugins, while other plugins are not used by default.
The above mechanism is already there, we installed everything that is importable in user code, other plugins are not installed.
About the conflict of two plugins, it's still catchable by the try-catch all mechanism. It's probably due to one lib actually modifies the other libs objects and leading to the type casting error. The basis is that user already know that they use these two libs together and therefore we can assume it's safe, as long as we add in the try catch mechanism, everything would be fine.
In fact, we can dynamically enable plugins on demand, such as using Python's built-in libraries sys.modules or modulefinder. We can detect imported libraries in user code and enable related plugins, while other plugins are not used by default.
The above mechanism is already there, we installed everything that is importable in user code, other plugins are not installed.
Maybe I made a mistake. But in the source code, agent uses pkgutil.iter_modules to install all plugins in plugin package
def install():
disable_patterns = config.agent_disable_plugins
if isinstance(disable_patterns, str):
disable_patterns = [re.compile(p.strip()) for p in disable_patterns.split(',') if p.strip()]
else:
disable_patterns = [re.compile(p.strip()) for p in disable_patterns if p.strip()]
for importer, modname, _ispkg in pkgutil.iter_modules(skywalking.plugins.__path__):
if any(pattern.match(modname) for pattern in disable_patterns):
logger.info("plugin %s is disabled and thus won't be installed", modname)
continue
logger.debug('installing plugin %s', modname)
plugin = importer.find_module(modname).load_module(modname)
# todo: refactor the version checker, currently it doesn't really work as intended
supported = pkg_version_check(plugin)
if not supported:
logger.debug("check version for plugin %s's corresponding package failed, thus "
"won't be installed", modname)
continue
if not hasattr(plugin, 'install') or inspect.ismethod(plugin.install):
logger.warning("no `install` method in plugin %s, thus the plugin won't be installed", modname)
continue
# noinspection PyBroadException
try:
plugin.install()
logger.debug('Successfully installed plugin %s', modname)
except Exception:
logger.warning(
'Plugin %s failed to install, please ignore this warning '
'if the package is not used in your application.',
modname
)
traceback.print_exc() if logger.isEnabledFor(logging.DEBUG) else None
In the install() method, only user-configured plugins are filtered, and The pkg_version_check() function also has no practical purpose at the moment.
The try-except block may catch ModuleNotFoundError but it also cannot guarantee that plugins will be imported on demand. If the corresponding package has been installed in the user's python environment, but not used in the code, the corresponding plugin will still be loaded.
Did I overlook something?