opentelemetry-python
Clarify support for forking process models
https://opentelemetry-python.readthedocs.io/en/latest/examples/fork-process-model/README.html indicates that BatchSpanProcessor does not support forking process models, including those used by Gunicorn and uWSGI. However, those docs (https://github.com/open-telemetry/opentelemetry-python/pull/1609) pre-date other changes which appear to have added support for forking process models, including:
- https://github.com/open-telemetry/opentelemetry-python/pull/2242, and
- https://github.com/open-telemetry/opentelemetry-python/pull/2277
which seem specifically targeted at resolving the described issues.
What is the current expected state of support for the various forking models? Either way, updating the docs to clarify the actual situation would be appreciated.
Possibly related issues:
- https://github.com/open-telemetry/opentelemetry-python/issues/2767
- https://github.com/open-telemetry/opentelemetry-python/issues/2837
- https://github.com/open-telemetry/opentelemetry-python-contrib/issues/675
- https://github.com/open-telemetry/opentelemetry-python-contrib/issues/676
Thanks for opening this issue, I think it is an important topic.
Afaik, things should be working for tracing at this point. Metrics are a little more complicated and #2837 is definitely still valid. It would be great if we could clean up old issues/documentation and would definitely appreciate help on this if you have the time @PeterJCLaw.
@srikanthccv last worked on this, do you know if any of the linked bugs can be closed out now?
- I think the 4th linked issue can be closed; however, it's better to verify once before closing.
- For uWSGI, the tracing/logging SDK should continue to work (see the sketch after this list).
- The exporter mentioned in #2837 is Jaeger Thrift, which is planned to be removed in July, but if someone is interested in getting it fixed, they are welcome.
- For #2767 I am not sure if there is a solution. Last time I checked, uWSGI still doesn't invoke the hooks we rely on.
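A minimal sketch of the logging-SDK setup that should keep working under a forking server (note that the logs SDK still lives under the experimental `_logs` modules at the time of writing, so the exact import paths may change):

```python
import logging

from opentelemetry._logs import set_logger_provider
from opentelemetry.sdk._logs import LoggerProvider, LoggingHandler
from opentelemetry.sdk._logs.export import BatchLogRecordProcessor, ConsoleLogExporter

# Batch-exporting log processor, set up analogously to the tracing SDK.
logger_provider = LoggerProvider()
logger_provider.add_log_record_processor(BatchLogRecordProcessor(ConsoleLogExporter()))
set_logger_provider(logger_provider)

# Route stdlib logging through the OpenTelemetry logs SDK.
logging.getLogger().addHandler(LoggingHandler(logger_provider=logger_provider))
logging.getLogger(__name__).warning("hello from a worker process")
```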
Thanks for those additional details.
> I think the 4th linked issue can be closed; however, it's better to verify once before closing.
What does verification look like here? Given the nature of this as a forking issue I'd kinda assumed the failure mode was race conditions and the like -- timing-based and hard to reproduce or be sure are fixed just from testing. Is there a known reproduction of the failure mode, or something we can do to check that it's working as intended? (Or is it a case of code inspection?)
The verification can be pretty simple. If you run a simple gunicorn-based application with multiple workers (without following the special instructions mentioned in the docs), it should export the traces/metrics/logs. If they are not exported (to the collector, console, etc.), then the issue still exists.
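Something along these lines is what I mean (a minimal sketch; Flask and the console exporter are arbitrary choices here, any WSGI app and exporter would do):

```python
# app.py -- minimal app for checking that spans are exported from forked workers.
# Run with e.g. `gunicorn app:app --workers 4` (no --preload, no post_fork hook)
# and check that spans show up for requests handled by every worker.
from flask import Flask

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Deliberately configured at import time, since that is the scenario being verified.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)

app = Flask(__name__)


@app.route("/")
def index():
    with tracer.start_as_current_span("handle-index"):
        return "ok"
```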
Hrm, ok. I'm currently running gunicorn with the default worker type and that does seem to work. My reading of gunicorn's docs indicates that the master process doesn't actually load the application though -- only the workers do (after the fork happens), unless you use the --preload option (which I'm currently not doing).
I'll see if I can find some time to validate the behaviours under gunicorn & uWSGI.
Edit for clarity: my use of gunicorn without --preload works fine -- I get tracing reported even though I'm using BatchSpanProcessor.
I am confused now. Is it working or not working with gunicorn?
My current setup is using gunicorn without calling --preload and that seems to work.
Not using --preload under gunicorn seems to be the default, though the OpenTelemetry docs seem to imply that they're expecting the inverse -- that gunicorn will preload before forking.
I've just tested a toy application under gunicorn and uWSGI, using both their --preload and --lazy-apps modes (and not) respectively, using https://gist.github.com/PeterJCLaw/5e6fd585d0c620496c82ca926ce50f67. All four configurations seem to work ok -- I get traces printed out to the console which seem reasonable.
I'm not sure how to validate that the desired batching is happening, though I can see that the traces are often not printed until several requests later, which suggests that it is working.
> I've just tested a toy application under gunicorn and uWSGI, using both their --preload and --lazy-apps modes (and not) respectively, using https://gist.github.com/PeterJCLaw/5e6fd585d0c620496c82ca926ce50f67. All four configurations seem to work ok -- I get traces printed out to the console which seem reasonable.
Thank you, this confirms the above statement that traces and logs batch processors work all the time.
> I'm not sure how to validate that the desired batching is happening, though I can see that the traces are often not printed until several requests later, which suggests that it is working.
They are emitted when the queue becomes full or the batch timeout elapses. I think this is working fine.
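If you want to make the batching more visible, the relevant knobs can be passed to the processor directly (the values below are what I believe the defaults to be; they can also be set via the OTEL_BSP_* environment variables):

```python
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(
        ConsoleSpanExporter(),
        max_queue_size=2048,         # spans buffered before new ones are dropped
        schedule_delay_millis=5000,  # how long a batch may wait before being exported
        max_export_batch_size=512,   # maximum spans exported in a single batch
    )
)
```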
Unfortunately while it looks like the BatchSpanProcessor does actually work fine, the PeriodicExportingMetricReader appears not to. I've not tested it under all the various configurations previously discussed, however I have observed it not working correctly in uWSGI's default mode (i.e. where it imports the WSGI app before forking). In that case the ticker thread only exists in the main process, presumably meaning that metrics are only collected from that main process. I'd guess this is due to PeriodicExportingMetricReader using the approach from https://github.com/open-telemetry/opentelemetry-python/pull/2242 rather than the enhanced approach introduced in https://github.com/open-telemetry/opentelemetry-python/pull/2277.
Edit: using uWSGI's --lazy-apps to delay the import does get PeriodicExportingMetricReader running in all the worker processes.
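For context, the metrics setup I was testing is roughly this shape (a sketch; the console exporter stands in for whatever exporter is actually in use):

```python
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import (
    ConsoleMetricExporter,
    PeriodicExportingMetricReader,
)

# The reader starts a background "ticker" thread which collects and exports
# metrics every export_interval_millis. If this runs before the fork (uWSGI's
# default import behaviour), that thread only exists in the master process.
reader = PeriodicExportingMetricReader(
    ConsoleMetricExporter(),
    export_interval_millis=10_000,
)
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))

meter = metrics.get_meter(__name__)
request_counter = meter.create_counter("requests")
```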
Hrm, the uWSGI side of this may be somewhat resolvable by users. Ish. Via https://stackoverflow.com/a/72601192 and https://github.com/unbit/uwsgi/pull/2388 it seems that uWSGI has some extra options which can be used to enable os.register_at_fork to work -- specifically --py-call-osafterfork. There's also --enable-threads, which seems to be needed to enable workers to have threads. Unfortunately in local testing I'm seeing that using those two options together results in uWSGI not serving any requests (at all), while using just --py-call-osafterfork seems to lock up one of the two workers I'm running with. Not sure yet if this is a uWSGI issue generally (it definitely feels like it could be) or something to do with threads. I doubt it's OpenTelemetry-specific. Posting here as hopefully useful to others hitting this.
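For anyone unfamiliar with the mechanism involved: the fork support relies on the standard-library hook sketched below, which uWSGI only invokes when --py-call-osafterfork is set. This is a simplified illustration, not the SDK's actual code:

```python
import os
import threading


def _restart_worker_thread():
    # Illustrative only: the SDK's equivalent hooks re-create the background
    # threads used by the batch processors / metric reader in each child.
    threading.Thread(target=lambda: None, daemon=True).start()


# Registered in the parent; Python calls it in each child after fork() -- but
# only if the forking server actually triggers Python-level fork hooks.
os.register_at_fork(after_in_child=_restart_worker_thread)
```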
+1 to updating the docs here since BatchSpanProcessor is working well with Gunicorn. It cost me more than a few hours trying to get the post_fork hook to work (which didn't end up working) when it wasn't necessary.
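For reference, this is roughly the post_fork-based workaround the docs describe (a sketch along the lines of the documented example, with the console exporter as a placeholder) -- which, per the above, shouldn't be needed for tracing:

```python
# gunicorn.conf.py -- sketch of the documented post_fork approach
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter


def post_fork(server, worker):
    server.log.info("Worker spawned (pid: %s)", worker.pid)

    # Initialise the SDK only after the fork, so each worker owns its own
    # exporter thread.
    provider = TracerProvider()
    provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
    trace.set_tracer_provider(provider)
```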