opentelemetry-python
Clarify support for forking process models
https://opentelemetry-python.readthedocs.io/en/latest/examples/fork-process-model/README.html indicates that BatchSpanProcessor does not support forking process models, including those used by Gunicorn and uWSGI. However, those docs (https://github.com/open-telemetry/opentelemetry-python/pull/1609) pre-date other changes which appear to have added support for forking process models, including:
- https://github.com/open-telemetry/opentelemetry-python/pull/2242, and
- https://github.com/open-telemetry/opentelemetry-python/pull/2277
which seem specifically targeted at resolving the described issues.
What is the current expected state of support for the various forking models? Either way, updating the docs to clarify the actual situation would be appreciated.
Possibly related issues:
- https://github.com/open-telemetry/opentelemetry-python/issues/2767
- https://github.com/open-telemetry/opentelemetry-python/issues/2837
- https://github.com/open-telemetry/opentelemetry-python-contrib/issues/675
- https://github.com/open-telemetry/opentelemetry-python-contrib/issues/676
Thanks for opening this issue, I think it is an important topic.
Afaik, things should be working for tracing at this point. Metrics are a little more complicated and #2837 is definitely still valid. It would be great if we could clean up old issues/documentation and would definitely appreciate help on this if you have the time @PeterJCLaw.
@srikanthccv last worked on this, do you know if any of the linked bugs can be closed out now?
- I think the 4th linked issue can be closed; however, it's better to verify once before closing.
- For uWSGI, the tracing/logging SDK should continue to work (see the sketch after this list).
- The exporter mentioned in #2837 is Jaeger Thrift, which is planned to be removed in July, but if someone is interested in getting it fixed, they are welcome.
- For #2767 I am not sure if there is a solution. Last time I checked, uWSGI still doesn't invoke the hooks we rely on.
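A minimal sketch of the logging-SDK setup that should keep working under a forking server (note that the logs SDK still lives under the experimental `_logs` modules at the time of writing, so the exact import paths may change):

```python
import logging

from opentelemetry._logs import set_logger_provider
from opentelemetry.sdk._logs import LoggerProvider, LoggingHandler
from opentelemetry.sdk._logs.export import BatchLogRecordProcessor, ConsoleLogExporter

# Batch-exporting log processor, set up analogously to the tracing SDK.
logger_provider = LoggerProvider()
logger_provider.add_log_record_processor(BatchLogRecordProcessor(ConsoleLogExporter()))
set_logger_provider(logger_provider)

# Route stdlib logging through the OpenTelemetry logs SDK.
logging.getLogger().addHandler(LoggingHandler(logger_provider=logger_provider))
logging.getLogger(__name__).warning("hello from a worker process")
```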
Thanks for those additional details.
> I think the 4th linked issue can be closed; however, it's better to verify once before closing.
What does verification look like here? Given the nature of this as a forking issue I'd kinda assumed the failure mode was race conditions and the like -- timing-based and hard to reproduce or be sure are fixed just from testing. Is there a known reproduction of the failure mode, or something we can do to check that it's working as intended? (Or is it a case of code inspection?)
The verification can be pretty simple. If you run a simple gunicorn-based application with multiple workers (without following the special instructions mentioned in the docs), it should export the traces/metrics/logs. If they are not exported (to the collector, console, etc.), then the issue still exists.
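Something along these lines is what I mean (a minimal sketch; Flask and the console exporter are arbitrary choices here, any WSGI app and exporter would do):

```python
# app.py -- minimal app for checking that spans are exported from forked workers.
# Run with e.g. `gunicorn app:app --workers 4` (no --preload, no post_fork hook)
# and check that spans show up for requests handled by every worker.
from flask import Flask

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Deliberately configured at import time, since that is the scenario being verified.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)

app = Flask(__name__)


@app.route("/")
def index():
    with tracer.start_as_current_span("handle-index"):
        return "ok"
```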
Hrm, ok. I'm currently running gunicorn with the default worker type and that does seem to work. My reading of gunicorn's docs indicates that the master process doesn't actually load the application though -- only the workers do (after the fork happens), unless you use the --preload option (which I'm currently not doing).
I'll see if I can find some time to validate the behaviours under gunicorn & uWSGI.
Edit for clarity: my use of gunicorn without --preload works fine -- I get tracing reported even though I'm using BatchSpanProcessor.
I am confused now. Is it working or not working with gunicorn?
My current setup is using gunicorn without calling --preload and that seems to work.
Not using --preload under gunicorn seems to be the default, though the OpenTelemetry docs seem to imply that they're expecting the inverse -- that gunicorn will preload before forking.
I've just tested a toy application under gunicorn and uWSGI, using both their --preload and --lazy-apps modes (and not) respectively, using https://gist.github.com/PeterJCLaw/5e6fd585d0c620496c82ca926ce50f67. All four configurations seem to work ok -- I get traces printed out to the console which seem reasonable.
I'm not sure how to validate that the desired batching is happening, though I can see that the traces are often not printed until several requests later, which suggests that it is working.
> I've just tested a toy application under gunicorn and uWSGI, using both their --preload and --lazy-apps modes (and not) respectively, using https://gist.github.com/PeterJCLaw/5e6fd585d0c620496c82ca926ce50f67. All four configurations seem to work ok -- I get traces printed out to the console which seem reasonable.
Thank you, this confirms the above statement that traces and logs batch processors work all the time.
> I'm not sure how to validate that the desired batching is happening, though I can see that the traces are often not printed until several requests later, which suggests that it is working.
They are emitted when the queue becomes full or the batch timeout elapses. I think this is working fine.
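If you want to make the batching more visible, the relevant knobs can be passed to the processor directly (the values below are what I believe the defaults to be; they can also be set via the OTEL_BSP_* environment variables):

```python
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(
        ConsoleSpanExporter(),
        max_queue_size=2048,         # spans buffered before new ones are dropped
        schedule_delay_millis=5000,  # how long a batch may wait before being exported
        max_export_batch_size=512,   # maximum spans exported in a single batch
    )
)
```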
Unfortunately while it looks like the BatchSpanProcessor does actually work fine, the PeriodicExportingMetricReader appears not to. I've not tested it under all the various configurations previously discussed, however I have observed it not working correctly in uWSGI's default mode (i.e. where it imports the WSGI app before forking). In that case the ticker thread only exists in the main process, presumably meaning that metrics are only collected from that main process. I'd guess this is due to PeriodicExportingMetricReader using the approach from https://github.com/open-telemetry/opentelemetry-python/pull/2242 rather than the enhanced approach introduced in https://github.com/open-telemetry/opentelemetry-python/pull/2277.
Edit: using uWSGI's --lazy-apps to delay the import does get PeriodicExportingMetricReader running in all the worker processes.
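For context, the metrics setup I was testing is roughly this shape (a sketch; the console exporter stands in for whatever exporter is actually in use):

```python
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import (
    ConsoleMetricExporter,
    PeriodicExportingMetricReader,
)

# The reader starts a background "ticker" thread which collects and exports
# metrics every export_interval_millis. If this runs before the fork (uWSGI's
# default import behaviour), that thread only exists in the master process.
reader = PeriodicExportingMetricReader(
    ConsoleMetricExporter(),
    export_interval_millis=10_000,
)
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))

meter = metrics.get_meter(__name__)
request_counter = meter.create_counter("requests")
```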
Hrm, the uWSGI side of this may be somewhat resolvable by users. Ish. Via https://stackoverflow.com/a/72601192 and https://github.com/unbit/uwsgi/pull/2388 it seems that uWSGI has some extra options which can be used to enable os.register_at_fork to work -- specifically --py-call-osafterfork. There's also --enable-threads, which seems to be needed to enable workers to have threads. Unfortunately in local testing I'm seeing that using those two options together results in uWSGI not serving any requests (at all), while using just --py-call-osafterfork seems to lock up one of the two workers I'm running with. Not sure yet if this is a uWSGI issue generally (it definitely feels like it could be) or something to do with threads. I doubt it's OpenTelemetry-specific. Posting here as hopefully useful to others hitting this.
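For anyone unfamiliar with the mechanism involved: the fork support relies on the standard-library hook sketched below, which uWSGI only invokes when --py-call-osafterfork is set. This is a simplified illustration, not the SDK's actual code:

```python
import os
import threading


def _restart_worker_thread():
    # Illustrative only: the SDK's equivalent hooks re-create the background
    # threads used by the batch processors / metric reader in each child.
    threading.Thread(target=lambda: None, daemon=True).start()


# Registered in the parent; Python calls it in each child after fork() -- but
# only if the forking server actually triggers Python-level fork hooks.
os.register_at_fork(after_in_child=_restart_worker_thread)
```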
+1 to updating the docs here since BatchSpanProcessor is working well with Gunicorn. It cost me more than a few hours trying to get the post_fork hook to work (which didn't end up working) when it wasn't necessary.
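For reference, this is roughly the post_fork-based workaround the docs describe (a sketch along the lines of the documented example, with the console exporter as a placeholder) -- which, per the above, shouldn't be needed for tracing:

```python
# gunicorn.conf.py -- sketch of the documented post_fork approach
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter


def post_fork(server, worker):
    server.log.info("Worker spawned (pid: %s)", worker.pid)

    # Initialise the SDK only after the fork, so each worker owns its own
    # exporter thread.
    provider = TracerProvider()
    provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
    trace.set_tracer_provider(provider)
```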