opentelemetry-python

uwsgi + flask, got Respawned uWSGI worker, very frequently

Open · Liubey opened this issue 5 months ago • 12 comments

Describe your environment
Describe any aspect of your environment relevant to the problem, including your Python version, platform, version numbers of installed dependencies, information about your cloud hosting provider, etc. If you're reporting a problem with a specific version of a library in this repo, please check whether the problem has been fixed on main.

Steps to reproduce
I use code like: https://github.com/open-telemetry/opentelemetry-python/blob/main/docs/examples/fork-process-model/flask-uwsgi/app.py

What is the expected behavior?
Spans are reported correctly.

What is the actual behavior?
Logs like:

..The work of process 534 is done. Seeya!
Mon Jan 22 17:17:43 2024 - worker 4 (pid: 449) is taking too much time to die...NO MERCY !!!
worker 4 killed successfully (pid: 449)
Respawned uWSGI worker 4 (new pid: 568)

very, very frequently. This can't be normal, but uwsgi gives no further logs. The config is:

master = true
enable-threads = True
harakiri-graceful-timeout = 5
threads = 2
processes = 4


Liubey avatar Jan 22 '24 09:01 Liubey

Update: with threads = 1 it is OK, but with threads = 2 the ninth request hangs until the worker is respawned.

Liubey avatar Jan 22 '24 09:01 Liubey

Same code, but when I replace OTLPSpanExporter(endpoint="http://localhost:6831", insecure=True) with thrift.JaegerExporter, everything is OK. It's weird.

Liubey avatar Jan 24 '24 08:01 Liubey

@Liubey

What commands are you running to run the example? Is there a specific process runner you are using?

lzchen avatar Jan 25 '24 18:01 lzchen

@lzchen

uwsgi --http :8000 --wsgi-file app.py --callable application --master --enable-threads --threads 2 --processes 4

It's weird, because replacing OTLPSpanExporter with JaegerExporter works fine. I don't know why OTLPSpanExporter doesn't work under uwsgi multithreading.

Liubey avatar Jan 26 '24 01:01 Liubey

Try --lazy-apps
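
For example, an invocation mirroring the command used above (a sketch; with --lazy-apps each worker loads the application after fork, so process-local state such as the exporter's gRPC channel is created in the correct process):

```shell
# Load the app in each worker *after* fork instead of once in the master.
uwsgi --http :8000 --wsgi-file app.py --callable application \
      --master --enable-threads --threads 2 --processes 4 --lazy-apps
```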

methane avatar Feb 15 '24 11:02 methane

It's not easy to change the base config of a mature business project 😂

Liubey avatar Mar 04 '24 02:03 Liubey

Is it harder than making the whole otel-python[-contrib] fork-safe?

methane avatar Mar 04 '24 05:03 methane

😭you are right. I will try to research and drive it.

Liubey avatar Mar 06 '24 07:03 Liubey

FYI, ddtrace-py also requires lazy-apps.

https://github.com/DataDog/dd-trace-py/blob/8b11b36d29ee4cda1a6000d10fd74208bbb5f39d/ddtrace/internal/uwsgi.py#L35

methane avatar Mar 06 '24 08:03 methane

@methane Oh, thanks for the additional information. But the strange thing is, I have two projects; one uses django + uwsgi + OTLPSpanExporter, and it works fine without lazy-apps set.

Liubey avatar Mar 06 '24 08:03 Liubey

lazy-apps is the most straightforward, but there are other options to avoid forking after the app is loaded.

methane avatar Mar 06 '24 08:03 methane

@Liubey Do you read and follow this?

https://github.com/open-telemetry/opentelemetry-python/tree/main/docs/examples/fork-process-model#working-with-fork-process-models

methane avatar Mar 06 '24 08:03 methane

@methane Yes, I saw that before. In my uwsgi + flask project I tested with @postfork, and it still doesn't work; like I said, I "got Respawned uWSGI worker, very frequently".
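
The idea behind @postfork can be sketched with the stdlib alone (an illustration, not the uwsgi API: os.register_at_fork gives the same "run this in each child after fork" hook that uwsgidecorators.postfork provides, and the hypothetical init_tracing below stands in for per-worker TracerProvider setup):

```python
import os

state = {"pid": None}

def init_tracing():
    # In a real app this is where the TracerProvider / span processor
    # would be created, so each worker owns its exporter threads.
    state["pid"] = os.getpid()

# Run init_tracing() in every forked child (POSIX only).
os.register_at_fork(after_in_child=init_tracing)

child = os.fork()
if child == 0:
    # Child: the hook already ran with this process's pid.
    os._exit(0 if state["pid"] == os.getpid() else 1)
_, status = os.waitpid(child, 0)
hook_ran_in_child = os.waitstatus_to_exitcode(status) == 0
print(hook_ran_in_child)  # True
```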

Liubey avatar Mar 07 '24 02:03 Liubey

I use code like: https://github.com/open-telemetry/opentelemetry-python/blob/main/docs/examples/fork-process-model/flask-uwsgi/app.py

"like"? Did you reproduce it with the exact code?

Unless there is a reproducible step, all efforts to investigate will be in vain.

methane avatar Mar 07 '24 03:03 methane

@methane Yes. With that code I get the same thing when I wait a few minutes after server startup and then send requests (screenshot). The key settings are max-requests and threads.

Liubey avatar Mar 07 '24 08:03 Liubey

Maybe the key setting is `max-requests`; when I remove it, things look normal.

Liubey avatar Mar 07 '24 08:03 Liubey

When I set `max-requests=50` AND `processes=4`, requests 201~204 hang for 50+ seconds (screenshot), and I get "Respawned uWSGI worker" for all 4 processes (screenshot).

Liubey avatar Mar 07 '24 09:03 Liubey

@jpkrohling @methane @lzchen Do you have any ideas about this? Sorry to disturb you.

Liubey avatar Mar 07 '24 09:03 Liubey

The command is: uwsgi --http :8000 --wsgi-file app.py --callable application --master --enable-threads --threads 2 --processes 4 --max-requests 50

Liubey avatar Mar 07 '24 09:03 Liubey

OK. I got it.

  • BatchSpanProcessor uses a daemon thread, so BatchSpanProcessor.shutdown() is needed to stop it.

https://github.com/open-telemetry/opentelemetry-python/blob/8ad10f7d8fd59be49f3d2e683824ff921c0781f8/opentelemetry-sdk/src/opentelemetry/sdk/trace/export/__init__.py#L189

  • TracerProvider uses atexit to call shutdown(), and its shutdown calls BatchSpanProcessor.shutdown().

https://github.com/open-telemetry/opentelemetry-python/blob/8ad10f7d8fd59be49f3d2e683824ff921c0781f8/opentelemetry-sdk/src/opentelemetry/sdk/trace/__init__.py#L1181-L1182

  • But uwsgi may not invoke atexit, so the daemon thread is not stopped and the Python process keeps running.

https://github.com/unbit/uwsgi/issues/2063
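
The mechanism above can be reduced to a stdlib sketch (MiniBatchProcessor is a hypothetical stand-in, not the real SDK class): a daemon worker thread plus an atexit-registered shutdown(). If atexit never fires, as in the linked uwsgi issue, nothing ever joins the worker:

```python
import atexit
import queue
import threading

class MiniBatchProcessor:
    """Toy stand-in for BatchSpanProcessor: daemon worker + atexit shutdown."""

    def __init__(self):
        self._queue = queue.Queue()
        self._stop = threading.Event()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()
        atexit.register(self.shutdown)  # mirrors TracerProvider's atexit hook

    def _run(self):
        while not self._stop.is_set():
            try:
                self._queue.get(timeout=0.1)  # a real processor exports here
            except queue.Empty:
                pass

    def shutdown(self):
        self._stop.set()
        self._worker.join()

proc = MiniBatchProcessor()
proc.shutdown()  # without this call (or atexit), the daemon worker lingers
print(proc._worker.is_alive())  # False
```

Under uwsgi, one workaround sometimes used is calling the provider's shutdown() yourself, for example via uwsgi's own hook (`uwsgi.atexit = provider.shutdown`), instead of relying on Python's atexit; whether that fires depends on the uwsgi version.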

methane avatar Mar 07 '24 13:03 methane

@methane Thank you very much for your careful analysis. This is unbelievable. Is there an easy way to bypass the bug, such as patching the BatchSpanProcessor thread to daemon=False?

Liubey avatar Mar 08 '24 01:03 Liubey

@methane Oh, sorry, I should mention: when I use JaegerExporter I also use BatchSpanProcessor, so I think the key is in OTLPSpanExporter's shutdown or export?

Liubey avatar Mar 08 '24 02:03 Liubey

Tested with http.OTLPSpanExporter: looks normal, no hanging (screenshot).

Liubey avatar Mar 08 '24 03:03 Liubey

I can reproduce it with http.OTLPSpanExporter. I haven't tested JaegerExporter yet.

The grpc exporter's shutdown is not related, because it doesn't use a daemon thread.

methane avatar Mar 08 '24 04:03 methane

@methane Maybe the key thing is grpc, because thrift.JaegerExporter is OK while grpc.JaegerExporter is not.

I found some discussion on the grpc repo, see https://github.com/grpc/grpc/issues/23796#issuecomment-1028073470 where numbsafari commented on Feb 2, 2022: "Hi, 2022 checking in. It appears that gRPC doesn't work with any kind of pre-fork (e.g. uWSGI) model system unless you specify the environment variables noted above."

Perhaps a solution would be to add some documentation?
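
The environment variables referenced in that grpc issue are gRPC's fork-support switches; a sketch of setting them before starting uwsgi (whether they are sufficient depends on the grpcio version in use):

```shell
# gRPC fork-support knobs from the linked grpc/grpc discussion.
export GRPC_ENABLE_FORK_SUPPORT=true
export GRPC_POLL_STRATEGY=poll
uwsgi --http :8000 --wsgi-file app.py --callable application \
      --master --enable-threads --threads 2 --processes 4
```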

Liubey avatar Mar 08 '24 05:03 Liubey

That is a separate issue: a fork-safety problem. --lazy-apps (recommended) or @postfork can avoid it.

methane avatar Mar 08 '24 07:03 methane

I made a pull request to uwsgi that fixes the missing atexit call when mixing threads and max-requests.

methane avatar Mar 10 '24 10:03 methane

@methane Thank you so much. Hope uwsgi could merge the request quickly.😂

Liubey avatar Mar 12 '24 05:03 Liubey

UPDATE: This is not caused by BatchSpanProcessor; Python doesn't wait for daemon threads at all. I'm sorry about that.

This is caused by uwsgi. uwsgi uses pthread_cancel() to stop a worker even for graceful shutdown. That is a really bad idea: pthread_cancel() is almost impossible to use safely, and I don't think either uwsgi or Python is cancel-safe.

For the record, I can reproduce it easily even if I remove all opentelemetry imports.
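
The "Python doesn't wait for daemon threads" point is easy to check with the stdlib: at interpreter exit, daemon threads are simply abandoned, so a lingering daemon thread alone cannot keep a worker alive:

```python
import threading
import time

# A daemon thread: the interpreter will NOT join it at shutdown.
t = threading.Thread(target=lambda: time.sleep(3600), daemon=True)
t.start()
print(t.daemon, t.is_alive())  # True True; the process still exits promptly
```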

methane avatar Mar 19 '24 09:03 methane

@methane Thank you for your comment. But what I want to know is: if threads is set to 1, this doesn't seem to happen, even with max-requests set.

for example:

 uwsgi --http :8000 --wsgi-file app.py --callable application --master --enable-threads --threads 1 --processes 4 --max-requests 50

The worker stops very fast (screenshot).

Liubey avatar Mar 21 '24 02:03 Liubey