Feat: Support production deployment mode by using gunicorn.

Open zyoung1212 opened this issue 7 months ago • 23 comments

What problem does this PR solve?

This PR adds Gunicorn support to enable production-ready deployment of the RAGFlow framework, replacing the default development server with a robust WSGI HTTP server. This change significantly improves performance, stability, and concurrency handling in real-world deployments.

Type of change

  • [x] New Feature (non-breaking change which adds functionality)
  • [x] Performance Improvement

zyoung1212 avatar May 26 '25 02:05 zyoung1212

Is this error related to my PR?

zyoung1212 avatar May 26 '25 03:05 zyoung1212

Could you give me some advice, please? @KevinHuSh

zyoung1212 avatar May 26 '25 06:05 zyoung1212

Would you please check if the test cases can run in your local environment?

JinHai-CN avatar May 30 '25 05:05 JinHai-CN

I'm trying.

zyoung1212 avatar Jun 02 '25 17:06 zyoung1212

Hi team, @JinHai-CN @ZhenhangTung @KevinHuSh

Apologies for the earlier PR that failed CI. I have since:

  • Re‑worked the build – all checks now pass on my own CI runner (8 vCPU / 32 GB RAM).
  • Added Gunicorn + gevent for true production deployment.
  • Hardened thread‑safety issues that surfaced with multi‑worker Gunicorn.
  • Updated Docker entrypoint, added conf/gunicorn.conf.py, a WSGI module and pinned the required gunicorn / gevent dependencies.
  • Minor tweaks: safer table creation, clearer startup warnings, distributed‑lock for DB‑init, etc. (See full diff in this PR.)

Although the test suite is green, these changes have not yet been exercised in a live production environment. Could you please run a thorough review and, if possible, a staging deployment to confirm everything behaves as expected?

Thank you for your time and feedback!

zyoung1212 avatar Jun 04 '25 13:06 zyoung1212

Much appreciated! It's almost done. Excellent work. I've found that it might bring some unknown risks, such as this, and what @ericwu0930 encountered. Given these risks and the lack of clear resolutions, could you please keep debugging as the default starting mode?

KevinHuSh avatar Jun 06 '25 02:06 KevinHuSh

Sure, I will do that.

zyoung1212 avatar Jun 06 '25 02:06 zyoung1212

@zyoung1212 @nareshrajkumar866 I use the embedding SaaS from Qwen, and I think the monkey patch may conflict with Qwen's SDK, dashscope.

2025-06-06 09:54:03,859 ERROR 113 exception:
Traceback (most recent call last):
File "/ragflow/rag/llm/embedding_model.py", line 213, in encode_queries
resp = dashscope.TextEmbedding.call(
File "/ragflow/.venv/lib/python3.10/site-packages/dashscope/embeddings/text_embedding.py", line 46, in call
return super().call(model=model,
File "/ragflow/.venv/lib/python3.10/site-packages/dashscope/client/base_api.py", line 146, in call
return request.call()
File "/ragflow/.venv/lib/python3.10/site-packages/dashscope/api_entities/http_request.py", line 84, in call
output = next(response)
File "/ragflow/.venv/lib/python3.10/site-packages/dashscope/api_entities/http_request.py", line 303, in _handle_request
raise e
File "/ragflow/.venv/lib/python3.10/site-packages/dashscope/api_entities/http_request.py", line 286, in _handle_request
response = session.post(url=self.url,
File "/ragflow/.venv/lib/python3.10/site-packages/requests/sessions.py", line 637, in post
return self.request("POST", url, data=data, json=json, **kwargs)
File "/ragflow/.venv/lib/python3.10/site-packages/requests/sessions.py", line 589, in request
resp = self.send(prep, **send_kwargs)
File "/ragflow/.venv/lib/python3.10/site-packages/requests/sessions.py", line 703, in send
r = adapter.send(request, **kwargs)
File "/ragflow/.venv/lib/python3.10/site-packages/requests/adapters.py", line 589, in send
resp = conn.urlopen(
File "/ragflow/.venv/lib/python3.10/site-packages/urllib3/connectionpool.py", line 787, in urlopen
response = self._make_request(
File "/ragflow/.venv/lib/python3.10/site-packages/urllib3/connectionpool.py", line 464, in _make_request
self._validate_conn(conn)
File "/ragflow/.venv/lib/python3.10/site-packages/urllib3/connectionpool.py", line 1093, in _validate_conn
conn.connect()
File "/ragflow/.venv/lib/python3.10/site-packages/urllib3/connection.py", line 741, in connect
sock_and_verified = _ssl_wrap_socket_and_match_hostname(
File "/ragflow/.venv/lib/python3.10/site-packages/urllib3/connection.py", line 882, in _ssl_wrap_socket_and_match_hostname
context.verify_mode = resolve_cert_reqs(cert_reqs)
File "/usr/lib/python3.10/ssl.py", line 738, in verify_mode
super(SSLContext, SSLContext).verify_mode.

[Previous line repeated 417 more times]
RecursionError: maximum recursion depth exceeded while calling a Python object

To avoid some unknown risks, it would be better to start Gunicorn in sync worker mode.
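
A frequent cause of this kind of RecursionError in ssl.py is gevent's monkey patching being applied after ssl (or a library that imports it, such as requests or urllib3 from the traceback) has already been imported. Whether that is what happens here has not been confirmed. A minimal sketch of a check that could run just before patching; the helper name modules_loaded_too_early is hypothetical and is not part of gevent or RAGFlow:

```python
import sys

# ssl itself, plus two libraries from the traceback that import it.
SENSITIVE_MODULES = ("ssl", "urllib3", "requests")


def modules_loaded_too_early(loaded=None, sensitive=SENSITIVE_MODULES):
    """Return the sensitive modules that are already imported."""
    loaded = sys.modules if loaded is None else loaded
    return [name for name in sensitive if name in loaded]


if __name__ == "__main__":
    early = modules_loaded_too_early()
    if early:
        print("imported before monkey patching:", ", ".join(early))
    # A gevent deployment would call gevent.monkey.patch_all() only after
    # this check; the call is left out so the sketch runs without gevent.
```

If the list is non-empty in a gevent worker, moving the patch call to the very top of the entry module is usually the first thing to try.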

ericwu0930 avatar Jun 06 '25 03:06 ericwu0930

Thank you very much for your feedback. I will work on some configuration optimizations to support different Gunicorn modes.

zyoung1212 avatar Jun 06 '25 03:06 zyoung1212

@KevinHuSh @ericwu0930 Hi, I’ve just pushed some updates:

  • Added support for switching between development and production modes via a .env file (default is development).
  • If the production mode is selected, it also supports switching between sync and gevent modes.
  • All changes have passed CI checks.
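
A rough sketch of how such a switch could look, assuming the mode and worker class arrive as environment variables loaded from the .env file. The variable names RAGFLOW_SERVER_MODE and RAGFLOW_WORKER_CLASS, and the sync default in production, are illustrative assumptions, not necessarily the keys used in this PR:

```python
import os


def select_server_config(env=None):
    """Pick the server mode and Gunicorn worker class from environment values."""
    env = os.environ if env is None else env
    # Hypothetical variable names; the PR's .env keys may differ.
    mode = env.get("RAGFLOW_SERVER_MODE", "development").lower()
    if mode != "production":
        # Default: keep the development (debug) server.
        return {"mode": "development", "worker_class": None}
    # Assumed default for production: the plain sync worker.
    worker_class = env.get("RAGFLOW_WORKER_CLASS", "sync").lower()
    if worker_class not in ("sync", "gevent"):
        raise ValueError(f"unsupported worker class: {worker_class}")
    return {"mode": "production", "worker_class": worker_class}


if __name__ == "__main__":
    print(select_server_config({}))
    print(select_server_config({"RAGFLOW_SERVER_MODE": "production"}))
```

Rejecting unknown worker classes early keeps a typo in .env from silently falling back to a different mode.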

Would you mind testing again when you have a moment, especially to verify the issue you mentioned earlier? Thanks in advance! @ericwu0930

zyoung1212 avatar Jun 06 '25 06:06 zyoung1212

@zyoung1212 @KevinHuSh Hi all, while introducing Gunicorn can significantly improve concurrency performance, there may be unknown risks. I started the service with the command gunicorn --workers=20 and encountered an intermittent bug (shown below). I suspect the default MySQL connection pool size is too large (not sure) and am currently working on a fix. In summary, please be careful about merging this feature.

2025-06-10 16:00:07,182 ERROR    100 (0, '')
Traceback (most recent call last):
  File "/ragflow/.venv/lib/python3.10/site-packages/peewee.py", line 3291, in execute_sql
    cursor.execute(sql, params or ())
  File "/ragflow/.venv/lib/python3.10/site-packages/pymysql/cursors.py", line 153, in execute
    result = self._query(query)
  File "/ragflow/.venv/lib/python3.10/site-packages/pymysql/cursors.py", line 322, in _query
    conn.query(q)
  File "/ragflow/.venv/lib/python3.10/site-packages/pymysql/connections.py", line 562, in query
    self._execute_command(COMMAND.COM_QUERY, sql)
  File "/ragflow/.venv/lib/python3.10/site-packages/pymysql/connections.py", line 843, in _execute_command
    raise err.InterfaceError(0, "")
pymysql.err.InterfaceError: (0, '')

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/ragflow/.venv/lib/python3.10/site-packages/flask/app.py", line 880, in full_dispatch_request
    rv = self.dispatch_request()
  File "/ragflow/.venv/lib/python3.10/site-packages/flask/app.py", line 865, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)  # type: ignore[no-any-return]
  File "/ragflow/api/utils/api_utils.py", line 294, in decorated_function
    objs = APIToken.query(token=token)
  File "/ragflow/api/db/db_models.py", line 212, in query
    return [query_record for query_record in query_records]
  File "/ragflow/.venv/lib/python3.10/site-packages/peewee.py", line 7243, in __iter__
    self.execute()
  File "/ragflow/.venv/lib/python3.10/site-packages/peewee.py", line 2011, in inner
    return method(self, database, *args, **kwargs)
  File "/ragflow/.venv/lib/python3.10/site-packages/peewee.py", line 2082, in execute
    return self._execute(database)
  File "/ragflow/.venv/lib/python3.10/site-packages/peewee.py", line 2255, in _execute
    cursor = database.execute(self)
  File "/ragflow/.venv/lib/python3.10/site-packages/peewee.py", line 3299, in execute
    return self.execute_sql(sql, params)
  File "/ragflow/.venv/lib/python3.10/site-packages/peewee.py", line 3289, in execute_sql
    with __exception_wrapper__:
  File "/ragflow/.venv/lib/python3.10/site-packages/peewee.py", line 3059, in __exit__
    reraise(new_type, new_type(exc_value, *exc_args), traceback)
  File "/ragflow/.venv/lib/python3.10/site-packages/peewee.py", line 192, in reraise
    raise value.with_traceback(tb)
  File "/ragflow/.venv/lib/python3.10/site-packages/peewee.py", line 3291, in execute_sql
    cursor.execute(sql, params or ())
  File "/ragflow/.venv/lib/python3.10/site-packages/pymysql/cursors.py", line 153, in execute
    result = self._query(query)
  File "/ragflow/.venv/lib/python3.10/site-packages/pymysql/cursors.py", line 322, in _query
    conn.query(q)
  File "/ragflow/.venv/lib/python3.10/site-packages/pymysql/connections.py", line 562, in query
    self._execute_command(COMMAND.COM_QUERY, sql)
  File "/ragflow/.venv/lib/python3.10/site-packages/pymysql/connections.py", line 843, in _execute_command
    raise err.InterfaceError(0, "")
peewee.InterfaceError: (0, '')
2025-06-10 16:00:13,135 ERROR    100 (0, '')
Traceback (most recent call last):
  File "/ragflow/.venv/lib/python3.10/site-packages/peewee.py", line 3291, in execute_sql
    cursor.execute(sql, params or ())
  File "/ragflow/.venv/lib/python3.10/site-packages/pymysql/cursors.py", line 153, in execute
    result = self._query(query)
  File "/ragflow/.venv/lib/python3.10/site-packages/pymysql/cursors.py", line 322, in _query
    conn.query(q)
  File "/ragflow/.venv/lib/python3.10/site-packages/pymysql/connections.py", line 562, in query
    self._execute_command(COMMAND.COM_QUERY, sql)
  File "/ragflow/.venv/lib/python3.10/site-packages/pymysql/connections.py", line 843, in _execute_command
    raise err.InterfaceError(0, "")
pymysql.err.InterfaceError: (0, '')

ericwu0930 avatar Jun 10 '25 11:06 ericwu0930

@ericwu0930 thanks for raising it.

ZhenhangTung avatar Jun 10 '25 11:06 ZhenhangTung

@zyoung1212 @KevinHuSh @ZhenhangTung Another problem: a login error. After a restart, everything works normally, but after a period of time I can no longer stay logged in to RAGFlow and am logged out immediately. The log reports the following error. Any ideas?

werkzeug.exceptions.Unauthorized: 401 Unauthorized: The server could not verify that you are authorized to access the URL requested. You either supplied the wrong credentials (e.g. a bad password), or your browser doesn't understand how to supply the credentials required.
2025-06-11 10:05:02,191 WARNING  3303 load_user got exception Signature b'uoCyaMaa0mZlvg8436I9_06064s' does not match
2025-06-11 10:05:02,191 ERROR    3303 401 Unauthorized: The server could not verify that you are authorized to access the URL requested. You either supplied the wrong credentials (e.g. a bad password), or your browser doesn't understand how to supply the credentials required.
Traceback (most recent call last):
  File "/ragflow/.venv/lib/python3.10/site-packages/flask/app.py", line 880, in full_dispatch_request
    rv = self.dispatch_request()
  File "/ragflow/.venv/lib/python3.10/site-packages/flask/app.py", line 865, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)  # type: ignore[no-any-return]
  File "/ragflow/.venv/lib/python3.10/site-packages/flask_login/utils.py", line 285, in decorated_view
    return current_app.login_manager.unauthorized()
  File "/ragflow/.venv/lib/python3.10/site-packages/flask_login/login_manager.py", line 178, in unauthorized
    abort(401)
  File "/ragflow/.venv/lib/python3.10/site-packages/flask/helpers.py", line 272, in abort
    current_app.aborter(code, *args, **kwargs)
  File "/ragflow/.venv/lib/python3.10/site-packages/werkzeug/exceptions.py", line 863, in __call__
    raise self.mapping[code](*args, **kwargs)
werkzeug.exceptions.Unauthorized: 401 Unauthorized: The server could not verify that you are authorized to access the URL requested. You either supplied the wrong credentials (e.g. a bad password), or your browser doesn't understand how to supply the credentials required.

ericwu0930 avatar Jun 11 '25 02:06 ericwu0930

Sorry, I've been busy with some company projects lately and haven't had much time to think through the issue. We might need a bit more detail to get a better idea—looking forward to solving it together with you!

zyoung1212 avatar Jun 11 '25 02:06 zyoung1212

@ZhenhangTung @KevinHuSh Hi! Under what circumstances would this error occur, and in which direction should I investigate? I noticed a similar error in #7249. Moreover, this issue doesn't appear in the test environment; it only occurs in production.

ericwu0930 avatar Jun 11 '25 12:06 ericwu0930

Hello @ericwu0930, could you please tell me the startup parameters for the two issues above (apart from the known --workers=20)? Were they in sync mode or gevent mode? Also, is there currently a reliable way to reproduce these two issues?

zyoung1212 avatar Jun 11 '25 14:06 zyoung1212

sync mode.

The first issue is triggered when I request /api/v1/retrieval continuously. It was resolved by reducing MySQL's max connections to 100, although I haven't completely figured out why.

I have no idea about the other one. Apart from being kicked out immediately after login, which prevents me from using the RAGFlow website, everything works well.
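
One possible contributor to the first issue, not confirmed in this thread, is simple arithmetic: if each Gunicorn worker process keeps its own connection pool, the worst-case number of MySQL connections is the worker count times the per-worker pool limit. A small sketch under that assumption; the pool sizes are illustrative, and 151 is MySQL's default max_connections, assumed here rather than taken from this deployment:

```python
def worst_case_connections(workers: int, pool_max_per_worker: int) -> int:
    """Upper bound on connections if every worker fills its own pool."""
    return workers * pool_max_per_worker


def fits_server_limit(workers: int, pool_max_per_worker: int,
                      server_max_connections: int) -> bool:
    """True if the worst case stays within the server's max_connections."""
    return worst_case_connections(workers, pool_max_per_worker) <= server_max_connections


if __name__ == "__main__":
    workers = 20                  # from the gunicorn --workers=20 command
    server_max_connections = 151  # MySQL default; assumed for this example
    for pool_max in (100, 5):     # illustrative per-worker pool limits
        total = worst_case_connections(workers, pool_max)
        ok = fits_server_limit(workers, pool_max, server_max_connections)
        print(f"pool_max={pool_max}: up to {total} connections, fits={ok}")
```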

ericwu0930 avatar Jun 12 '25 03:06 ericwu0930

Hi @ericwu0930, I've also hit a similar 401 issue on my internal deployment before. The steps to reproduce on my side are:

  1. Start multiple instances with the development server, and use nginx to proxy to these ragflow-01 / ragflow-02 instances.
  2. Integrate with Okta SSO login and try to log in. Sometimes the login succeeds and then a 401 is thrown; sometimes it directly returns "invalid_state".

I fixed this issue by setting SECRET_KEY and introducing a Flask Redis session to share the session across multiple RAGFlow backend instances. I'll take some time to submit a separate PR to share this fix.

Colstuwjx avatar Jun 17 '25 11:06 Colstuwjx

@zyoung1212 @ericwu0930 Were you able to run the RAGFlow backend service normally? I've tried this PR and there are too many issues...

For example, in sync mode (gevent also has issues), the master process gets stuck on exit and must be killed with kill -9:

# environment: macOS M1 PRO, Python 3.10.13
$ gunicorn --config conf/gunicorn.conf.py  api.wsgi:application
...
[2025-06-18 15:10:41 +0800] [30246] [INFO] Worker exiting (pid: 30246)
[2025-06-18 15:10:41 +0800] [30240] [INFO] Worker exiting (pid: 30240)
[2025-06-18 15:10:44 +0800] [29891] [INFO] Shutting down: Master # it gets stuck here forever, and then I must kill it with kill -9
[1]    29891 killed     gunicorn --config conf/gunicorn.conf.py api.wsgi:application

Also, when I try to add a chunk by calling the POST {{host}}/api/v1/datasets/f1f8ec244abd11f09b2076cb1995044e/documents/ecf88c024abd11f09b2076cb1995044e/chunks API, it returns 500 / 504:

objc[30573]: +[MPSGraphObject initialize] may have been in progress in another thread when fork() was called.
objc[30573]: +[MPSGraphObject initialize] may have been in progress in another thread when fork() was called. We cannot safely call it or ignore it in the fork() child process. Crashing instead. Set a breakpoint on objc_initializeAfterForkError to debug.
[2025-06-18 15:13:30 +0800] [30510] [ERROR] Worker (pid:30573) was sent SIGKILL! Perhaps out of memory?
RAGFlow worker about to be forked
[2025-06-18 15:13:30 +0800] [30581] [INFO] Booting worker with pid: 30581
RAGFlow worker spawned (pid: 30581)

Colstuwjx avatar Jun 18 '25 07:06 Colstuwjx

@Colstuwjx @zyoung1212 The root cause has been identified. On June 11, I started 10 Gunicorn processes. One of them crashed and restarted on June 17. RAGFlow's JWT is derived from secret_key, but the original RAGFlow did not assign a value to secret_key, so it defaulted to the process's startup date. As a result, 9 processes had JWTs dated 2025-06-11, while the restarted one had a JWT dated 2025-06-17. If a user's login request is routed to one process but subsequent API calls hit a different process with a mismatched JWT date, the user gets kicked out.

I've solved this by setting SECRET_KEY. I don't think there is a need to introduce a Flask Redis session.
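
The failure mode can be illustrated with plain HMAC signing from the standard library. This is a stand-in to show the effect of mismatched keys, not RAGFlow's actual token code; the payload and the key strings are illustrative:

```python
import hashlib
import hmac


def sign(payload: bytes, secret_key: str) -> str:
    """Sign the payload with a key, as a signed token would be."""
    return hmac.new(secret_key.encode(), payload, hashlib.sha256).hexdigest()


def verify(payload: bytes, signature: str, secret_key: str) -> bool:
    """Check the signature using the key of the process handling the request."""
    return hmac.compare_digest(sign(payload, secret_key), signature)


# Keys that fall back to each process's start date.
old_worker_key = "2025-06-11"
restarted_worker_key = "2025-06-17"

# The login request lands on a worker started on June 11.
signature = sign(b"user-id-42", old_worker_key)

print(verify(b"user-id-42", signature, old_worker_key))        # True
print(verify(b"user-id-42", signature, restarted_worker_key))  # False
```

With one explicit SECRET_KEY shared by all workers, both checks would use the same key and succeed regardless of when each process started.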

ericwu0930 avatar Jun 18 '25 11:06 ericwu0930

@KevinHuSh @zyoung1212 @Colstuwjx By the way, after upgrading from v0.18 to v0.19, I noticed a gradual memory increase in the RAGFlow container. The process killed on June 17 was likely due to this issue. I don't believe this is related to the introduction of Gunicorn; rather, using Gunicorn may have amplified an existing problem.

ericwu0930 avatar Jun 18 '25 11:06 ericwu0930

@Colstuwjx I didn’t use the author’s original code entirely—this is my own implementation, which can boot normally in sync mode. https://github.com/infiniflow/ragflow/compare/main...ericwu0930:ragflow:feature/support-gunicorn

ericwu0930 avatar Jun 18 '25 11:06 ericwu0930

Hi @ericwu0930, thanks for kindly sharing. I found that the issues I mentioned before were caused by the package imports in this code; after moving them inside the post_fork function, the Gunicorn workers start normally in my local environment (macOS, M1 Pro).

However, when I benchmark the Gunicorn workers, e.g. by continuously posting create-chunk requests to the RAGFlow HTTP endpoint, the workers crash and the benchmark client reports 502 errors like the ones below:

0b92e5637d18efa72/chunks": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
[Csv Row number 2593] Worker 34: Request failed: Post "https://ragflow.devcorp.net/api/v1/datasets/c791f5224c3211f098255637d18efa72/documents/cfe131704c3211f0b92e5637d18efa72/chunks": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
[Csv Row number 2637] Worker 99: Received non-200 response: 502
[Csv Row number 2636] Worker 77: Received non-200 response: 502
[Csv Row number 2595] Worker 61: Request failed: Post "https://ragflow.devcorp.net/api/v1/datasets/c791f5224c3211f098255637d18efa72/documents/cfe131704c3211f0b92e5637d18efa72/chunks": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
[Csv Row number 2597] Worker 7: Request failed: Post "https://ragflow.devcorp.net/api/v1/datasets/c791f5224c3211f098255637d18efa72/documents/cfe131704c3211f0b92e5637d18efa72/chunks": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
[Csv Row number 2603] Worker 35: Request failed: Post "https://ragflow.devcorp.net/api/v1/datasets/c791f5224c3211f098255637d18efa72/documents/cfe131704c3211f0b92e5637d18efa72/chunks": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
[Csv Row number 2602] Worker 98: Request failed: Post "https://ragflow.devcorp.net/api/v1/datasets/c791f5224c3211f098255637d18efa72/documents/cfe131704c3211f0b92e5637d18efa72/chunks": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

When I switch back to the original development server, it is even more stable, with fewer timeout errors, although the process also eventually crashes.

Colstuwjx avatar Jun 18 '25 11:06 Colstuwjx

Hi, would you be able to share your benchmark? I'll spend some more time troubleshooting the issue later, and I'm also looking forward to optimizing it together with you.

zyoung1212 avatar Jun 19 '25 05:06 zyoung1212

ragflow-perf-test.go.txt

Hi @zyoung1212, you can use this for the benchmark. I used different, internal CSV data for my perf test, but the behavior is similar with this script.

Colstuwjx avatar Jun 19 '25 07:06 Colstuwjx

@Colstuwjx @zyoung1212 The root cause has been identified. On June 11, I started 10 Gunicorn processes. One of them crashed and restarted on June 17. RagFlow's JWT is derived from secret_key, but the original RagFlow did not assign a value to secret_key, so it defaulted to the process's startup date. As a result, 9 machines had JWTs dated 2025-06-11, while the new one had a JWT dated 2025-06-17. If a user's login request is routed to one process but subsequent API calls hit a different process with a mismatched JWT date, they get kicked out. 企业微信截图_17502420362075 I've solved this by setting SECRET_KEY. I think it's no need to introduce flask redis session.

@KevinHuSh @zyoung1212 @Colstuwjx By the way, after upgrading from v0.18 to v0.19, I noticed a gradual memory increase in the ragflow container. The process that was killed on June 17 was likely a victim of this issue. I don't believe this is related to the introduction of Gunicorn; rather, using Gunicorn may have amplified an existing problem.


Related issue: #7995.

ericwu0930 avatar Jun 19 '25 08:06 ericwu0930

RAGFlow version: 0.19.0. Hi, when I tried to deploy the service with Gunicorn, I ran into some problems. I added the following code to replace run_simple() in the script ragflow_server.py:

# start http server
try:
    logging.info("RAGFlow HTTP server start...")
    from gunicorn.app.base import BaseApplication
    class StandaloneApplication(BaseApplication):
        def __init__(self, app, options=None):
            self.options = options or {}
            self.application = app
            super().__init__()

        def load_config(self):
            config = {key: value for key, value in self.options.items()
                      if key in self.cfg.settings and value is not None}
            for key, value in config.items():
                self.cfg.set(key.lower(), value)

        def load(self):
            return self.application

    options = {
        'bind': f"{settings.HOST_IP}:{settings.HOST_PORT}",
        'workers': 10,  # recommended: CPU cores * 2 + 1
        'worker_class': 'sync',  # 'gevent' would improve concurrency; 'sync' is used here
        'timeout': 60,
        'pidfile': "gunicorn.pid",  # PID file
        'accesslog': "access.log",  # write the access log to this file
        'errorlog': 'error.log',  # write the error log to this file
        'max_requests': 3000,
        'max_requests_jitter': 300
    }

    StandaloneApplication(app, options).run()

    # run_simple(
    #     hostname=settings.HOST_IP,
    #     port=settings.HOST_PORT,
    #     application=app,
    #     threaded=True,
    #     use_reloader=RuntimeConfig.DEBUG,
    #     use_debugger=RuntimeConfig.DEBUG,
    # )
except Exception:
    traceback.print_exc()
    stop_event.set()
    time.sleep(1)
    os.kill(os.getpid(), signal.SIGKILL)
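
The options above hard-code 10 workers, while the accompanying comment recommends CPU cores * 2 + 1, which is also the rule of thumb given in the Gunicorn documentation. As a rough sketch, the value could be derived from the host instead; the helper name `recommended_workers` is hypothetical and not part of the code above.

```python
import multiprocessing


def recommended_workers(cpu_cores: int) -> int:
    """Apply the (2 x cores) + 1 rule of thumb for the number of worker processes."""
    if cpu_cores < 1:
        raise ValueError("cpu_cores must be at least 1")
    return cpu_cores * 2 + 1


# Value that could be passed as options['workers'] instead of a fixed number.
workers = recommended_workers(multiprocessing.cpu_count())
```

This is only a starting point; the right number still depends on the workload and on available memory.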

Problems:

1. Login authentication (screenshot)
2. Clicking the chat button raises a MySQL error (screenshots)
3. In a conversation, the HTTP request gets stuck and then raises a 500 internal error (screenshot)

Archerzlt avatar Jun 24 '25 02:06 Archerzlt

Based on https://github.com/infiniflow/ragflow/compare/main...ericwu0930:ragflow:feature/support-gunicorn, modify conf/gunicorn.conf.py and add

# Gunicorn configuration file for RAGFlow production deployment
import os
if int(os.environ.get('GUNICORN_GEVENT_MODE',0)) == 1:
    from gevent import monkey
    monkey.patch_all()

    # Import gevent for greenlet spawning
    import gevent
    from gevent import spawn
    USE_GEVENT = True
else:
    USE_GEVENT = False

to the beginning of the file. This solves the SSL recursion problem.

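Then modify api/wsgi.py. Environment variables are read as strings, so the original comparison against the integer 1 never matches; change if os.environ.get('GUNICORN_GEVENT_MODE',0) == 1: to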
if int(os.environ.get('GUNICORN_GEVENT_MODE',0)) == 1:
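
To illustrate why the int() conversion matters, here is a small self-contained sketch. The function names `gevent_enabled_broken` and `gevent_enabled_fixed` are hypothetical, and a plain dict stands in for os.environ so the example does not touch the real environment.

```python
from typing import Mapping


def gevent_enabled_broken(env: Mapping[str, str]) -> bool:
    # Environment values are strings, so "1" == 1 is always False.
    return env.get("GUNICORN_GEVENT_MODE", 0) == 1


def gevent_enabled_fixed(env: Mapping[str, str]) -> bool:
    # Convert to int first; the default 0 covers the unset case.
    return int(env.get("GUNICORN_GEVENT_MODE", 0)) == 1


env_on = {"GUNICORN_GEVENT_MODE": "1"}
env_unset: dict = {}

broken_result = gevent_enabled_broken(env_on)      # flag is set, but the check misses it
fixed_result = gevent_enabled_fixed(env_on)        # flag is detected
unset_result = gevent_enabled_fixed(env_unset)     # unset variable stays disabled
```

Note that int() raises ValueError for non-numeric values such as "true", so this check assumes the variable is only ever set to a number.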

xujiesh0510 avatar Aug 22 '25 02:08 xujiesh0510

When starting with Gunicorn, multiple workers are enabled (18 workers are configured locally).

When using RagFlow, it redirects to the login page after one or two clicks.

By adding logs, it was found that with different workers and the same query conditions, some can retrieve results while others cannot.

——————————

My temporary solution is to not use a database connection pool.


mcj2761358 avatar Sep 11 '25 06:09 mcj2761358

When will this feature be released?

leecj avatar Sep 11 '25 07:09 leecj