mod_wsgi icon indicating copy to clipboard operation
mod_wsgi copied to clipboard

Truncated or oversized response headers received from daemon process

Open FlaskDiploy opened this issue 1 year ago • 6 comments

I am using flask application which is deployed on AWS EC2 and using mod_wsgi version 4.9.0 installled via pip. Pyton version is 3.7.9.

I have daemon processes per flask instances and it is having no issues however for random apis that too not very frequent it is getting internal server error. After debugging it in deeper, found out that I can see errors Truncated or oversized response headers received from daemon process for such apis. Though these api after rerunning are working fine.

I checked where I am using any binary version of the python library like psycog-binary but we are not using it.

We are also using WSGIApplicationGroup %{GLOBAL}

   <Directory /var/www/app/{app_name}/{instance_name}/service>
            WSGIProcessGroup appname{app_name}{instance_name}
            WSGIApplicationGroup %{GLOBAL}
            WSGIScriptReloading On
            Order deny,allow
            Allow from all
            AllowOverride All
           	Header always append X-Frame-Options SAMEORIGIN
            Header add Content-Security-Policy "frame-ancestors 'self'"
            Header add Cache-Control "no-cache, no-store"
			Header set X-XSS-Protection "1; mode=block"
			Header set X-Content-Type-Options nosniff
    </Directory>

We are having WSGIDaemonProcess appname{app_name}{instance_name} user=ec2-user {processes} {threads} python-home=/var/venv/myproduct/{folder_name} {request_timeout} where processes is null and threads are 5 and request timeout is 60

Please check the requirement.txt file:

Packagenanme @ file:///app/buildagent/work/67989a9dc0f7048b/XYZ/scripts/Packagename-1.0.0.tar.gz
asn1crypto==1.5.1
attrs==23.2.0
awscli==1.29.82
azure-core==1.30.1
azure-storage-blob==12.15.0
Babel==2.11.0
backports-abc==0.5
backports.ssl-match-hostname==3.5.0.1
bcrypt==4.0.0
beautifulsoup4==4.7.1
boto3==1.28.77
botocore==1.31.82
bytecode==0.13.0
cattrs==22.2.0
Cerberus==1.3.4
certifi==2023.7.22
cffi==1.14.3
chardet==3.0.4
charset-normalizer==3.3.2
click==8.1.3
colorama==0.4.3
configparser==4.0.2
cramjam==2.8.3
croniter==1.3.15
cryptography==36.0.0
DbManager==1.0.3
ddsketch==2.0.4
ddtrace==1.7.3
Deprecated==1.2.14
dictdiffer==0.9.0
docutils==0.15.2
enum34==1.1.6
envier==0.4.0
et-xmlfile==1.0.1
exceptiongroup==1.2.1
Fabric==1.13.2
fastparquet==0.7.1
filelock==3.12.2
Flask==1.0.2
Flask-Cors==3.0.10
flask-talisman==0.7.0
Flask-WTF==1.0.1
FormsValidator @ file:///app/buildagent/work/67989a9dc0f7048b/XYZ/scripts/FormsValidator-1.0.1.tar.gz
fsspec==2023.1.0
future==0.18.3
google==2.0.2
help-center==1.0
html5lib==1.1
IDMAuthorizer==7.0.0
idna==2.7
importlib-metadata==6.7.0
iniparse==0.4
invoke==1.3.0
ipaddress==1.0.22
isodate==0.6.1
itsdangerous==1.1.0
JayDeBeApi==1.1.1
jdcal==1.4.1
Jinja2==3.0.3
jmespath==0.10.0
JPype1==0.6.3
json-table-schema==0.2.1
jsonmerge==1.8.0
jsonpatch==1.2
jsonpointer==1.9
jsonschema==2.5.1
jwcrypto==1.4
llvmlite==0.31.0
lml==0.1.0
lockfile==0.9.1
logstash-formatter==0.5.16
lxml==4.9.2
MarkupSafe==2.1.1
messytables==0.15.2
mysql-connector-python==8.0.28
Naked==0.1.31
nose==1.3.7
numba==0.49.0
numpy==1.20.3
oauthlib==2.0.1
openpyxl==3.0.0
oscrypto==1.3.0
packaging==24.0
pandas==1.3.5
paramiko==2.10.1
pbr==5.4.4
pkg_resources==0.0.0
ply==3.4
protobuf==4.21.12
psutil==5.7.0
psycopg2==2.7.5
pyarrow==8.0.0
pyasn1==0.4.8
pycparser==2.18
pycryptodome==3.17
pycryptodomex==3.20.0
pyexcel==0.5.12
pyexcel-io==0.6.4
pyexcel-xls==0.5.8
PyExcelerate==0.10.0
PyHive==0.6.1
PyJWT==2.6.0
PyNaCl==1.4.0
pyOpenSSL==22.0.0
pytest-runner==6.0.0
python-dateutil==2.8.1
python-logstash==0.4.6
python-magic==0.4.18
python-snappy==0.5.4
pytz==2022.7.1
PyYAML==5.4
repoze.lru==0.4
requests==2.31.0
RevoLogger==1.0.4
rsa==4.7
s3transfer==0.7.0
Salesforce-FuelSDK==1.3.0
sasl==0.3.1
shellescape==3.4.1
simple-salesforce==0.74.3
simplejson==3.2.0
singledispatch==3.4.0.3
six==1.15.0
snowflake-connector-python==3.0.4
sortedcontainers==2.4.0
soupsieve==2.4.1
sqlparse==0.2.4
stevedore==1.31.0
suds==1.1.2
suds-jurko==0.6
tenacity==8.2.3
texttable==1.6.3
thrift==0.13.0
thrift-sasl==0.3.0
typing_extensions==4.7.1
urllib3==1.26.14
webencodings==0.5.1
Werkzeug==1.0.1
wrapt==1.16.0
WTForms==2.3.3
xlrd==1.2.0
XlsxWriter==1.1.8
xlwt==1.3.0
xmltodict==0.13.0
zipp==3.15.0

I have checked few things around the error and founs out wsgi_scan_headers method in the mod_wsgi packages where this error was getting generated. I think Set-Cookie and WWW-Authenticate should get overlapped and there I started checking how we are creating responses in our application.

    def create_flat_response(self, user_detail=None, should_set_csrf_cookie=False, encode=False):
        from flask import make_response
        compressed_payload_length = None
        if encode:
            compressed_payload = gzip.compress(json_serialize_object(self.payload).encode('utf8'),self.response_compression_level)
            temp_response = make_response(compressed_payload)
            compressed_payload_length = len(compressed_payload)
        else:
            temp_response = make_response(json_serialize_object(self.payload))
        if should_set_csrf_cookie:
            from flask_wtf.csrf import generate_csrf
            csrf_token = generate_csrf()
            self.payload[ApiResponseHeaderConstants.XsrfConstants.XSRF_TOKEN] = csrf_token
            if not encode:
                temp_response = make_response(json_serialize_object(self.payload))
            temp_response.set_cookie(
                ApiResponseHeaderConstants.XsrfConstants.XSRF_TOKEN, csrf_token, secure=True, httponly=True)
        response = temp_response
        response = self.set_response_headers(response, encode, compressed_payload_length)
        self.set_cookies(user_detail=user_detail, response=response)
        return response

In set_cookies we are just setting content length and encoding type in case of encode True. Is there any issue in this too as we are compressing main payload and then setting cooked and response headers?

FlaskDiploy avatar Sep 03 '24 16:09 FlaskDiploy

If you are setting WWW-Authenticate one possibility is that the length of the value for the header is too long and exceeds default Apache allowed buffer size for headers. You can make this larger using the header-buffer-size option to WSGIDaemonProcess. The docs say:

header-buffer-size=nnn Defines the maximum size that a response header/value can be that is returned from a WSGI application. The default size is 32768 bytes. This might need to be overridden where excessively large response headers are returned, such as in custom authentication challenge schemes which use the WWW-Authenticate header.

When that size is exceeded, you will see that error message.

The other reason that error can occur is the daemon process crashed. Check the main Apache error logs to see if there are any errors about segmentation fault, bus error or core dump indicating it crashed.

If you set Apache log level to info rather than err or warn, then a crash will be more evident due to mod_wsgi messages about the process dying and being restarted.

GrahamDumpleton avatar Sep 03 '24 21:09 GrahamDumpleton

Thank you fir your prompt response, however for point 1, we are not using WWW-Authenticate. Still can it be a possibility of oversized headers in this case and if yes then is there any way with which we can come to a number for header-buffer-size?

Point 2, we will try to fetch the logs and will be sharing it soon if since it is not getting reproducible very frequently.

we are also using logger module in our flask app in which we have set stream and file buffer

"level": "WARNING",
    "handler_list": [
      {
        "type": "Stream",
        "level": "debug",
        "format": "%(levelname)s %(asctime)s %(message)s",
        "buffer_config": {
          "buffer_size": 100,
          "flush_level": "warning"
        }
      },
      {
        "type": "File",
        "level": "info",
        "filename": "/tmp/bms.log",
        "format": "%(asctime)s %(logfilepath)s:%(loglineno)d %(message)s",
        "rotate_file_config": {
          "backup_count": 10,
          "max_filesize_in_megabytes": 100
        },
        "buffer_config": {
          "buffer_size": 100,
          "flush_level": "error"
        }
      }
    ]
  },

Is there any relation of the above logging with this error?

FlaskDiploy avatar Sep 04 '24 00:09 FlaskDiploy

Cookies, or any other response header, which are too large can also be an issue. Just try setting header-buffer-size to be 131072 or larger and see if issue goes away.

See if there is a way in Flask to get access to response headers before they are sent and see how big they are. If you can't work out how to do that with Flask, then use a WSGI middleware as described in:

  • https://modwsgi.readthedocs.io/en/master/user-guides/debugging-techniques.html#tracking-request-and-response

GrahamDumpleton avatar Sep 04 '24 01:09 GrahamDumpleton

Sorry for the late reply as I was having constant checks on the logs.

This issue is quite intermediate. I have added below code to print the header info in the flask middleware but observed header size equals to only ~68 bytes for such instances.

@app.after_request
def _set_response_headers(response):
    logger = get_default_logger(__name__)
    logger.exception("####### I am in response middleware ######")
    header_size = sum(len(key) + len(value) for key, value in response.headers.items())
    logger.exception(f"Total header size: {header_size} bytes")
    logger.exception("####### exiting response middleware ######")
    return response

I have also changed the loglevel to info but not getting anything with this occurance. Changed it to debug mode and used WSGIVerboseDebugging On but not getting much out of it too.

Anything more we can check here?

FlaskDiploy avatar Sep 05 '24 18:09 FlaskDiploy

While looking at the logs in more details, I have found few prominent errors :

  1. [Sat Sep 07 12:31:10.065343 2024] [wsgi:error] [pid 16837:tid 16837] [client 10.235.52.125:63822] Truncated or oversized response headers received from daemon process
  2. [Sat Sep 07 12:26:52.280467 2024] [wsgi:error] [pid 16837:tid 16837] [client 10.235.52.125:6534] Timeout when reading response headers from daemon process
  3. [Sat Sep 07 12:43:48.317580 2024] [reqtimeout:info] [pid 16837:tid 16837] [client 10.235.52.58:55114] AH01382: Request header read timeout
  4. [Sat Sep 07 12:43:59.749765 2024] [core:debug] [pid 16837:tid 16837] protocol.c(1482): [client 10.235.52.58:55114] AH00567: request failed: error reading the headers
  5. Daemon process request time limit exceeded, stopping process

However, i am actively monitoring the request and reponse but it is under control and having only few bytes.

WSGIDaemonProcess appname{app_name}{instance_name} user=ec2-user {processes} {threads} python-home=/var/venv/myproduct/{folder_name} {request_timeout} where processes is null and threads are 5 and request timeout is 60. Is process as 1 can cause above issue?

FlaskDiploy avatar Sep 07 '24 16:09 FlaskDiploy

If you are getting request time limited exceeded errors it makes more sense now. You aren't showing all the actual error messages with date time stamps together from the logs though, so you are making it harder for me to confirm.

Is the form of the error message you are getting:

                ap_log_error(APLOG_MARK, APLOG_INFO, 0, wsgi_server,
                             "mod_wsgi (pid=%d): Daemon process request "
                             "time limit exceeded, stopping process "
                             "'%s'.", getpid(), group->name);

and does this happen at the same time as the other error messages?

If this is the case, what is happening is that the request timeout is being exceeded. It will trigger the same error in Apache worker processes because daemon process is being forcibly restarted due to the timeout. If it had only been the error you originally quoted it would have been situations already explained. Thus very important in this case to also see the request timeout message.

From the docs:

request-timeout=sss

Defines the maximum number of seconds that a request is allowed to run before the daemon process is restarted. This can be used to recover from a scenario where a request blocks indefinitely, and where if all request threads were consumed in this way, would result in the whole WSGI application process being blocked.

How this option is seen to behave is different depending on whether a daemon process uses only one thread, or more than one thread for handling requests, as set by the threads option.

If there is only a single thread, and so the process can only handle one request at a time, as soon as the timeout has passed, a restart of the process will be initiated.

If there is more than one thread, the request timeout is applied to the average running time for any requests, across all threads. This means that a request can run longer than the request timeout. This is done to reduce the possibility of interupting other running requests, and causing a user to see a failure. So where there is still capacity to handle more requests, restarting of the process will be delayed if possible.

Basically a request is getting stuck, either thread deadlocked, or waiting on a back end service or database. This is a fail safe restart.

If mod_wsgi is working as intended, you should get a dump of all the Python stack traces on a request timeout so you can try and work out which request was stuck and why. Are you seeing any Python stack traces dumped in the error log when seeing the request timeout error message?

GrahamDumpleton avatar Sep 08 '24 01:09 GrahamDumpleton