
SSL handshake timed out and channel closed exceptions

Open SG87 opened this issue 4 years ago • 4 comments

The simplified code

import os
from elasticsearch import AsyncElasticsearch
from fastapi import FastAPI, HTTPException, Depends, status
import secrets
from fastapi.security import HTTPBasic, HTTPBasicCredentials
import uvicorn

application = FastAPI()
security = HTTPBasic()

es_client = AsyncElasticsearch([os.getenv('ES_URI')], maxsize=10000)
es_index = os.getenv('ES_INDEX')
es_doc_type = os.getenv('ES_DOC_TYPE')

def authorize(credentials: HTTPBasicCredentials = Depends(security)):
    correct_username = secrets.compare_digest(credentials.username, os.getenv("BASIC_AUTH_USERNAME"))
    correct_password = secrets.compare_digest(credentials.password, os.getenv("BASIC_AUTH_PASSWORD"))
    if not (correct_username and correct_password):
        raise HTTPException(
            status_code=status.HTTP_401_UNAUTHORIZED,
            detail="Incorrect username or password",
            headers={"WWW-Authenticate": "Basic"},
        )
    return True

query = {
    'query': {
        'function_score': {
            'query': {
                'bool': ...
            },
            'score_mode': 'sum',
            'boost_mode': 'sum'
        }
    },
    'size': 6,
    '_source': [...]
}


@application.get("/api/v1/smart-search", dependencies=[Depends(authorize)])
async def search(size: int = 12, model: str = None, carSize: int = None):
    ...
    es_res = await es_client.search(
        index=es_index,
        doc_type=es_doc_type,
        body=query,
        max_concurrent_shard_requests=5000,
    )
    return es_res

if __name__ == "__main__":
    uvicorn.run(application, host="0.0.0.0", port=5000)
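As an aside, the `authorize` dependency above relies on `secrets.compare_digest`, which compares the two strings in (approximately) constant time so an attacker cannot learn the credential character by character from response timing. A minimal standalone illustration, independent of the app above:

```python
import secrets

# compare_digest returns True only on an exact match, and takes roughly
# the same time regardless of where the strings first differ, which is
# why it is preferred over == for comparing credentials.
print(secrets.compare_digest("hunter2", "hunter2"))  # True
print(secrets.compare_digest("hunter2", "hunter3"))  # False
```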

Description

The above code is simplified: it builds a dynamic ES query, executes it using AsyncElasticsearch, and parses and returns the result. I tried multiple configurations, among others:

  • maxsize=10000 in AsyncElasticsearch client definition
  • max_concurrent_shard_requests in search
  • etc.

The application runs in docker. This is the dockerfile CMD:

CMD gunicorn --do-handshake-on-connect --worker-class uvicorn.workers.UvicornWorker --worker-connections 1000 --workers 3 --bind 0.0.0.0:5000 api

I tried running with:

  • uvicorn standalone and gunicorn with uvicorn worker as worker class.
  • Different number of workers, worker-connections, etc.
  • With and without `--do-handshake-on-connect`
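For reference, the two launch modes mentioned above might look like this (the module path `api`, the `application` attribute name, and the port are taken from the code and Dockerfile CMD in this issue; adjust to your setup):

```shell
# gunicorn with the uvicorn worker class (as in the Dockerfile CMD);
# gunicorn's app spec "api" defaults to the `application` attribute
gunicorn --worker-class uvicorn.workers.UvicornWorker \
         --workers 3 --worker-connections 1000 \
         --bind 0.0.0.0:5000 api

# standalone uvicorn serving the same application object;
# uvicorn needs the attribute spelled out as module:attribute
uvicorn api:application --host 0.0.0.0 --port 5000 --workers 3
```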

All this runs fine for:

  • Local setup (Run API local in docker with local ES instance) with as many concurrent users as wanted
  • API running in a Kubernetes environment (2 to 5 pods) with an AWS c5.xlarge.elasticsearch ES cluster, up to 60 concurrent users

The API starts failing when running in the Kubernetes environment (2 to 5 pods) against the AWS c5.xlarge.elasticsearch ES cluster at about 70 concurrent users or more. The load generated on the API (pods) is not too high and the load on ES is low.

Then about 70% of the calls fail, mainly with the following errors:

  • SSLException: handshake timed out (59%)
  • ClosedChannelException (36%)

I don't have more information, because this is all that is reported by the Gatling load-testing tool.

Environment

requirements.txt

elasticsearch[async]>=7.12.0
starlette==0.13.6
elastic-apm==5.8.1
fastapi==0.60.1
uvicorn==0.11.8
gunicorn==20.1.0
  • Python version: 3.6

FROM python:3.6-alpine

SG87 avatar Apr 24 '21 12:04 SG87

I'm not familiar with do-handshake-on-connect in gunicorn, but from reading their docs it seems it's passed down to the wrap_socket method, so there's something I don't get: why don't you have certs / a keyfile? Do you have the same behaviour without this flag set? In any case a minimal reproducible example would help; without one I doubt we can get to the bottom of it.

euri10 avatar May 20 '21 16:05 euri10

@euri10

  • The behavior is the same with and without the do-handshake-on-connect.
  • The application runs in Kubernetes where the certs stuff is handled.

For now we spin up multiple instances of the service to bypass the issue.

Unfortunately I cannot share the connection string to Elasticsearch as this contains proprietary information.

SG87 avatar May 21 '21 08:05 SG87

You don't have at least the full traceback? It's rather hard to get a feel for what's happening with half the picture.

euri10 avatar May 27 '21 13:05 euri10

I'm mentioning the above PR just in case, @SG87; it might be totally unrelated, but:

  1. I'm closing the transport in the PR above in order to fix a "not properly closed" resource warning in the ssl tests,
  2. your issue looks related in the sense that above a certain level of concurrency you have ssl failures.

Hope that makes sense :dromedary_camel:
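Related to the transport-closing point in (1), one thing worth checking on the application side is whether the AsyncElasticsearch client is ever closed: it exposes an async `close()` method, and a FastAPI shutdown hook can call it so worker restarts don't leak half-open TLS connections. Sketched here with a stub client so the snippet is self-contained (the real app would use the `es_client` from the issue):

```python
import asyncio

class StubAsyncClient:
    """Stand-in for AsyncElasticsearch: holds a connection pool
    that must be released with an async close()."""
    def __init__(self) -> None:
        self.closed = False

    async def close(self) -> None:
        # AsyncElasticsearch.close() awaits the transport teardown,
        # which releases the underlying (TLS) connections.
        self.closed = True

es_client = StubAsyncClient()

# In the real FastAPI app this would be registered as:
#
#   @application.on_event("shutdown")
#   async def shutdown() -> None:
#       await es_client.close()
#
async def shutdown() -> None:
    await es_client.close()

asyncio.run(shutdown())
print(es_client.closed)  # True
```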

euri10 avatar May 28 '21 08:05 euri10

Closing this as stale. Feel free to open a new issue with an MRE. :pray:

Kludex avatar Oct 28 '22 10:10 Kludex