
SSL handshake timed out and channel closed exceptions

Open SG87 opened this issue 4 years ago • 4 comments

The simplified code

import os
from elasticsearch import AsyncElasticsearch
from fastapi import FastAPI, HTTPException, Depends, status
import secrets
from fastapi.security import HTTPBasic, HTTPBasicCredentials
import uvicorn

application = FastAPI()
security = HTTPBasic()

es_client = AsyncElasticsearch([os.getenv('ES_URI')], maxsize=10000)
es_index = os.getenv('ES_INDEX')
es_doc_type = os.getenv('ES_DOC_TYPE')

def authorize(credentials: HTTPBasicCredentials = Depends(security)):
    correct_username = secrets.compare_digest(credentials.username, os.getenv("BASIC_AUTH_USERNAME"))
    correct_password = secrets.compare_digest(credentials.password, os.getenv("BASIC_AUTH_PASSWORD"))
    if not (correct_username and correct_password):
        raise HTTPException(
            status_code=status.HTTP_401_UNAUTHORIZED,
            detail="Incorrect username or password",
            headers={"WWW-Authenticate": "Basic"},
        )
    return True

query = {
    'query': {
        'function_score': {
            'query': {
                'bool': ...
            },
            'score_mode': 'sum',
            'boost_mode': 'sum'
        }
    },
    'size': 6,
    '_source': [...]
}


@application.get("/api/v1/smart-search", dependencies=[Depends(authorize)])
async def search(size: int = 12, model: str = None, carSize: int = None):
    ...
    es_res = await es_client.search(
        index=es_index,
        doc_type=es_doc_type,
        body=query,
        max_concurrent_shard_requests=5000,
    )
    return es_res

if __name__ == "__main__":
    uvicorn.run(application, host="0.0.0.0", port=5000)
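As an aside, the `authorize` dependency above relies on `secrets.compare_digest`, which compares the two strings in (approximately) constant time so an attacker cannot learn the credential character by character from response timing. A minimal standalone illustration, independent of the app above:

```python
import secrets

# compare_digest returns True only on an exact match, and takes roughly
# the same time regardless of where the strings first differ, which is
# why it is preferred over == for comparing credentials.
print(secrets.compare_digest("hunter2", "hunter2"))  # True
print(secrets.compare_digest("hunter2", "hunter3"))  # False
```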

Description

The above code is simplified: it builds a dynamic ES query, executes it using AsyncElasticsearch, and parses and returns the result. I tried multiple configurations, among others:

  • maxsize=10000 in AsyncElasticsearch client definition
  • max_concurrent_shard_requests in search
  • etc.

The application runs in docker. This is the dockerfile CMD:

CMD gunicorn --do-handshake-on-connect --worker-class uvicorn.workers.UvicornWorker --worker-connections 1000 --workers 3 --bind 0.0.0.0:5000 api

I tried running with:

  • uvicorn standalone and gunicorn with uvicorn worker as worker class.
  • Different number of workers, worker-connections, etc.
  • With and without `--do-handshake-on-connect`
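For reference, the two launch modes mentioned above might look like this (the module path `api`, the `application` attribute name, and the port are taken from the code and Dockerfile CMD in this issue; adjust to your setup):

```shell
# gunicorn with the uvicorn worker class (as in the Dockerfile CMD);
# gunicorn's app spec "api" defaults to the `application` attribute
gunicorn --worker-class uvicorn.workers.UvicornWorker \
         --workers 3 --worker-connections 1000 \
         --bind 0.0.0.0:5000 api

# standalone uvicorn serving the same application object;
# uvicorn needs the attribute spelled out as module:attribute
uvicorn api:application --host 0.0.0.0 --port 5000 --workers 3
```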

All this runs fine for:

  • Local setup (Run API local in docker with local ES instance) with as many concurrent users as wanted
  • API running in a Kubernetes environment (2 to 5 pods) with an AWS c5.xlarge.elasticsearch ES cluster, up to 60 concurrent users

The API starts failing when running in the Kubernetes environment (2 to 5 pods) against the AWS c5.xlarge.elasticsearch ES cluster at about 70 concurrent users or more. The load generated on the API (pods) is not too high and the load on ES is low.

Then about 70% of the calls fail, mainly with the following errors:

  • SSLException: handshake timed out (59%)
  • ClosedChannelException (36%)

I don't have more information, because this is all that is reported by the Gatling load-testing tool.

Environment

requirements.txt

elasticsearch[async]>=7.12.0
starlette==0.13.6
elastic-apm==5.8.1
fastapi==0.60.1
uvicorn==0.11.8
gunicorn==20.1.0
  • Python version: 3.6

FROM python:3.6-alpine

SG87 avatar Apr 24 '21 12:04 SG87

I'm not familiar with do-handshake-on-connect in gunicorn, but from reading their docs it seems it's passed down to the wrap_socket method, so there's something I don't get: why don't you have certs / a keyfile? Do you have the same behaviour without this flag set? In any case a minimal reproducible example would help; without one I doubt we can get to the bottom of it.

euri10 avatar May 20 '21 16:05 euri10

@euri10

  • The behavior is the same with and without the do-handshake-on-connect.
  • The application runs in Kubernetes where the certs stuff is handled.

For now we spin up multiple instances of the service to bypass the issue.

Unfortunately I cannot share the connection string to Elasticsearch as this contains proprietary information.

SG87 avatar May 21 '21 08:05 SG87

You don't have at least the full traceback? It's rather hard to get a feel for what's happening with half the picture.

euri10 avatar May 27 '21 13:05 euri10

I'm mentioning the above PR just in case, @SG87; it might be totally unrelated, but:

  1. I'm closing the transport in the PR above in order to fix a "not properly closed" resource warning in the ssl tests,
  2. your issue looks related in the sense that above a certain level of concurrency you have ssl failures.

Hope that makes sense :dromedary_camel:
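Related to the transport-closing point in (1), one thing worth checking on the application side is whether the AsyncElasticsearch client is ever closed: it exposes an async `close()` method, and a FastAPI shutdown hook can call it so worker restarts don't leak half-open TLS connections. Sketched here with a stub client so the snippet is self-contained (the real app would use the `es_client` from the issue):

```python
import asyncio

class StubAsyncClient:
    """Stand-in for AsyncElasticsearch: holds a connection pool
    that must be released with an async close()."""
    def __init__(self) -> None:
        self.closed = False

    async def close(self) -> None:
        # AsyncElasticsearch.close() awaits the transport teardown,
        # which releases the underlying (TLS) connections.
        self.closed = True

es_client = StubAsyncClient()

# In the real FastAPI app this would be registered as:
#
#   @application.on_event("shutdown")
#   async def shutdown() -> None:
#       await es_client.close()
#
async def shutdown() -> None:
    await es_client.close()

asyncio.run(shutdown())
print(es_client.closed)  # True
```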

euri10 avatar May 28 '21 08:05 euri10

Closing this as stale. Feel free to open a new issue with an MRE. :pray:

Kludex avatar Oct 28 '22 10:10 Kludex