bug: Self-hosted container crashes due to random CPU spikes
Describe the bug
We are seeing huge CPU and load spikes, which causes the entire application to crash and the api to be unavailable.
Even with load distributed across 2 containers we are seeing the same spike:
To reproduce
We make at least 1 request every 5-10 seconds to our Langfuse server. We are running 1 container with 3.75 CPUs and 15 GB of memory. We have a total trace count of 774,917.
We use LangChain to make our LLM calls:
```python
import os

from langchain_openai import AzureChatOpenAI
from langfuse import Langfuse

langfuse = Langfuse()

llm = AzureChatOpenAI(
    deployment_name="nocd-gpt4o",
    openai_api_version="2024-05-01-preview",
    openai_api_key=os.getenv("AZURE_APIM_OPENAI_GPT4O_KEY"),
    azure_endpoint=os.getenv("AZURE_APIM_OPENAI_GPT4O_HOST"),
    model="gpt-4o",
    cache=False,
)

# request, tags, session_id, and formatted_prompt come from our request handler.
trace = langfuse.trace(
    name=request.project_name,
    user_id=request.user,
    tags=tags,
    metadata=request.metadata if request.metadata else {},
    version=request.version if request.version else "1",
    session_id=session_id,
    input=formatted_prompt,
)

# Route the LangChain callbacks for this call into the trace above.
langfuse_handler = trace.get_langchain_handler()
resp = llm.invoke(formatted_prompt, config={"callbacks": [langfuse_handler]})
trace.update(output=resp)
```
SDK and container versions
Container version: 2.78.0
Python SDK version: 2.47.0
Additional information
No response
Are you interested to contribute a fix for this bug?
Yes
@reza-mohideen, thanks for opening the issue. This is very interesting. I have a few follow-up questions:
- Do you have a special usage pattern? How many traces do you ingest per minute?
- Can you share higher-granularity CPU metrics? I'd be interested in whether the CPU is that high all the time or only during certain CPU-intensive operations.
- Do you have large inputs/outputs for traces, and do you tokenize and calculate cost in Langfuse? We use tiktoken for tokenization, which is quite CPU-heavy. (Docs)
- Do you see by any chance high error rates on the APIs? Does the UI load for you or do you have high latencies there?
- Could you share server logs from when the server crashes? Do you see any crash reason?
@reza-mohideen, any additional input here would be super helpful, as we do not observe this issue in our own environments. Would love to help resolve this or otherwise close the issue.
@reza-mohideen, I would recommend upgrading to V3 (https://langfuse.com/self-hosting). This new major version contains major performance improvements across the board.