
bug: Self hosted container crashes due to random CPU spikes

Open reza-mohideen opened this issue 1 year ago • 1 comments

Describe the bug

We are seeing huge CPU and load spikes, which cause the entire application to crash and the API to become unavailable. (Screenshots attached: CPU utilization and load graphs, 2024-09-06.)

Even with load distributed across two containers, we see the same spikes. (Screenshot attached, 2024-09-06.)

To reproduce

We make at least one request every 5-10 seconds to our Langfuse server. We are running one container with 3.75 CPUs and 15 GB of memory. Our total trace count is 774,917.

We use LangChain to make our LLM calls:

# Cleaned-up reproduction snippet. `request`, `formatted_prompt`, `tags`, and
# `session_id` come from the surrounding request handler and are not shown here.
import os

from langchain_openai import AzureChatOpenAI  # or langchain.chat_models on older LangChain versions
from langfuse import Langfuse

# Client init not shown in the original snippet; assumes credentials via LANGFUSE_* env vars.
langfuse = Langfuse()

llm = AzureChatOpenAI(
    deployment_name="nocd-gpt4o",
    openai_api_version="2024-05-01-preview",
    openai_api_key=os.getenv("AZURE_APIM_OPENAI_GPT4O_KEY"),
    azure_endpoint=os.getenv("AZURE_APIM_OPENAI_GPT4O_HOST"),
    model="gpt-4o",
    cache=False,
)

# One trace per request, with a LangChain callback handler scoped to it.
trace = langfuse.trace(
    name=request.project_name,
    user_id=request.user,
    tags=tags,
    metadata=request.metadata if request.metadata else {},
    version=request.version if request.version else "1",
    session_id=session_id,
    input=formatted_prompt,
)

langfuse_handler = trace.get_langchain_handler()

resp = llm.invoke(formatted_prompt, config={"callbacks": [langfuse_handler]})

trace.update(output=resp)
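
Not part of the original handler code: a minimal sketch of the request cadence described under "To reproduce" (at least one call every 5-10 seconds), with run_one_request as a hypothetical stand-in for the trace + invoke code above.

import random
import time

def run_one_request() -> None:
    # Hypothetical stand-in for the trace / llm.invoke / trace.update flow shown above.
    ...

while True:
    run_one_request()
    time.sleep(random.uniform(5, 10))  # at least one request every 5-10 seconds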

SDK and container versions

Container version: 2.78.0
Python SDK version: 2.47.0

Additional information

No response

Are you interested to contribute a fix for this bug?

Yes

reza-mohideen avatar Sep 06 '24 22:09 reza-mohideen

@reza-mohideen, thanks for opening the issue. This is very interesting. I have a few follow-up questions:

  • Do you have a special usage pattern? How many traces do you ingest per minute?
  • Can you share higher-granularity CPU metrics? I'd be interested to know whether the CPU is that high all the time or only during certain CPU-intensive operations.
  • Do you have large inputs/outputs on your traces, and do you tokenize and calculate cost in Langfuse? We use tiktoken for tokenization, which is quite CPU-heavy (Docs); see the sketch after this list.
  • Do you by any chance see high error rates on the APIs? Does the UI load for you, or do you see high latencies there?
  • Could you share server logs from around the crashes? Do you see any crash reason?
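
To illustrate the tokenization point, here is a rough micro-benchmark sketch (the encoding name and sample size are assumptions, not taken from this issue; requires a tiktoken version that ships o200k_base):

import time

import tiktoken

# o200k_base is the encoding used by gpt-4o; swap in the encoding for your model.
enc = tiktoken.get_encoding("o200k_base")

# Stand-in for a large trace input/output (~600 kB of text).
sample = "lorem ipsum " * 50_000

start = time.perf_counter()
tokens = enc.encode(sample)
elapsed = time.perf_counter() - start

print(f"tokenized {len(tokens)} tokens in {elapsed:.3f}s")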

maxdeichmann avatar Sep 23 '24 14:09 maxdeichmann

@reza-mohideen, any additional input here would be super helpful, as we do not observe this issue in our own environments. We would love to help resolve this or otherwise close the issue.

marcklingen avatar Oct 28 '24 08:10 marcklingen

@reza-mohideen, I would recommend upgrading to v3 (https://langfuse.com/self-hosting). This new major version contains significant performance improvements across the board.

maxdeichmann avatar Dec 11 '24 14:12 maxdeichmann