Langflow Fails to Load Admin/User Information and Flows When Running on Multiple Replicas in AWS ECS

Open anustupdas opened this issue 10 months ago • 5 comments

Langflow Deployment on AWS ECS: Multi-Replica Issue

1. Overview of the Setup

This document outlines the Langflow deployment on AWS ECS, the configurations used, and the issues encountered when scaling the number of replicas. The following sections describe the setup and problem in detail.


2. Langflow Deployment Architecture

The Langflow service is deployed on AWS ECS with the following architecture and configurations:

Core Components

  1. PostgreSQL Database

    • Hosted on AWS RDS.
    • Configuration:
      Engine: PostgreSQL 16.3
      Instance Class: db.t3.micro
      Allocated Storage: 100 GB
      Multi-AZ: Disabled
      Storage Encrypted: Enabled
      Connection Limit: Default
      
  2. ECS Service

    • Service Name: staging-langflow-service
    • Task Definition:
      - Container Image: staging-langflow-service:1.1.0
      - CPU: 1024
      - Memory: 2048 MB
      - Ports:
        - 80 (HTTP)
        - 5000 (Langflow backend)
      - Environment Variables:
        - LANGFLOW_ENABLE_AUTH: true
        - LANGFLOW_DATABASE_URL: postgresql://langflow_test:langflow_test@langflow-tst-db.cwpopkfmn49i.eu-west-1.rds.amazonaws.com:5432/langflow_tst
        - LANGFLOW_SUPERUSER: admin
        - LANGFLOW_SUPERUSER_PASSWORD: password
      
    • CloudWatch Logs: Enabled
    • Desired Count: 1 (by default)
  3. Load Balancer

    • Application Load Balancer (ALB) with HTTPS listener.
    • Target group for routing traffic to ECS tasks.
    • DNS Entry: https://staging-langflow-service.data.studocu.com
  4. Route 53 DNS

    • Subdomain configured for accessing the service.


3. Problem Description

The Langflow service works perfectly with 1 replica. However, when the number of replicas is increased to more than 1, the following issues occur:

Observed Issues

  1. Flow Data Loading Failure
    • Error message: "Could not load flows from database."
    • Flows do not appear in the UI.

  2. Admin Page Information Not Loading
    • Admin page shows no registered users, even though users are present in the database.

  3. Error in Database Connectivity
    • Logs indicate failed or inconsistent database queries.
    • PostgreSQL connection metrics show possible connection pooling limits being exceeded.

  4. Session Inconsistencies
    • Users experience inconsistent session states, potentially caused by lack of session persistence across replicas.

4. Relevant Terraform Configurations

RDS Configuration (Terraform)

resource "aws_db_instance" "langflow_tst_db" {
  engine                                = "postgres"
  engine_version                        = "16.3"
  instance_class                        = "db.t3.micro"
  allocated_storage                     = 100
  multi_az                              = false
  storage_encrypted                     = true
  username                              = "langflow_test"
  password                              = "langflow_test"
  db_name                               = "langflow_tst"
  vpc_security_group_ids                = ["sg-0dbb1812e1e58e8c4"]
}

anustupdas avatar Jan 31 '25 15:01 anustupdas

My question now: is the database configuration too small, and is that causing the database connection errors?

anustupdas avatar Jan 31 '25 15:01 anustupdas

Hey there, @anustupdas! I'm here to help you with any bugs, questions, or contributions you might have. Let's tackle this together!

The issues you're experiencing with Langflow when running multiple replicas on AWS ECS could be related to several factors, including database connection limits and session management.

  1. Database Connection Limits: The PostgreSQL database hosted on AWS RDS has a max_connections setting of 81. If your application is exceeding this limit, it could lead to connection failures. You might want to consider increasing the max_connections setting on your RDS instance if possible, or optimizing your connection pooling settings in Langflow. The recommended settings for connection pooling in a multi-replica environment are:

    db_connection_settings: dict | None = {
        "pool_size": 10,
        "max_overflow": 20,
        "pool_timeout": 30,
        "pool_pre_ping": True,
    }
    

    These settings help manage the number of connections efficiently [1].

  2. Session Inconsistencies: The session inconsistencies you're observing might be due to the lack of distributed session storage. Langflow's session management details are not explicitly documented, but storing sessions in shared storage (like Redis) can help maintain session consistency across replicas.

  3. Load Balancing and Sticky Sessions: Ensure that your load balancer is configured for sticky sessions if your application requires session persistence. Sticky sessions route each client to the same replica on every request, so session state held by that replica stays consistent for that user.

  4. Resource Allocation: Consider reviewing the resource allocation for your ECS tasks and your RDS instance. The current db.t3.micro database instance might be insufficient under load, especially with multiple replicas. Scaling up the instance class or optimizing the application to use resources more efficiently could help.

By addressing these areas, you should be able to mitigate the issues you're facing with Langflow in a multi-replica setup.
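
A rough way to sanity-check the connection-limit theory is to compare the worst-case pool usage against max_connections. The sketch below assumes each replica runs one worker process with a single SQLAlchemy-style pool, which the thread does not confirm. Such a pool can open up to pool_size + max_overflow connections. The names PoolSettings, peak_connections and max_safe_replicas are hypothetical helpers and not part of Langflow.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PoolSettings:
    """Pool values quoted in the comment above; adjust to your deployment."""
    pool_size: int = 10
    max_overflow: int = 20


def peak_connections(replicas: int, settings: PoolSettings,
                     workers_per_replica: int = 1) -> int:
    """Worst-case number of connections all replicas could open at once."""
    per_pool = settings.pool_size + settings.max_overflow
    return replicas * workers_per_replica * per_pool


def max_safe_replicas(max_connections: int, settings: PoolSettings,
                      workers_per_replica: int = 1, reserved: int = 0) -> int:
    """Largest replica count whose worst case still fits under the limit.

    `reserved` is an assumed allowance for admin or monitoring sessions.
    """
    per_replica = workers_per_replica * (settings.pool_size + settings.max_overflow)
    return max(0, (max_connections - reserved) // per_replica)


if __name__ == "__main__":
    settings = PoolSettings()
    limit = 81  # max_connections value reported in this thread
    for replicas in (1, 2, 3):
        peak = peak_connections(replicas, settings)
        status = "over limit" if peak > limit else "within limit"
        print(f"{replicas} replica(s): up to {peak} connections ({status})")
```

Under these assumptions, two replicas could hold up to 60 connections and three could request up to 90, which exceeds the reported limit of 81. If each replica runs several worker processes, the limit is reached with fewer replicas.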

dosubot[bot] avatar Jan 31 '25 15:01 dosubot[bot]

Hi @anustupdas, I am facing a similar issue. Were you able to solve it?

ManjiriN avatar Feb 18 '25 08:02 ManjiriN

Hi, @anustupdas. I'm Dosu, and I'm helping the langflow team manage their backlog. I'm marking this issue as stale.

Issue Summary:

  • The issue involves Langflow deployment on AWS ECS, where scaling to multiple replicas causes failures in loading admin/user information and flows.
  • Potential causes identified include database connection limits and session management issues.
  • Suggested solutions include increasing PostgreSQL max_connections, optimizing connection pooling, using Redis for session storage, configuring sticky sessions, and reviewing ECS task resource allocation.
  • @ManjiriN also reported facing a similar issue and inquired about a solution.

Next Steps:

  • Please let me know if this issue is still relevant to the latest version of the langflow repository by commenting here.
  • If no updates are provided, the issue will be automatically closed in 7 days.

Thank you for your understanding and contribution!

dosubot[bot] avatar May 20 '25 16:05 dosubot[bot]

I worked around the issue with the suggested load balancing and sticky sessions solution.

But is there an actual solution provided by Langflow itself? I am not aware of one.
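
The thread does not show the exact configuration used for this workaround. As a minimal sketch, assuming the ALB target group from the setup above and cookie-based stickiness generated by the load balancer, the target group attributes could be set with boto3 roughly as follows. The function names and the one-day default duration are illustrative assumptions, and the target group ARN must be supplied by the caller.

```python
def stickiness_attributes(duration_seconds: int = 86400) -> list[dict[str, str]]:
    """Build ALB target group attributes for load-balancer cookie stickiness."""
    # ALB accepts durations from 1 second up to 7 days (604800 seconds).
    if not 1 <= duration_seconds <= 604800:
        raise ValueError("duration_seconds must be between 1 and 604800")
    return [
        {"Key": "stickiness.enabled", "Value": "true"},
        {"Key": "stickiness.type", "Value": "lb_cookie"},
        {"Key": "stickiness.lb_cookie.duration_seconds",
         "Value": str(duration_seconds)},
    ]


def enable_stickiness(target_group_arn: str, duration_seconds: int = 86400) -> None:
    """Apply the attributes to an existing target group (needs AWS credentials)."""
    import boto3  # imported lazily so the helper above works without boto3

    client = boto3.client("elbv2")
    client.modify_target_group_attributes(
        TargetGroupArn=target_group_arn,
        Attributes=stickiness_attributes(duration_seconds),
    )
```

Stickiness pins each browser to one ECS task, so it hides per-replica state differences without removing them. A user is still moved to another task when the original one is stopped or replaced.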

anustupdas avatar May 25 '25 07:05 anustupdas

Maybe related: https://github.com/langflow-ai/langflow/pull/9695

Empreiteiro avatar Sep 15 '25 18:09 Empreiteiro