boundary Heavy load on PSQL database

Describe the bug After reaching about 1000 sessions registered in the session table the database becomes very overloaded and slow. This problem occurs both on version 0.7.4 and 0.6.2

To Reproduce Reach a number of ~ 1k in session table.

Expected behavior No heavy load on db

Additional context We have a deployment of 3 controller and about 600 worker geographically distributed. The DB is a Patroni PSQL cluster. I think that the problem is due to the large number of workers trying to check if there are any connections to close. Watching the active query on db it seems that there is slow query blocking the postgres. In detail the following query:

  with
    -- Find connections that are not closed so we can reference those IDs
    unclosed_connections as (
      select connection_id
        from session_connection_state
      where
        -- It's the current state
        end_time is null
          and
        -- Current state isn't closed state
        state in ('authorized', 'connected')
          and
        -- It's not in limbo between when it moved into this state and when
        -- it started being reported by the worker, which is roughly every
        -- 2-3 seconds
        start_time < wt_sub_seconds_from_now(10)
    ),
    connections_to_close as (
      select public_id
        from session_connection
      where
        -- Related to the worker that just reported to us
        server_id = $1
          and
        -- These are connection IDs that just got reported to us by the given
        -- worker, so they should not be considered closed.
        -- Only unclosed ones
        public_id in (select connection_id from unclosed_connections)
    )
    update session_connection
      set

 SELECT * FROM "session_state" WHERE session_id = $1 ORDER BY start_time desc LIMIT 10000

After adding some index, the load on database drastically decrease.

On session_connection table:

create index session_connection__creation_time
    on session_connection (create_time);

create index session_connection__session_id
    on session_connection (session_id);

create index session_connection__server_id
    on session_connection (server_id);

On session_connection_state table:

create index session_connection_state__start_time_state
    on session_connection_state (start_time, state);

Feb 10 '22 10:02 emanuelebosetti

I'm running a significantly smaller deployment than you, but I'm also noticing painfully slow performance. Does this manifest in your environment as sluggish CLI/data fetch? Wondering if our two observed failure modes are related.

Feb 14 '22 18:02 the-maldridge

Any word on this from any of the fine folks at Hashicorp?

Feb 28 '22 18:02 the-maldridge

Thanks for this report. We are looking at this query and a few other related queries that were identified during an investigation into a dealock issue and plan on making some changes here.

We are also aware that there is more work to be done to add indexes for performance, see #1881.

Mar 02 '22 12:03 tmessi

Interesting. It seems there is much room to speed up the database. In your opinion, what's the expected time to wait for a connection to init? Is the expectation that using web interfaces like Nomad's which make many many quick requests for backend data over boundary something that's reasonable or is that not really the intended use case for boundary?

Mar 02 '22 17:03 the-maldridge

I managed to deploy the deadlock fix in my environment (0.7.6 version), the environment is much more stable but the DB is overloaded, the moment I reach 20 concurrent sessions Postgres uses my 16 CPU cores fully (server load is at 25). Scaling it to 1000 sessions looks impossible to me. I can give a shot to your indices, see if it helps

Mar 18 '22 11:03 giulio1979

unfortunately didn't help, disabling the WH_ triggers helps, dropped my CPU usage to 16% of a single core and improved bandwidth by a factor of 3

Mar 18 '22 13:03 giulio1979

@giulio1979 Have you had more luck with more recent releases? A lot of scaling work went into 0.8 and 0.9 so it'd be good to see if you see improvement.

Sep 07 '22 17:09 jefferai

Closing as there's been no issues reported since the perf fixes that landed in 0.8 and 0.9

Dec 01 '22 21:12 covetocove

boundary boundary copied to clipboard

Heavy load on PSQL database

boundary
boundary copied to clipboard