boundary
boundary copied to clipboard
Heavy load on PSQL database
Describe the bug After reaching about 1000 sessions registered in the session table the database becomes very overloaded and slow. This problem occurs both on version 0.7.4 and 0.6.2
To Reproduce Reach a number of ~ 1k in session table.
Expected behavior No heavy load on db
Additional context We have a deployment of 3 controller and about 600 worker geographically distributed. The DB is a Patroni PSQL cluster. I think that the problem is due to the large number of workers trying to check if there are any connections to close. Watching the active query on db it seems that there is slow query blocking the postgres. In detail the following query:
with
-- Find connections that are not closed so we can reference those IDs
unclosed_connections as (
select connection_id
from session_connection_state
where
-- It's the current state
end_time is null
and
-- Current state isn't closed state
state in ('authorized', 'connected')
and
-- It's not in limbo between when it moved into this state and when
-- it started being reported by the worker, which is roughly every
-- 2-3 seconds
start_time < wt_sub_seconds_from_now(10)
),
connections_to_close as (
select public_id
from session_connection
where
-- Related to the worker that just reported to us
server_id = $1
and
-- These are connection IDs that just got reported to us by the given
-- worker, so they should not be considered closed.
-- Only unclosed ones
public_id in (select connection_id from unclosed_connections)
)
update session_connection
set
SELECT * FROM "session_state" WHERE session_id = $1 ORDER BY start_time desc LIMIT 10000
After adding some index, the load on database drastically decrease.
On session_connection
table:
create index session_connection__creation_time
on session_connection (create_time);
create index session_connection__session_id
on session_connection (session_id);
create index session_connection__server_id
on session_connection (server_id);
On session_connection_state
table:
create index session_connection_state__start_time_state
on session_connection_state (start_time, state);
I'm running a significantly smaller deployment than you, but I'm also noticing painfully slow performance. Does this manifest in your environment as sluggish CLI/data fetch? Wondering if our two observed failure modes are related.
Any word on this from any of the fine folks at Hashicorp?
Thanks for this report. We are looking at this query and a few other related queries that were identified during an investigation into a dealock issue and plan on making some changes here.
We are also aware that there is more work to be done to add indexes for performance, see #1881.
Interesting. It seems there is much room to speed up the database. In your opinion, what's the expected time to wait for a connection to init? Is the expectation that using web interfaces like Nomad's which make many many quick requests for backend data over boundary something that's reasonable or is that not really the intended use case for boundary?
I managed to deploy the deadlock fix in my environment (0.7.6 version), the environment is much more stable but the DB is overloaded, the moment I reach 20 concurrent sessions Postgres uses my 16 CPU cores fully (server load is at 25). Scaling it to 1000 sessions looks impossible to me. I can give a shot to your indices, see if it helps
unfortunately didn't help, disabling the WH_ triggers helps, dropped my CPU usage to 16% of a single core and improved bandwidth by a factor of 3
@giulio1979 Have you had more luck with more recent releases? A lot of scaling work went into 0.8 and 0.9 so it'd be good to see if you see improvement.
Closing as there's been no issues reported since the perf fixes that landed in 0.8 and 0.9