
Too many connections

Open guidev opened this issue 4 years ago • 34 comments

Hello,

I get this error on about 1% of executions...

Looks like there are no more available connections, but isn't Aurora Serverless supposed to autoscale automatically?

{
    "errorType": "Runtime.UnhandledPromiseRejection",
    "errorMessage": "BadRequestException: Too many connections",
    "reason": {
        "errorType": "BadRequestException",
        "errorMessage": "Too many connections",
        "code": "BadRequestException",
        "message": "Too many connections",
        "time": "2020-02-04T18:31:38.387Z",
        "requestId": "c0cdad58-cefe-42a8-b3f6-acf1f4bffb07",
        "statusCode": 400,
        "retryable": false,
        "retryDelay": 99.64592804784691,
        "stack": [
            "BadRequestException: Too many connections",
            "    at Object.extractError (/var/task/node_modules/aws-sdk/lib/protocol/json.js:51:27)",
            "    at Request.extractError (/var/task/node_modules/aws-sdk/lib/protocol/rest_json.js:55:8)",
            "    at Request.callListeners (/var/task/node_modules/aws-sdk/lib/sequential_executor.js:106:20)",
            "    at Request.emit (/var/task/node_modules/aws-sdk/lib/sequential_executor.js:78:10)",
            "    at Request.emit (/var/task/node_modules/aws-sdk/lib/request.js:683:14)",
            "    at Request.transition (/var/task/node_modules/aws-sdk/lib/request.js:22:10)",
            "    at AcceptorStateMachine.runTo (/var/task/node_modules/aws-sdk/lib/state_machine.js:14:12)",
            "    at /var/task/node_modules/aws-sdk/lib/state_machine.js:26:10",
            "    at Request.<anonymous> (/var/task/node_modules/aws-sdk/lib/request.js:38:9)",
            "    at Request.<anonymous> (/var/task/node_modules/aws-sdk/lib/request.js:685:12)"
        ]
    },
    "promise": {},
    "stack": [
        "Runtime.UnhandledPromiseRejection: BadRequestException: Too many connections",
        "    at process.<anonymous> (/var/runtime/index.js:35:15)",
        "    at process.emit (events.js:210:5)",
        "    at process.EventEmitter.emit (domain.js:476:20)",
        "    at processPromiseRejections (internal/process/promises.js:201:33)",
        "    at processTicksAndRejections (internal/process/task_queues.js:94:32)"
    ]
}

guidev avatar Feb 04 '20 19:02 guidev

I've never seen this before, but it's still possible if you exceed the total number of concurrent connections. How many ACUs were running and what was your concurrency?

jeremydaly avatar Feb 04 '20 19:02 jeremydaly

It was 1 ACU when it happened, then it scaled to 2 ACU.

Here's the graph for the connections...

Screenshot 2020-02-04 at 21 49 25

It's weird because there wasn't a spike in traffic, even though the connection graph suggests one...

guidev avatar Feb 04 '20 20:02 guidev

We've always experienced Aurora Serverless like this, one of the reasons we stopped using it.

EDIT: I probably should be more in-depth rather than snarky. The lag between hitting the connection limit and scaling to the next ACU level is substantial, so your application needs to be able to wait until more connections are available. We had to add try/catch loops everywhere that would catch exactly this situation and just keep slamming the database until it decided to actually scale up (this can take up to 5-10 minutes for each level).

Also another thing you need to make sure you're considering is the database automatically deciding to scale down on you even though you don't want it to. You will suddenly get disconnected on half your jobs and they need to be able to recover from that.
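The retry-until-it-scales pattern described above can be sketched roughly like this. This is a minimal illustration, not part of data-api-client; the `withRetries` helper and its parameters are hypothetical names:

```javascript
// Sketch of a retry loop for transient "Too many connections" errors while
// the cluster scales up. `withRetries` is an illustrative helper, not a
// data-api-client API.
async function withRetries(fn, { attempts = 5, baseDelayMs = 200 } = {}) {
  for (let i = 0; i < attempts; i += 1) {
    try {
      return await fn();
    } catch (err) {
      const transient = /too many connections/i.test((err && err.message) || '');
      // Rethrow immediately for non-transient errors or on the last attempt
      if (!transient || i === attempts - 1) throw err;
      // Exponential backoff with a little jitter between attempts
      const delay = baseDelayMs * 2 ** i + Math.random() * 100;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}
```

Usage would look something like `await withRetries(() => data.query('SELECT ...'))`, so each invocation backs off and retries while the cluster adds capacity instead of failing outright.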

jrgilman avatar Feb 04 '20 22:02 jrgilman

We've hit this too, at a much higher capacity limit - around 64 ACUs. And our graphed CloudWatch connections weren't anywhere near the 2k+ limit. I don't understand why this happens with the Data API though, isn't it supposed to be "connectionless"?

AndrewBarba avatar Feb 04 '20 22:02 AndrewBarba

@jrgilman, that's really interesting. Did you reach out to the AWS RDS team? I think this would be incredibly useful feedback for them.

jeremydaly avatar Feb 04 '20 22:02 jeremydaly

@AndrewBarba, I know it's not "connectionless", as it does proxy connections to your Aurora Serverless cluster. But there seem to be two separate issues here.

  1. Aurora Serverless not scaling fast enough when there are workload spikes
  2. The Data API not utilizing (or managing) those connections correctly (or at maximum utilization)

jeremydaly avatar Feb 04 '20 22:02 jeremydaly

For 1. I can't stress enough how important it is to check the magical "Force scaling the capacity" button:

Screen Shot 2020-02-04 at 5 52 54 PM

We run our largest application db on Aurora Serverless, so we see the scaling events a lot and after checking that (we got burned terribly before having it on) we see scaling events almost instantly.

For 2. you would think that since it's an HTTP API it would do something much closer to the new RDS Proxy instead of actually translating every HTTP request into a connection; that would totally defeat the purpose if it did that. My guess is you're right and it's something in between: probably not translating every request to a connection, but also not managing the pool correctly.

AndrewBarba avatar Feb 04 '20 22:02 AndrewBarba

We used to use the mysql js client directly and had the same problem (too many connections). We figured it was because we weren't managing the connections correctly, and thought the Data API was the right solution since it should manage both the connections and the autoscaling in a more "native" way...

Also another thing you need to make sure you're considering is the database automatically deciding to scale down on you even though you don't want it to. You will suddenly get disconnected on half your jobs and they need to be able to recover from that.

@jrgilman even using the Data API?


For 1. I can't stress enough how important it is to check the magical "Force scaling the capacity" button:

@AndrewBarba I've enabled it now, fingers crossed

guidev avatar Feb 04 '20 23:02 guidev

AFAIK this has been an on-going "growing pain" for Aurora Serverless. Your current connection count to your cluster will be limited by your current ACU--meaning if you've already hit max_connections for that ACU, you're stuck waiting until additional capacity is allocated.

To be clear, as far as I'm aware, the proxy fleet will simply pass the connection to the cluster, hit the "too many connections" error, and return it back to the client.

If you reached out to premium support you'd likely be met with a response telling you Serverless isn't great for bursty workloads (they may also point out that you could essentially pre-warm, or increase your ACU minimum). If you ARE aware that you're about to have a large increase in connections, you can obviously just force scale:

https://docs.aws.amazon.com/cli/latest/reference/rds/modify-current-db-cluster-capacity.html
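The same force-scale call can be made from code via the AWS SDK for JavaScript. A hedged sketch; the cluster identifier and capacity values are placeholders, and `buildScaleParams` is just an illustrative helper:

```javascript
// Pre-scaling ahead of a known burst, equivalent to the CLI call above.
// `buildScaleParams` is an illustrative helper; the identifier/capacity
// values are placeholders for your own cluster.
function buildScaleParams(clusterId, capacity) {
  return {
    DBClusterIdentifier: clusterId,
    Capacity: capacity, // must be a valid ACU step (1, 2, 4, 8, 16, ...)
    SecondsBeforeTimeout: 60, // how long to look for a safe scaling point
    TimeoutAction: 'ForceApplyCapacityChange', // force the change if none is found
  };
}

// Usage with aws-sdk v2 (not executed here):
// const AWS = require('aws-sdk');
// const rds = new AWS.RDS();
// await rds.modifyCurrentDBClusterCapacity(buildScaleParams('my-cluster', 8)).promise();
```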

Another note here: if you didn't see an increase in traffic during your increase in connection count, it's worth checking your slow query log (assuming you have it enabled). See if several queries pile up behind a blocking query, and see if you're hitting your configured wait_timeout--if you are, it may be your retries piling up. SHOW ENGINE INNODB STATUS can be useful here too--see if you have an overall increase in spin waits leading to OS waits during that period.

In any case--any large shift in connections can result in this behavior if it takes a while to add the additional ACU--but in your case you may have an underlying cause if there was no increase in traffic (generally a blocking query is what I'd expect here). This doesn't address the "too many connections" error you're hitting during your ACU changes, but investigating a cause in the connection pileup will help prevent the spike in connection count leading to the error.

I'm happy to help if you have specific questions-- just reach out to me.

kelbyenevoldLA avatar Feb 04 '20 23:02 kelbyenevoldLA

For 1. I can't stress enough how important it is to check the magical "Force scaling the capacity" button:

[...]

Force scaling is still insanely slow. Not good for burst workloads, as @kelbyenevoldLA mentioned. @jeremydaly we got the exact response that @kelbyenevoldLA mentioned. We did our best, but in the end writing our own autoscaler code was easier than trying to make any of Amazon's offerings work for our use case.

@guidev I would imagine you would run into the same problem if the DB scales down and suddenly your connection limit is too low, but I am unfamiliar with the data api (we just used a standard db connection).

jrgilman avatar Feb 05 '20 01:02 jrgilman

PM for Data API here. Thanks for raising this @jeremydaly and others. We are looking into it.

nitesmeh avatar Feb 05 '20 17:02 nitesmeh

Thanks @nitesmeh!

jeremydaly avatar Feb 05 '20 18:02 jeremydaly

Unfortunately, enabling force scaling didn't solve the issue, we have the same error rate as before...

guidev avatar Feb 06 '20 13:02 guidev

@kelbyenevoldLA Thanks for your suggestions...

We haven't been able to find any underlying problem with our queries; obviously, that doesn't mean there isn't one.

We're thinking about switching to serverless-mysql to confirm that the issue really is the Data API...

guidev avatar Feb 06 '20 13:02 guidev

@nitesmeh Am I right to assume that the data-api should pool requests more efficiently than traditional connections? If not - is there any reason to use the data api if we do not need to access from outside a VPC?

AndrewBarba avatar Feb 06 '20 19:02 AndrewBarba

@guidev Have you fixed the problem?

stefanotauriello avatar May 19 '20 05:05 stefanotauriello

@nitesmeh Am I right to assume that the data-api should pool requests more efficiently than traditional connections? If not - is there any reason to use the data api if we do not need to access from outside a VPC?

My apologies for the delay in responding. That is correct, Data API pools connections so you don't need to worry about connection management. @andrewbarba

nitesmeh avatar Jun 01 '20 20:06 nitesmeh

UPDATE: Actually, nevermind. I think my benchmarks were misattributing CPU load to the connections when it was really my queries. I've redone my benchmarks to focus on connections, and the Data API seems to be performing well for now.

What kind of load is the Data API supposed to be able to handle? I'm running some load tests, and I am very disappointed at what I'm seeing. Using Aurora Serverless with the Data API, I'm very easily generating many connections and CPU usage. My load tests are doing on average between 5-50 http requests per second, pulling around 2000 records for each request. Is this just too much load for the Data API? I was hoping the Data API would truly solve my serverless RDS connection issues, but seems not so far.

I'm also testing direct connections to my Aurora Serverless database versus the Data API, and the Data API might be using more CPU and more connections. I'm not sure what's going on.

lastmjs avatar Sep 10 '20 17:09 lastmjs

First off, thanks SO MUCH for this @jeremydaly. It's awesome!

Secondly, @nitesmeh, can you confirm that the Data API should not consume connections? (And thanks for taking your time helping out here!)

We have a very "uneven" load and have PostgreSQL Aurora Serverless set to ACU 4-64 with scaling set to "enforce". We still see error: remaining connection slots are reserved for non-replication superuser connections in the Lambda logs, even though we've moved all heavy DB operations to the Data API instead of using knex...

Scaling is triggered but way too late so hundreds of transactions fails (which are backed out to SQS).

QAnders avatar Jan 18 '21 10:01 QAnders

@QAnders, I’m glad you like it. The Data API does consume connections, but is supposed to act a bit like RDS Proxy in that you shouldn’t have to worry about it, since it’s using a connection pool on the backend. There seem to be a lot of people who have had this scaling issue, though.

jeremydaly avatar Jan 18 '21 14:01 jeremydaly

@QAnders @jeremydaly as far as my experience indicates -- each concurrent transaction via the Data API appears to need its own connection to the database. If you have a workload that uses a lot of concurrent transactions, it'll easily blow through the number of connections available.

If you can eliminate, consolidate, or reduce the number of transactions - that appears to help immensely.
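A consolidation along those lines might look like the following sketch. It assumes `data` is an initialized data-api-client instance and that its chainable transaction builder (`transaction().query()...commit()`) behaves as documented in the README; the table and column names are invented for illustration:

```javascript
// Batch several statements into ONE Data API transaction (one pooled
// connection) instead of one implicit transaction per query.
// `data` is assumed to be an initialized data-api-client instance;
// table/column names are made up.
async function saveOrder(data, order, items) {
  let tx = data
    .transaction()
    .query('INSERT INTO orders (id, total) VALUES (:id, :total)', {
      id: order.id,
      total: order.total,
    });
  for (const item of items) {
    tx = tx.query(
      'INSERT INTO order_items (order_id, sku) VALUES (:orderId, :sku)',
      { orderId: order.id, sku: item.sku }
    );
  }
  return tx
    .rollback((e) => console.error('transaction rolled back:', e))
    .commit(); // one transaction, explicitly committed at the end
}
```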

jonathannen avatar May 03 '21 22:05 jonathannen

@jonathannen Thanks for the update on the transactions. I'm currently using TypeORM with the typeorm-aurora-data-api-driver I believe it's creating a new transaction for every query it does. I haven't tried yet but I was thinking about creating a single transaction for every lambda invocation. Do you know if there are any limits on the number of queries you can have in a transaction? Or do you see any other pitfalls with that approach?

seawatts avatar Jul 15 '21 17:07 seawatts

Here to resurrect this issue -- we're regularly seeing this Too many connections issue, which is causing outages for our app.

After investigating, I see that even under a small load, the number of open connections to the Aurora Serverless cluster explodes, and then the connections are kept alive in a sleep state for long periods of time.

What makes our setup different is we're also using AppSync RDS resolvers for our GraphQL API separately from our lambdas which are handling webhooks and using the Data API.

@nitesmeh and anyone else -- could it be that the Data API's connection pooling mechanism is using up all our connections and not leaving enough for the AppSync RDS resolvers, and that this is what's causing the issues? Any insight?

ortonomy avatar Jan 18 '22 06:01 ortonomy

@ortonomy We have that exact same issue after increasing our use of AppSync!

We have one express based server (Elastic Beanstalk) which is still using native DB connection to PG (the Aurora Serverless cluster) but it is limited to a max. of 10 connections. We have a bunch of Lambdas, all using the Data-API and now four AppSync API's. Ever since we added the AppSync API's we run into a lot more Too many connections issues... :(

I've had a few chats with AWS support (they are helpful and try) but no real solution as of yet. They (kind of) agree with my assumption that a lot of updates/inserts prevents the cluster from scaling, and that it only scales once the updates/inserts are all committed (but then it's too late). We've rebuilt some of the functions that hit the hardest to output queries to an SQS FIFO queue, and then we go through that queue in a more controlled fashion...

This is very annoying and I hope that v2 of Aurora Serverless fixes this (although our AWS contact and the support are very tight-lipped about it, even about when it's going to be released for Postgres...)

QAnders avatar Jan 24 '22 07:01 QAnders

@QAnders and anyone else that cares -- I got a reply from AWS premium support and they have been able to repro the issue:

[screenshot of the AWS support response]

will update when I get a response!

ortonomy avatar Jan 25 '22 01:01 ortonomy

Hi @ortonomy, I have the same problem, have you received any updates from AWS support?

desoss avatar Feb 28 '22 10:02 desoss

Thanks for this issue writeup, @guidev, @jeremydaly and others. We've recently moved to Aurora Serverless (Postgres) and have seen the same issues. We had an ingestion task doing parallel inserts via Lambdas and the Data API, and it was hitting connection limits. We've put an SQS FIFO queue in place now to throttle it down, which has helped, however the performance we're seeing is still pretty poor vs native clients. It certainly seems not to be using connections optimally, and the CPU utilisation is higher than I would expect. We've also recently started using it behind AppSync, so we'll keep an eye out for the issue that @ortonomy raised as well.

@nitesmeh, do you have any updates on this Data API connections issue? If there's an imminent Data API being released for Aurora Serverless v2 that'd be great too - it's the main reason we've not moved to v2 yet (bonus marks if it solves the 1MB data limit!). Thanks for your efforts - the Data API has huge potential when some of these issues are resolved.

aschafs avatar Jun 28 '22 11:06 aschafs

I've had similar issues with Serverless RDS and the Data API. We were using knex with the knex-aurora-data-api-client NPM driver, which uses this data-api-client. We were constantly hitting the Data API limit of 500 connections. You can have any size of Serverless RDS ACU, but the Data API can only use 500 connections, so having more than 8 ACUs is not going to change anything about the connection-count problem.

Our solution was to properly close transactions with the commitTransaction function. We were using a lot of transactions, and knex-aurora-data-api-client's implementation of "start transaction" was properly calling the Data API function, but its "commit transaction" was only sending a SQL COMMIT command and not calling data-api-client's commitTransaction function. The Data API waits 5 minutes for uncommitted transactions to end. The data in the database was already committed (thanks to the COMMIT command), but the Data API did not consider the transaction committed, because it requires an explicit commit or rollback call (and you don't need to send a SQL COMMIT at all). After changing the implementation, our problem of hitting 500 connections was solved immediately, and now we are at a maximum of 20 connections. Hopefully this helps somebody. I don't know what AppSync is using.
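The fix described above can be sketched against the low-level RDSDataService API from aws-sdk v2 (the client is injected here so the logic is testable; this is an illustration of the begin/execute/commit flow, not the actual knex-aurora-data-api-client patch, and the ARNs in `base` are placeholders):

```javascript
// Run statements in one Data API transaction, ending with an explicit
// CommitTransaction call rather than a SQL "COMMIT". The explicit commit
// releases the pooled connection right away instead of leaving it held
// until the Data API's ~5 minute transaction timeout.
// `client` is an RDSDataService-like object; `base` holds placeholder ARNs:
// { resourceArn, secretArn, database }.
async function runInTransaction(client, base, statements) {
  const ids = { resourceArn: base.resourceArn, secretArn: base.secretArn };
  const { transactionId } = await client
    .beginTransaction({ ...ids, database: base.database })
    .promise();
  try {
    for (const sql of statements) {
      await client
        .executeStatement({ ...ids, database: base.database, sql, transactionId })
        .promise();
    }
    // Explicit Data API commit -- do NOT just send "COMMIT" as SQL
    return await client.commitTransaction({ ...ids, transactionId }).promise();
  } catch (err) {
    await client.rollbackTransaction({ ...ids, transactionId }).promise();
    throw err;
  }
}
```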

@aschafs I don't think they will have the Data API for Aurora Serverless v2, as they have the newer RDS Proxy, which is already supported in v2. So my opinion is that the Data API will be deprecated with v1, and RDS Proxy is the solution for the 1MB data limit. I am preparing to move to Aurora Serverless v2 with RDS Proxy in the near future.

deel77 avatar Jun 28 '22 20:06 deel77

Well, this is a farce from AWS... We found the same commit issue @deel77 describes some time ago (sorry for not thinking to report it here)...

We were eagerly awaiting Serverless v2 (and were part of the beta). I tested it thoroughly and it was scaling nicely, even with a lot of open transactions. At launch (GA) we got the information that the Data API won't be added in v2...

We've moved a lot of our backend workloads over to AppSync for the sole purpose of being able to query DBs directly using VTL, and because AppSync uses the Data API to "smooth out" the peaks.

It's not working! Several times a week we hit "too many connections", and as @deel77 found, there's no way around it: 500 is the max and it's not possible to raise it. None of these limitations were mentioned in the "architectural meetings" with AWS experts prior to moving to this solution...

So, for us, the only viable solution is moving to Serverless v2 (or standard RDS), but as we have a very uneven load, Serverless would make more sense. However, v2 is quite costly, and of course has no Data API... We'd have to rebuild all our AppSync APIs to use Lambdas, and then why use AppSync at all?

AWS really dropped the ball on this one!

QAnders avatar Jun 29 '22 08:06 QAnders

@deel77:

[...]

What did your errors look like for Aurora Serverless v2 PostgreSQL, if you don't mind my asking? I experienced the following myself:

An error occurred (ThrottlingException) when calling the ExecuteStatement operation (reached max retries: 2): Concurrent connections limit exceeded

... despite there only being a handful of active processes in pg_stat_activity and ultimately opted to abandon the Data API entirely

obataku avatar Jul 03 '22 03:07 obataku

cc @bashshak @nitesmeh and whoever else from the RDS Data API team; also, as an aside, does the Data API internally leverage RDS Proxy at all, or does it use its own separate, Data API-specific connection pooling?

obataku avatar Jul 03 '22 15:07 obataku

@obataku From the discussions I've had with AWS, and from what I can "reveal", it would seem that Aurora Serverless v2 is a complete rebuild of Aurora (non-Serverless) with the "serverless" part from v1 added on... so, kind of what Aurora Serverless should have been from the beginning...

The Data API might of course share code with RDS Proxy, but as I understand it, the Data API is completely ingrained into v1 and it won't be possible to "port" it over to v2.

This is a disaster for those of us who have moved to AppSync and have a micro-service architecture with a lot of databases, where AppSync and VTL with direct DB connections was a superb solution!

So, what do we do now...?

We have a crappy DB solution that won't scale, and we are capped at 500 connections, which is hampering us, forcing us to build workarounds and to sweat each day that we hit the max...

Or we have a shiny new (and expensive) DB that is doing its job, but we can't continue using AppSync... or, rather, we'd have to rebuild a lot, throw all our VTL out the window and add Lambdas (and in that case why use AppSync at all, as we are not a "GraphQL" shop)

QAnders avatar Jul 04 '22 09:07 QAnders

Thanks @deel77 and @QAnders for the feedback. Deprecating the Data API would certainly be disappointing for us too. I'd love to get an official AWS view on it as it does strike me as a step backwards. Like yourselves, we were very excited about coupling AppSync with the Data API though now we may hold off moving in that direction. We might look at RDS Proxy though the appeal of Data API was its REST-based API.

If @nitesmeh or anyone from AWS has any updates that would be great. We'd love to use this pattern and Aurora Serverless generally but it doesn't seem fit-for-purpose for our use cases at the moment.

aschafs avatar Jul 15 '22 06:07 aschafs