nodejs-spanner
Random timeout/deadline exceeded unhandled rejections
Environment details
- OS: Cloud Functions environment (Ubuntu 18.04)
- Node.js version: NodeJS 16
- npm version: Not sure, the one used by Cloud Functions
- @google-cloud/spanner version: 6.1.2, but the issue has been happening since versions 5.x
Steps to reproduce
Run a Cloud Function using the Spanner client with an opened database, even without making requests.
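For reference, the setup is roughly the following (a sketch; the instance, database, and function names are placeholders, not my actual code):
import * as functions from "@google-cloud/functions-framework";
import { Spanner } from "@google-cloud/spanner";

// The database handle is created at module scope, so its session pool and
// keep-alive timer outlive individual invocations and keep running while the
// instance sits idle.
const database = new Spanner().instance("my-instance").database("my-database");

functions.http("myFunction", async (req, res) => {
  // Business requests themselves complete fine; the unhandled rejections
  // described below surface between invocations, while the instance is idle.
  const [rows] = await database.run("SELECT 1");
  res.status(200).send(`rows: ${rows.length}`);
});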
Context
I've been using the Node.js Spanner client on several projects for about a year now, and I've consistently noticed DEADLINE_EXCEEDED and other related errors that are not directly linked to my requests, usually when the client hasn't been used for several minutes. By that I mean that these errors surface as unhandled rejections in Node.js. Cloud Functions makes this more visible because unhandled rejections make the functions framework crash and log the error.
This issue occurs in many of my Cloud Functions performing various unrelated tasks with Spanner. Again, business code is never impacted, as the unhandled rejections usually occur when the Cloud Function has been inactive for several minutes (or even several hours). This also means that the bug is much more visible when the minimum number of instances is set to >= 1, as the Cloud Function instances stay idle longer (or even indefinitely).
I see no reason why this wouldn't happen in other environments, but I've mainly been using Cloud Functions lately and can only testify for the behaviour in this context.
All Cloud Functions show fairly low activity (most of the time they are stuck at min_instances: 1 and sometimes reach 5 instances). The Cloud Spanner instance is running with 200 processing units and is underutilised (it never goes above 10% CPU).
The problem here is really about the rejections not being caught by the client. I'm not really worried about the errors themselves as they never seem to occur when running business operations/transactions (or at least they are being caught and properly retried in this case).
My only lead would be the keepAlive mechanism, as it uses setInterval and periodically makes requests to Spanner, especially when sessions aren't being used. However, after looking at the code I haven't seen anything obvious.
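To illustrate what I mean, here's a self-contained toy (not the library's actual code; the session object and the hard-coded error message are made up) showing how a background ping whose promise is never awaited turns into an unhandled rejection even though no application code is running:
// Toy example: a setInterval-based "keep-alive" whose ping rejects.
const fakeSession = {
  keepAlive: () => Promise.reject(new Error("4 DEADLINE_EXCEEDED: Deadline exceeded")),
};

// Ping the "session" every second without handling the returned promise.
const timer = setInterval(() => {
  fakeSession.keepAlive(); // no .catch() here -> unhandled rejection
}, 1000);

process.on("unhandledRejection", (reason) => {
  console.log("Unhandled rejection from a background ping:", reason);
  clearInterval(timer);
});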
Stack traces
Unfortunately, as these are unhandled rejections, the stack traces are usually not very helpful, but one of them clearly references the Spanner client. To give you a rough idea of the rate of occurrence, the issue has happened 33 times in 10 days across around 100 Cloud Functions, most of which have min_instances: 1 set. Out of the 33 errors, 4 are due to CANCELLED, 3 are Total timeout of API..., and the rest are DEADLINE_EXCEEDED.
Unhandled rejection
Error: Total timeout of API google.spanner.v1.Spanner exceeded 60000 milliseconds before any response was received.
at repeat (/workspace/node_modules/google-gax/build/src/normalCalls/retries.js:66:31)
at Timeout._onTimeout (/workspace/node_modules/google-gax/build/src/normalCalls/retries.js:101:25)
at listOnTimeout (node:internal/timers:559:17)
at processTimers (node:internal/timers:502:7)
Error: Process exited with code 16
at process.<anonymous> (/workspace/node_modules/@google-cloud/functions-framework/build/src/invoker.js:92:22)
at process.emit (node:events:527:28)
at process.emit (node:domain:475:12)
at process.exit (node:internal/process/per_thread:189:15)
at sendCrashResponse (/workspace/node_modules/@google-cloud/functions-framework/build/src/logger.js:44:9)
at process.<anonymous> (/workspace/node_modules/@google-cloud/functions-framework/build/src/invoker.js:88:44)
at process.emit (node:events:527:28)
at process.emit (node:domain:475:12)
at emit (node:internal/process/promises:140:20)
at processPromiseRejections (node:internal/process/promises:274:27)
Unhandled rejection
Error: 4 DEADLINE_EXCEEDED: Deadline exceeded
at Object.callErrorFromStatus (/workspace/node_modules/@grpc/grpc-js/build/src/call.js:31:26)
at Object.onReceiveStatus (/workspace/node_modules/@grpc/grpc-js/build/src/client.js:189:52)
at /workspace/node_modules/@grpc/grpc-js/build/src/call-stream.js:111:35
at Object.onReceiveStatus (/workspace/node_modules/grpc-gcp/build/src/index.js:73:29)
at InterceptingListenerImpl.onReceiveStatus (/workspace/node_modules/@grpc/grpc-js/build/src/call-stream.js:106:23)
at Object.onReceiveStatus (/workspace/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:365:141)
at Object.onReceiveStatus (/workspace/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:328:181)
at /workspace/node_modules/@grpc/grpc-js/build/src/call-stream.js:187:78
at processTicksAndRejections (node:internal/process/task_queues:78:11)
at runNextTicks (node:internal/process/task_queues:65:3)
// Same Error: Process exited with code 16 coming from the functions-framework.
Unhandled rejection
Error: 1 CANCELLED: Call cancelled
at Object.callErrorFromStatus (/workspace/node_modules/@grpc/grpc-js/build/src/call.js:31:26)
at Object.onReceiveStatus (/workspace/node_modules/@grpc/grpc-js/build/src/client.js:189:52)
at /workspace/node_modules/@grpc/grpc-js/build/src/call-stream.js:111:35
at Object.onReceiveStatus (/workspace/node_modules/grpc-gcp/build/src/index.js:73:29)
at InterceptingListenerImpl.onReceiveStatus (/workspace/node_modules/@grpc/grpc-js/build/src/call-stream.js:106:23)
at Object.onReceiveStatus (/workspace/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:365:141)
at Object.onReceiveStatus (/workspace/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:328:181)
at /workspace/node_modules/@grpc/grpc-js/build/src/call-stream.js:187:78
at processTicksAndRejections (node:internal/process/task_queues:78:11)
// Same Error: Process exited with code 16 coming from the functions-framework.
Hi @flovouin, you mentioned the keepAlive mechanism in sessions. Do these errors disappear when you configure the session pool to have a lower keepAlive value?
Hi @asthamohta, thank you for your response.
I haven't tried lowering the keepAlive value; I'll do that.
I did, however, lower the number of sessions to avoid too many unused sessions. Indeed, each Cloud Function (V1) instance only handles one request at a time, which in my case is basically one transaction at a time.
I've set { min: 5, max: 20, incStep: 5 }, down from 25/100/25. (By the way, it looks like the documentation doesn't match the default values.)
I haven't changed the maxIdle: 1 parameter because, from what I understand, min will always enforce keeping more sessions anyway.
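For reference, this is roughly how I pass the options (placeholder instance/database names; the second argument to database() is the session pool configuration):
import { Spanner } from "@google-cloud/spanner";

// Reduced pool for a Cloud Functions (V1) instance that serves one request at a time.
const database = new Spanner().instance("my-instance").database("my-database", {
  min: 5,
  max: 20,
  incStep: 5,
  // maxIdle left at its default of 1, since `min` keeps sessions around anyway.
});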
This might have lowered the number of errors, but it's only speculation for now. I ran this configuration over the past few days and had a single error... 3 days after the last request to the Cloud Function:

Thanks for pointing out the error in documentation, will fix that.
What I'm actually trying to figure out by decreasing the keepAlive value is whether the background tasks that keep the sessions alive somehow get killed in Cloud Functions, which would explain this error. Can you let me know if simply decreasing the keepAlive value solves the issue?
I set keepAlive to 15 minutes (down from the default of 30), but it did not improve the situation. It actually looks like it worsened it, which could make sense: decreasing keepAlive means that pings occur more often, which could increase the probability of getting an error during the keep-alive request.
Here's an example of what happened last night for a Cloud Function with min_instances: 1. The Cloud Function was deployed and never called. However, many hours after deployment, the unhandled rejection occurred several times:

It did not happen for any other Cloud Functions (tens of them) around the same time. However, as previously stated, the error does occur for any Cloud Function (at other times) independently of the business logic it implements.
I don't know if these errors are "expected" from time to time from Spanner or not, but the real problem for me is that the rejection is not handled, meaning it creates a lot of noise and errors in our logs.
Okay so I've got a reproducible way to generate unhandled rejections. Create a test instance/database in a GCP project and allow a service account to read from it. Run the following code (it can be run locally with the service account credentials):
import { Spanner } from "@google-cloud/spanner";

async function test() {
  try {
    // Creating the database handle is enough to start the session pool and
    // its keep-alive pings; no query is ever issued.
    const database = new Spanner().instance("test").database("test", {
      keepAlive: 0.1, // Pings the sessions every 6 seconds.
      idlesAfter: 0.01, // Ensures newly created sessions are pinged.
    });
    // Keep the process alive for 10 minutes.
    await new Promise((resolve) => setTimeout(resolve, 600000));
  } catch (error) {
    // Never reached: the failing pings happen in the background.
    console.log(`Caught error: ${error.stack}`);
  }
}

test();

process.on("unhandledRejection", (reason) => {
  console.log("This was not caught anywhere:");
  console.dir(reason, { depth: 10 });
});
As soon as the code has started, remove the permissions from the service account. It may take up to a minute, but eventually pings on the sessions will fail and this will result in unhandled rejections:
Error: 7 PERMISSION_DENIED: Caller is missing IAM permission spanner.sessions.delete on resource projects/my-project/instances/test/databases/test.
at Object.callErrorFromStatus (/test-spanner/node_modules/@grpc/grpc-js/build/src/call.js:31:26)
at Object.onReceiveStatus (/test-spanner/node_modules/@grpc/grpc-js/build/src/client.js:189:52)
at /test-spanner/node_modules/@grpc/grpc-js/build/src/call-stream.js:111:35
at Object.onReceiveStatus (/test-spanner/node_modules/grpc-gcp/build/src/index.js:73:29)
at InterceptingListenerImpl.onReceiveStatus (/test-spanner/node_modules/@grpc/grpc-js/build/src/call-stream.js:106:23)
at Object.onReceiveStatus (/test-spanner/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:365:141)
at Object.onReceiveStatus (/test-spanner/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:328:181)
at /test-spanner/node_modules/@grpc/grpc-js/build/src/call-stream.js:187:78
at processTicksAndRejections (node:internal/process/task_queues:78:11) {
code: 7,
details: 'Caller is missing IAM permission spanner.sessions.delete on resource projects/my-project/instances/test/databases/test.',
metadata: Metadata {
internalRepr: Map(1) { 'grpc-server-stats-bin' => [Array] },
options: {}
},
note: 'Exception occurred in retry method that was not classified as transient'
}
Obviously, the error itself is legitimate. However I'd argue that it should be handled in the session pool logic as there's no way to catch it outside the Spanner client. My guess is that something similar occurs with the errors I'm seeing on real use cases.
This makes sense. Give me some time to figure out which part of the code the error should get caught in and I will add code to catch this. I am going to be on vacation for a week but I will look into this when I get back.
@flovouin Have you found a good workaround here? I am finding this happens to my Cloud Functions (all have min instances > 1) and it is within business-critical paths: requests directly from the frontend that just time out. It seems to happen consistently until I redeploy the function.
Also, nice find about the defaults not matching the documentation. I was about to start playing with those parameters and hadn't noticed the defaults don't match up.
I was going to add something like closing and recreating the Spanner connection after a certain amount of time, or on every request, to try to solve this, but I'm not sure that's the best course of action.
FWIW I'm actually using cloud functions V2.
Hey @dannnnthemannnn, no, unfortunately I don't have a proper workaround. I looked a bit at the code for session management / the session pool, and indeed I was thinking I could force the recreation of the sessions before the problem occurs, or something like that.
However, it looks like there are two problems here:
- On my side, the errors don't occur on the critical path; the problem is that I can't catch them. They cause a lot of noise in our logs and alerting policies, and make the Cloud Functions crash.
- The fact that these errors occur so frequently, and for you it's even worse as they happen while responding to clients. It's a bit suspicious that a service like Spanner would time out so frequently, but I don't have much heavy production experience with it. Maybe this is worth a support ticket?
@flovouin Can you please share the grpc-js version used in your Cloud Function?
@surbhigarg92 The issue has been happening for a while now, so probably a lot of versions of grpc-js have been used. Currently I'm using the latest (1.6.8) and the errors still occur. I'm not depending on grpc-js directly though, only through other GCP dependencies like the Spanner client.
Hi @flovouin, just trying a couple of ideas here: can you please try increasing the timeout limit in the function details (the maximum is 540 seconds, the default is 60 seconds)? This is how you can do it:
/**
 * Set timeout and memory allocation.
 * In some cases, your functions may have special requirements for a long
 * timeout value or a large allocation of memory.
 * You can set these values either in the Google Cloud Console or in the
 * function source code (Firebase only).
 */
const runtimeOpts = {
  timeoutSeconds: 540, // 9 minutes
  memory: '2GB',
};

/**
 * When a csv file is uploaded in the Storage bucket it is parsed automatically
 * using PapaParse.
 */
exports.covertToJSON = functions
  .runWith(runtimeOpts) // <== apply timeout and memory limit here!
  .storage
  .object()
  .onFinalize((object) => {
    [...]
  });
For me, when this happens, it is on a function that usually takes <1 sec, and suddenly it times out at 60 seconds consistently until I redeploy the function. I have a log alert set up to notify me, I redeploy the function and it works OK, but until I redeploy, our app is just broken.
Hi all, just wanted to ping here again. This is completely breaking my Cloud Functions until I redeploy them. Today I am going to attempt to detect the problem, then close the Spanner connections and reopen them, but I wanted to see if anyone else had solutions or if a fix might be in the works.
Hey @asthamohta, sorry, I was on holiday; just seeing this now.
I can change this if you want, but as I was explaining:
business code is never impacted as the unhandled rejections usually occur when the Cloud Function has been inactive for several minutes (or even several hours).
None of my Cloud Functions takes more than 30 seconds to run, so I don't expect changing the timeout to do anything. The Cloud Functions' invocations always complete successfully within the current timeout of 60 seconds.
I see. What we suspect is that the background tasks get killed when the function completes, and when a new call comes in and it restarts, Spanner tries to ping the sessions, which fails. Can you also try passing your session pool settings as follows to check if this might be the case:
{
  idlesAfter: 500000, // This will make sure a session is (almost) never considered idle.
  keepAlive: 500000, // This will practically disable the keep-alive calls for sessions.
  acquireTimeout: 15000,
  concurrency: 100,
  max: 400,
  min: 25,
  incStep: 25,
  fail: true,
}
I've deployed the Cloud Functions using the Spanner session pool configuration from your message. I only changed min, max, and incStep because Cloud Functions (V1) don't need a high number of sessions as they serve a single request at a time.
I will let you know in the next few days whether this change affects the number of unhandled rejections I'm observing.
Sure let me know :)
Ha, I was just about to write back!
I haven't noticed any errors in the past 2 days on the environment where I made the configuration change. I just had to use 30000 minutes instead of 500000 as the "big number", otherwise the duration doesn't fit in a signed 32-bit integer once converted to milliseconds, and setTimeout/setInterval complains (and falls back to a 1 millisecond interval).
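For reference, a quick check of that limit (Node emits a TimeoutOverflowWarning and clamps the delay to 1 ms when it overflows):
// setTimeout/setInterval delays must fit in a signed 32-bit integer.
const MAX_DELAY_MS = 2 ** 31 - 1; // 2147483647 ms, roughly 24.8 days

console.log(500000 * 60 * 1000); // 30000000000 -> overflows, falls back to 1 ms
console.log(30000 * 60 * 1000);  // 1800000000  -> fits (about 20.8 days)
console.log(MAX_DELAY_MS);       // 2147483647  -> the ceiling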
I'm not entirely sure what this configuration implies performance-wise. From what I understand, idle sessions older than 1 hour will get deleted anyway, and new ones will be created?
Still interested in a long-term solution, though. :-)
Thanks @flovouin, you're correct about the session thing. The long-term solution would be for Cloud Functions servers not to completely kill all background activity when there are no requests. I will check with my team whether any such request is currently being worked on.
Okay, thanks. I guess this will do the trick for now.
I think it would still be worth looking into unhandled rejections though, as this is not directly linked to Cloud Functions and could happen anywhere. :-)
@asthamohta My situation is slightly different here. My cloud function is running on Cloud Functions V2, which I think is basically just a long-running server. (It's possible it doesn't get compute resources while requests are not in flight, though.) These functions run on Cloud Run.
For us, every once in a while we make a Spanner call and I get a DEADLINE_EXCEEDED error that says Exception occurred in retry method that was not classified as transient. These calls generally take <5 seconds and then suddenly time out at 60s. I was thinking it could be due to an unhandled rejection, so it eventually times out since the exception wasn't handled anywhere. Would the configuration you mentioned above solve this problem?
The problem persists until I redeploy the function. For now, I am going to detect the failure, close the Spanner connection and reopen it, and also close and restart it once every hour. Unfortunately, if we have any requests in flight while I am doing this close/reopen, I assume they will fail in the meantime. Roughly what I have in mind is the sketch below.
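(Untested sketch; the names and the error-code check are my own, not an official recipe.)
import { Spanner } from "@google-cloud/spanner";

const INSTANCE = "my-instance";
const DATABASE = "my-database";

let spanner = new Spanner();
let database = spanner.instance(INSTANCE).database(DATABASE);

// Drop the current handle (and its session pool) and create a fresh one.
// Recreating the whole client is heavy-handed, but avoids any cached handles.
async function resetDatabase() {
  try {
    await database.close(); // releases the pooled sessions
  } catch (err) {
    console.warn("Error while closing database:", err.message);
  }
  spanner = new Spanner();
  database = spanner.instance(INSTANCE).database(DATABASE);
}

// Usage sketch: retry once with a fresh handle on a deadline error.
async function runWithReset(query) {
  try {
    return await database.run(query);
  } catch (err) {
    if (err.code === 4 /* gRPC DEADLINE_EXCEEDED */) {
      await resetDatabase();
      return database.run(query);
    }
    throw err;
  }
}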
Any guidance would be super helpful! For our ~30 functions this happens maybe once or twice a day to one of the functions with no real pattern in which one it happens to.
Unhandled error Error: 4 DEADLINE_EXCEEDED: Deadline exceeded
at Object.callErrorFromStatus (/workspace/node_modules/@grpc/grpc-js/build/src/call.js:31:19)
at Object.onReceiveStatus (/workspace/node_modules/@grpc/grpc-js/build/src/client.js:190:52)
at /workspace/node_modules/@grpc/grpc-js/build/src/call-stream.js:111:35
at Object.onReceiveStatus (/workspace/node_modules/grpc-gcp/build/src/index.js:73:29)
at InterceptingListenerImpl.onReceiveStatus (/workspace/node_modules/@grpc/grpc-js/build/src/call-stream.js:106:23)
at Object.onReceiveStatus (/workspace/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:365:141)
at Object.onReceiveStatus (/workspace/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:328:181)
at /workspace/node_modules/@grpc/grpc-js/build/src/call-stream.js:187:78
at processTicksAndRejections (node:internal/process/task_queues:78:11)
for call at
at ServiceClientImpl.makeUnaryRequest (/workspace/node_modules/@grpc/grpc-js/build/src/client.js:160:30)
at ServiceClientImpl.<anonymous> (/workspace/node_modules/@grpc/grpc-js/build/src/make-client.js:105:19)
at /workspace/node_modules/@google-cloud/spanner/build/src/v1/spanner_client.js:214:29
at /workspace/node_modules/google-gax/build/src/normalCalls/timeout.js:44:16
at repeat (/workspace/node_modules/google-gax/build/src/normalCalls/retries.js:80:25)
at /workspace/node_modules/google-gax/build/src/normalCalls/retries.js:118:13
at OngoingCall.call (/workspace/node_modules/google-gax/build/src/call.js:67:27)
at NormalApiCaller.call (/workspace/node_modules/google-gax/build/src/normalCalls/normalApiCaller.js:34:19)
at /workspace/node_modules/google-gax/build/src/createApiCall.js:81:30 {
code: 4,
details: 'Deadline exceeded',
metadata: Metadata { internalRepr: Map(0) {}, options: {} },
note: 'Exception occurred in retry method that was not classified as transient'
}
For Cloud Run, perhaps you can try this configuration to allocate resources for things to run in the background: https://cloud.google.com/blog/topics/developers-practitioners/use-cloud-run-always-cpu-allocation-background-work. Can you check whether this resolves the issue? My guess is that when the function restarts, our code tries to ping the sessions, and since the sessions have been deleted by then, it fails. I think the configuration above should solve the issue, but before that, see if the "always-on" CPU allocation works for you, because that would be a better solution.
@asthamohta I looked into turning on "CPU always allocated", but unfortunately it looks like that is difficult to do in Cloud Functions right now.
Do you think the settings above are good to use in prod moving forward to avoid this error? Or do you have any advice on what the best settings would be? Is there anything that can tell nodejs-spanner to recover when it hits that problem of the sessions being deleted, instead of everything just timing out and getting DEADLINE_EXCEEDED until we redeploy?
For reference: My bug for cloud functions with CPU allocation on https://github.com/firebase/firebase-functions/issues/1222
@dannnnthemannnn To be honest, I wouldn't call these parameters production-ready. It just amounts to setting a parameter that wasn't designed for this to an extremely high value. Setting it too high might even have side effects.
@asthamohta Unfortunately the error has occurred again after a few days. It's not as frequent as before, but it still occurs. Also, because the errors are still unhandled, they still make noise in Error Reporting / Cloud Logging / our alerting system.
The diagnosis (about Cloud Functions killing background activity) makes sense, and I don't really mind that the keep-alive mechanism cannot work in this case. But I'd suggest fixing the error handling as we discussed, because that's unfortunately something we cannot work around outside the Spanner client itself. Also, although I'm aware it's more work, it might be useful to have a proper parameter for disabling the keep-alive mechanism when it's expected not to work. Currently, working around it with the existing parameters does not work 100% of the time, and that makes it hard to know whether there is actually something more to look into, or whether the existing parameters are simply not made for this.
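To make that last suggestion concrete, I'm imagining something along these lines, purely hypothetically (such an option does not exist in the client today):
import { Spanner } from "@google-cloud/spanner";

// Hypothetical illustration only: `keepAlive: false` (or any equivalent opt-out
// flag) is NOT a real option in @google-cloud/spanner; this just sketches the
// feature being requested.
const database = new Spanner().instance("test").database("test", {
  keepAlive: false, // hypothetical: disable background session pings entirely
});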