
Handle Lambda timeouts and PHP-FPM crashes better

Open aran112000 opened this issue 4 years ago • 12 comments

One of our Lambda functions running in eu-west-2 started intermittently logging errors in CloudWatch stating: "Error: signal: terminated Extension.Crash"

When this occurs, none of our application code is executed and the request just fails after ~2 seconds with API Gateway then returning a 500 error on the affected requests.

Our function only uses 3 layers and has done so for months. The issues appeared for about 2 hours on 2021-02-08, then again around the same time in the evening last night, 2021-02-09:

  • ${bref:layer.php-74-fpm}
  • ${bref:extra.memcached-php-74}
  • ${bref:extra.redis-php-74}

For clarity, there were no changes to the application or environment on our side during these times, and we have no time-based logic that would account for the issues starting in the evening for a few hours two days running.

Not sure if this is a Bref / Bref Extra issue or an AWS one. The Lambda team have investigated and believe it is related to the layers, so any help/guidance to track things down further would be greatly appreciated.

aran112000 avatar Feb 10 '21 10:02 aran112000

We've managed to track the issue down to PHP-FPM, as can be seen from the log extract below. Has anyone encountered this before? For us, it has only happened on the last two nights; it seems to eventually resolve itself, but it sporadically affected over 1k invocations.

(screenshot: extract of the PHP-FPM error logs)

ghost avatar Feb 10 '21 17:02 ghost

@evoaran we are experiencing this issue as well. It fixes itself, I think, when Lambda spawns another micro container instance.

matthewsuan avatar Feb 11 '21 01:02 matthewsuan

Hi! I haven't forgotten about this, I'm just short of ideas on what to do here. FYI there's also #842 that could be related; it may be worth giving it a look.

it seems to eventually resolve itself, but it sporadically affected over 1k invocations.

Right, as @matthewsuan mentioned, this might be a case of the Lambda instance that is in a "broken" state being replaced by another Lambda instance.

Now the questions would be:

  • what causes Lambda/PHP-FPM to enter a broken state?
  • how can we improve Bref so that we never enter a broken state? (i.e. it's OK for a single request to fail, but not for the whole container to break)

Judging from #842 and past experience, could the initial failures be triggered by a large file upload or large HTTP response?

mnapoli avatar Feb 12 '21 17:02 mnapoli

Hi @mnapoli, @matthewsuan,

After combing through the logs again this evening, I think I've found the cause that starts it each time - a Lambda timeout:

Task timed out after 28.53 seconds

What I'm still not sure of, though, is what caused the initial timeout. It's just a normal HTTP request, and our dependent services for this example request (Redis, Memcached & MySQL RDS) were all running perfectly at the time, with connection timeouts low enough that we'd never reach the Lambda timeout.
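For reference, the sort of client-side timeouts we rely on look roughly like this (illustrative values and hostnames, not our actual configuration):

```php
<?php
// Illustrative only: connect/read timeouts kept well below the ~28s Lambda
// timeout so a slow dependency fails fast rather than hanging the invocation.

$redis = new Redis();
$redis->connect('redis.internal', 6379, 1.0);            // 1s connect timeout
$redis->setOption(Redis::OPT_READ_TIMEOUT, 2.0);         // 2s read timeout

$memcached = new Memcached();
$memcached->setOption(Memcached::OPT_CONNECT_TIMEOUT, 1000); // in milliseconds
$memcached->addServer('memcached.internal', 11211);

$pdo = new PDO(
    'mysql:host=mysql.internal;dbname=app',
    'app_user',
    'secret',
    [PDO::ATTR_TIMEOUT => 2] // connection timeout (seconds) for the MySQL driver
);
```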

To address your specific question, we're seeing a mix of GET and POST requests triggering the issue, and we don't process any file uploads via Lambda, so we can rule that out too. The HTTP responses aren't "large" either: the biggest affected responses are around 45 KB.

Could it just be the Lambda host has died perhaps?

On a slightly different topic, would you be happy to accept a PR to change the CloudWatch log format for Bref so that these FPM-related messages include a JSON payload with more context (Lambda request ID and API Gateway request ID, where applicable)? This would really help tie the various log entries together when issues like this occur.
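To illustrate what I have in mind, something along these lines (just a sketch, not an actual patch; the helper name and field names are placeholders):

```php
<?php
// Sketch: write FPM-related messages as a single JSON line so CloudWatch Logs
// Insights can correlate them with the Lambda / API Gateway request IDs.

function logFpmEvent(string $message, ?string $lambdaRequestId = null, ?string $apiGatewayRequestId = null): void
{
    echo json_encode([
        'source' => 'bref/php-fpm',
        'message' => $message,
        'lambdaRequestId' => $lambdaRequestId,
        'apiGatewayRequestId' => $apiGatewayRequestId,
        'timestamp' => date(DATE_ATOM),
    ]) . PHP_EOL;
}

// e.g. logFpmEvent('PHP-FPM did not terminate within the timeout', $awsRequestId, $apiGatewayRequestId);
```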

aran112000 avatar Feb 12 '21 23:02 aran112000

I went and checked the logs, and these same errors happen during occasional timeouts (28 secs) as well, though I expect timeouts from time to time with my application.

I think that when this happens, the FPM runtime is not able to shut down properly during the Shutdown phase, which is capped at 2 seconds before Lambda forces a SIGKILL.

matthewsuan avatar Feb 13 '21 09:02 matthewsuan

Thanks for those details! So we might need to adjust the current recovery mechanism in case of Lambda timeouts:

https://github.com/brefphp/bref/blob/e94df971c6f96f00d5683e3b474913fd293b46a0/src/Event/Http/FpmHandler.php#L214-L258

Maybe we need to send a stronger "kill" signal in case we can't kill PHP-FPM here:

https://github.com/brefphp/bref/blob/e94df971c6f96f00d5683e3b474913fd293b46a0/src/Event/Http/FpmHandler.php#L273

Any other idea?

On a slightly different topic, would you be happy to accept a PR to change the CloudWatch log format for Bref so that these FPM-related messages include a JSON payload with more context (Lambda request ID and API Gateway request ID, where applicable)? This would really help tie the various log entries together when issues like this occur.

Do you mean to change the format of all the logs written by PHP applications? Or specifically the logs written by Bref when managing PHP-FPM?

In short, I'm 👍 on improving Bref-related exceptions and logs. But changing the format of all application logs to JSON might be too big of a change.

mnapoli avatar Feb 13 '21 19:02 mnapoli

Completely agree. Perhaps we could look at swapping out this... https://github.com/brefphp/bref/blob/e94df971c6f96f00d5683e3b474913fd293b46a0/src/Event/Http/FpmHandler.php#L249

...to use KILL (9) rather than TERM (15) to ensure it actually terminates regardless of what might be going on behind the scenes.
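Roughly what I have in mind (just a sketch, not the actual FpmHandler code; it assumes the FPM master is wrapped in a Symfony Process, as Bref does):

```php
<?php
// Sketch: graceful stop first, then escalate to SIGKILL if PHP-FPM is still
// alive, so the runtime never carries a half-dead FPM master into the next
// invocation. Not the actual FpmHandler implementation.

use Symfony\Component\Process\Process;

function stopFpm(Process $fpm): void
{
    if (! $fpm->isRunning()) {
        return;
    }

    $fpm->signal(15); // SIGTERM: ask FPM to shut down gracefully

    usleep(300000); // short grace period, well under Lambda's shutdown budget

    if ($fpm->isRunning()) {
        $fpm->signal(9); // SIGKILL: force it so the next invocation starts clean
    }
}
```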

Regarding the logs, I was purely thinking about the FPM-related logging to help track down what's going on here more easily so I'll take a look at a PR to add this in shortly.

ghost avatar Feb 13 '21 23:02 ghost

Do you think it might be worthwhile setting a timeout on FPM that's a little lower than the Lambda timeout, in order to allow a little extra time for everything to wrap up successfully?

https://github.com/brefphp/bref/blob/master/src/Event/Http/FpmHandler.php#L79
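For example, something roughly along these lines when building the pool config (illustrative only; the variable names are made up, but request_terminate_timeout is a real php-fpm pool directive):

```php
<?php
// Sketch: derive php-fpm's request_terminate_timeout from the Lambda timeout
// minus a safety margin, so FPM kills a stuck worker before Lambda kills the
// whole container, leaving the runtime time to return a clean error and reset FPM.

$lambdaTimeoutSeconds = 28; // e.g. taken from the function configuration
$safetyMarginSeconds = 2;   // leave room for cleanup and log flushing

$fpmTimeout = max(1, $lambdaTimeoutSeconds - $safetyMarginSeconds);

$poolConfig = <<<INI
[www]
; terminate a stuck request before the whole invocation times out
request_terminate_timeout = {$fpmTimeout}s
INI;
```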

aran112000 avatar Feb 14 '21 00:02 aran112000

Pasting this here just because fiddling around with timeouts was mentioned, and it is extremely tricky:

https://github.com/brefphp/bref/pull/770 https://github.com/brefphp/bref/pull/772

deleugpn avatar Feb 14 '21 08:02 deleugpn

I think we are seeing this issue too. Our uptime check fails (within 2 seconds), and this seems to occur when other requests have timed out, so I figure it's the same issue: php-fpm is in a bad state when the runtime is reused by our uptime ping request.

Has anyone been able to work around this issue?

clarkeash avatar Nov 25 '21 16:11 clarkeash

We still see it following a timeout (normally after RDS Serverless fails to scale)

aran112000 avatar Nov 25 '21 17:11 aran112000

This PR may help to handle timeouts with FPM better, including flushing all logs (and possibly restarting the child FPM worker in a cleaner way): #1133

mnapoli avatar Jan 04 '22 16:01 mnapoli

Since it's been about a year since the last reports, and we have merged #1133 too, I'll go ahead and close this one.

mnapoli avatar Dec 01 '22 17:12 mnapoli