bref icon indicating copy to clipboard operation
bref copied to clipboard

FastCgiCommunicationFailed on every second request

Open NeilJ247 opened this issue 4 years ago • 16 comments

Description:

I am seeing a 502 gateway error on every second request made to my api. In cloudwatch logs I can see the following:

START RequestId: 3711f7ba-56d6-41d6-9e70-decaa7f0fcdf Version: $LATEST
--
Error communicating with PHP-FPM to read the HTTP response. A root cause of this can be that the Lambda (or PHP) timed out, for example when trying to connect to a remote API or database, if this happens continuously check for those! Bref will restart PHP-FPM now. Original exception message: hollodotme\FastCGI\Exceptions\WriteFailedException Failed to write request to socket [broken pipe]
[18-Jan-2021 16:54:24] NOTICE: Terminating ...
[18-Jan-2021 16:54:24] NOTICE: exiting, bye-bye!
2021-01-18 16:54:24.566 +0000 (14 14) info: attempt daemon connection via 'newrelic-daemon.REDACTED:31339'
2021-01-18 16:54:24.568 +0000 (14 14) info: Daemon connection settings specify a host different from the local host. Daemon will not be started by the Agent.
2021-01-18 16:54:24.568 +0000 (14 14) info: New Relic 9.15.0.293 ("vanilla" - "bab8316e39f6") [daemon='newrelic-daemon.int.bdry.com:31339'  php='7.4.14' zts=no sapi='fpm-fcgi'  pid=14 ppid=7 uid=993 euid=993 gid=990 egid=990 backtrace=yes startup=init os='Linux' rel='4.14.203-112.332.amzn2.x86_64' mach='x86_64' ver='#1 SMP Wed Nov 4 20' node='169.254.153.173']
[18-Jan-2021 16:54:24] NOTICE: fpm is running, pid 14
[18-Jan-2021 16:54:24] NOTICE: ready to handle connections
3711f7ba-56d6-41d6-9e70-decaa7f0fcdf	Invoke Error	{     "errorType": "Bref\\Event\\Http\\FastCgi\\FastCgiCommunicationFailed",     "errorMessage": "",     "stack": [         "#0 /var/task/vendor/bref/bref/src/Event/Http/HttpHandler.php(25): Bref\\Event\\Http\\FpmHandler->handleRequest()",         "#1 /var/task/vendor/bref/bref/src/Runtime/Invoker.php(29): Bref\\Event\\Http\\HttpHandler->handle()",         "#2 /var/task/vendor/bref/bref/src/Runtime/LambdaRuntime.php(102): Bref\\Runtime\\Invoker->invoke()",         "#3 /opt/bootstrap(43): Bref\\Runtime\\LambdaRuntime->processNextEvent()",         "#4 {main}"     ] }
END RequestId: 3711f7ba-56d6-41d6-9e70-decaa7f0fcdf
REPORT RequestId: 3711f7ba-56d6-41d6-9e70-decaa7f0fcdf	Duration: 37.85 ms	Billed Duration: 38 ms	Memory Size: 1024 MB	Max Memory Used: 83 MB

Currently the endpoint simply returns a 200. There are no database connections. The only connection I can think of is to an internal fargate service running the new relic daemon (this support was just added recently to bref extensions).

How to reproduce:

Bref: 1.1.3 Extensions: bref:extra.newrelic-php-74 - 0.9.5 Application Framework: Slim 4 Runtime: provided.al2

I am also using a VPC with my Lambda function. It has 2 public and and private subnets with the expected NAT gateway.

NeilJ247 avatar Jan 18 '21 17:01 NeilJ247

Thanks for the report, do you get the same problem without New Relic?

What about the previous requests? Any timeout, or any other problem? Are you doing anything specific or weird in your code? Consuming large memory? Are other requests working fine? Any more detail will help.

mnapoli avatar Jan 27 '21 09:01 mnapoli

I am not using the New Relic extension but I have been seeing a similar FastCgiCommunicationFailed error on every second request. This error seems to occur after querying a 3rd party API that returns more than 2 MB worth of data. The initial request works but then any request afterwards will fail with the below error.

Bref: 1.1.3 Application Framework: Slim 4 Runtime: provided.al2

Error communicating with PHP-FPM to read the HTTP response. A root cause of this can be that the Lambda (or PHP) timed out, for example when trying to connect to a remote API or database, if this happens continuously check for those! Bref will restart PHP-FPM now. Original exception message: hollodotme\FastCGI\Exceptions\ReadFailedException Stream got blocked, or terminated.

{
	"errorType": "Bref\\Event\\Http\\FastCgi\\FastCgiCommunicationFailed",
	"errorMessage": "",
	"stack": ["#0 \/var\/task\/vendor\/bref\/bref\/src\/Event\/Http\/HttpHandler.php(25): Bref\\Event\\Http\\FpmHandler->handleRequest(Object(Bref\\Event\\Http\\HttpRequestEvent), Object(Bref\\Context\\Context))", "#1 \/var\/task\/vendor\/bref\/bref\/src\/Runtime\/Invoker.php(29): Bref\\Event\\Http\\HttpHandler->handle(Array, Object(Bref\\Context\\Context))", "#2 \/var\/task\/vendor\/bref\/bref\/src\/Runtime\/LambdaRuntime.php(102): Bref\\Runtime\\Invoker->invoke(Object(Bref\\Event\\Http\\FpmHandler), Array, Object(Bref\\Context\\Context))", "#3 \/opt\/bootstrap(43): Bref\\Runtime\\LambdaRuntime->processNextEvent(Object(Bref\\Event\\Http\\FpmHandler))", "#4 {main}"]
}

codybolinger avatar Jan 27 '21 16:01 codybolinger

Just experienced this as well. We have a API endpoint that responds with simple json extracted from the database. Our production version is working but - but after adding some more data to the API (not a whole bunch) on our staging environment every second request is returning this exact same error. After seeing what @codybolinger wrote I'm thinking it might be the Redis caching we have in front of our database. I'll try disabling this and update my comment with my findings.

Bref: 1.2.0 Application Framework: Laravel 8 Runtime: provided.al2

Invoke Error	
{
    "errorType": "Bref\\Event\\Http\\FastCgi\\FastCgiCommunicationFailed",
    "errorMessage": "",
    "stack": [
        "#0 /var/task/vendor/bref/bref/src/Event/Http/HttpHandler.php(25): Bref\\Event\\Http\\FpmHandler->handleRequest()",
        "#1 /var/task/vendor/bref/bref/src/Runtime/Invoker.php(29): Bref\\Event\\Http\\HttpHandler->handle()",
        "#2 /var/task/vendor/bref/bref/src/Runtime/LambdaRuntime.php(102): Bref\\Runtime\\Invoker->invoke()",
        "#3 /opt/bootstrap(43): Bref\\Runtime\\LambdaRuntime->processNextEvent()",
        "#4 {main}"
    ]
}

Edit:

Tried disabling our Redis caching and it still appears to be an issue. I have no clue on how i can debug this issue to find the root cause. Anyone have any ideas?

Edit 2:

Spent some time refactoring and all of the sudden the error disappeared. I'm not entirely sure why. Sorry if i added to any confusion on this.

Edit 3:

Error came back this morning. Appears that after a number of invocations it fixed it-self. However - i think i found the issue (at least for me). We can continuously provoke this error when sending form data along with a GET request. This was left out when we copied over a Postman POST request and changed it to GET for testing another endpoint. Leaving the form data there caused this particular issue - removing it resolved it. I'm able to confirm this by continuously calling the API and randomly switch between with & without form data.

danniehansen avatar Feb 23 '21 10:02 danniehansen

Hi,

Sorry for the late reply. The issue could be related to me setting up a VPC with my functions. I mistakenly added both public and private subnets to the function when all I needed was private subnets. I haven't seen this issue since.

Sorry if this doesn't help anymore with this!

Thanks.

NeilJ247 avatar Feb 24 '21 17:02 NeilJ247

Also #862 might be related, might be worth giving it a look?

mnapoli avatar Feb 24 '21 18:02 mnapoli

I remember having this error too. I was experimenting with proving my own php-fpm.conf. I did something which lead to it working every second request.

I was not related to any extension. But it was maybe related to FPM crashes... Or it might be very separate from the problem @NeilJ247 is experiencing.

Nyholm avatar Mar 18 '21 16:03 Nyholm

I am having this issue under some circumstances and it is very strange.

While deploying manually from my machine, everything have been working perfectly for past year. Right now I created a CI/CD pipeline on Github and found some strange things happening.

First, there is the zip-size mystery. When deploying from my local machine the zip-file is 200 kb larger than while deploying using Github Actions. Script to deploy is the same, contents of the zip-file are the same (the only differences are few hashes in composer auto-load files) and also the size after extraction is the same. It wouldn't have had much relevance to this project, if it weren't my only clue on why I am also getting the FastCgiCommunicationFailed error. While deploying from my machine, everything works fine, but when our app gets deployed using Github Actions, it results in getting this error on every second request.

I am currently clueless on why is it happening.

Error communicating with PHP-FPM to read the HTTP response. Bref will restart PHP-FPM now. Original exception message: hollodotme\FastCGI\Exceptions\ReadFailedException Stream got blocked, or terminated.
"errorType": "Bref\\Event\\Http\\FastCgi\\FastCgiCommunicationFailed",
    "errorMessage": "",
    "stack": [
        "#0 /var/task/vendor/bref/bref/src/Event/Http/HttpHandler.php(25): Bref\\Event\\Http\\FpmHandler->handleRequest()",
        "#1 /var/task/vendor/bref/bref/src/Runtime/Invoker.php(29): Bref\\Event\\Http\\HttpHandler->handle()",
        "#2 /var/task/vendor/bref/bref/src/Runtime/LambdaRuntime.php(91): Bref\\Runtime\\Invoker->invoke()",
        "#3 /opt/bootstrap(43): Bref\\Runtime\\LambdaRuntime->processNextEvent()",
        "#4 {main}"
    ]

EDIT: Right now even deployment from my machine turned the wrong way, which means it is probably somewhat random?

peldax avatar Feb 15 '22 12:02 peldax

Update:

I had no choice but to blindly try various tweaks and figure out what causes the problem. Issue disappeared after disabling JIT in PHP config.

The issue was caused by segmentation fault in PHP with JIT enabled.

When segmentation fault occurs, the process does not exit in expected way and does not log any error, but the process is terminated and the hollodotme\FastCGI library correctly detects that stream is not available anymore and throws. It also explains somewhat random occurrence I had - I need to dig deeper and figure out what causes the SIGSEGV.

peldax avatar Feb 15 '22 21:02 peldax

Having it too, @peldax which php version? looks weird that JIT causes SIGSEGV, have you found out why?

shouze avatar Mar 15 '22 13:03 shouze

@shouze No, I haven't figured it out yet. It is very random in my case.

In my PHP GraphQL server library (GraPHPinator) I use Github actions to run tests with matrix of PHP version (8.0 and 8.1) and JIT (enabled/disabled). In branch of PHP 8.1 + JIT enabled it sometimes randomly crashes with segfault, but when you run it again it succeeds. It is somewhat rare and I haven't been able to reproduce.

Edit: Latest PHP 8.1 version.

peldax avatar Mar 15 '22 13:03 peldax

Interesting, if you have recommendations on how to better handle segmentation faults inside Bref that's also something we can consider (i.e. to harden against errors like these, because they may happen).

mnapoli avatar Mar 16 '22 20:03 mnapoli

It might be helpful to add some suggestions into error message - something like FastCgiCommunicationFailed. This may be caused by child process segfaulting.. If I knew it may be segfault then disabling JIT would be the first thing I would try.

peldax avatar Mar 16 '22 21:03 peldax

I'm facing the same issue only in one route of my API:

Error communicating with PHP-FPM to read the HTTP response. Bref will restart PHP-FPM now. Original exception message: hollodotme\FastCGI\Exceptions\WriteFailedException Failed to write request to socket

It's a PATCH route, but I found it weird because as far as I can tell it's the only route that is reporting this issue, and it's on every second request.

lagomf avatar May 18 '22 11:05 lagomf

I stumbled across this exact same error on a PUT request failing every other time. Disabling JIT or even OPcache entirely did not resolve issue.

After some more digging, I noticed the client was sending an empty JSON object {} without the header Content-Type: application/json. Recreating the request without the header in Postman allowed me to consistently replicate the error response every other request.

The request is technically malformed, but the erorr is not clear and that fact it works every other time is a bit of a head scratcher!

I hope this helps someone else!

sephallen avatar Aug 15 '23 08:08 sephallen

@sephallen oh that is amazing! I cannot reproduce it though 🤔

Here's what I tried: I deployed an empty index.php and HTTP function (HTTP API) with the php-83-fpm runtime.

functions:
    test:
        handler: index.php
        runtime: php-83-fpm
        events:
            -   httpApi: '*'

Then I sent a PUT request like this:

PUT https://xxxx.execute-api.us-east-2.amazonaws.com

{}

(no Content-Type header, the body is {})

I got success every time, no bug. I also tried with different Content-Type headers but no problem.

Do you have more details? A simpler reproducible case? Are you using a specific framework? Is this an HTTP API or a REST API? (can you share your serverless.yml or part of it?)

mnapoli avatar Aug 15 '23 12:08 mnapoli

@mnapoli that's interesting!

We are using bref 2.1.2 with Laravel 9.52.9. It uses the eu-west-2 php-82-fpm runtime (arn:aws:lambda:eu-west-2:534081306603:layer:php-82-fpm:33).

We are using the REST API gateway and this particular endpoint was using a simple {proxy+} resource.

We only use the serverless.yml to package up the codebase and deploy as part of our CI/CD, it is very basic and looks like this:

functions:
    website:
        handler: public/index.php
        timeout: 28
        runtime: php-82
        events:
            - http: "ANY /"
            - http: "ANY /{proxy+}"

This lambda is just a single microservice in a larger microservice based system which has its own CloudFormation template in a seperate repo for the infrastructure.

If you need any more information, I'm happy to provide it.

sephallen avatar Aug 15 '23 14:08 sephallen