
node: rpc write timeout work-around

Open · s1na opened this pull request

A work-around for https://github.com/golang/go/issues/47229: trigger the timeout a bit before the stdlib closes the connection so users see a proper error message. Note that I still kept the WriteTimeout of the http.Server.
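As a rough sketch of that idea (withWriteDeadline, the 100ms margin, and the shortened 5s demo timeout are all illustrative choices for this example, not the PR's actual code), the deadline can be derived from the server's WriteTimeout in an outer HTTP middleware:

package main

import (
	"context"
	"log"
	"net/http"
	"time"
)

// withWriteDeadline attaches a context deadline slightly shorter than the
// server's WriteTimeout, so the handler can still send a proper error before
// net/http silently drops the connection.
func withWriteDeadline(writeTimeout time.Duration, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		ctx, cancel := context.WithTimeout(r.Context(), writeTimeout-100*time.Millisecond)
		defer cancel()
		next.ServeHTTP(w, r.WithContext(ctx))
	})
}

func main() {
	rpcHandler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		select {
		case <-time.After(10 * time.Second): // pretend slow RPC method
			w.Write([]byte(`{"jsonrpc":"2.0","id":1,"result":"ok"}`))
		case <-r.Context().Done():
			// Written ~100ms before WriteTimeout, so it still reaches the client.
			http.Error(w, "request timed out", http.StatusRequestTimeout)
		}
	})
	srv := &http.Server{
		Addr:         "127.0.0.1:8080",
		WriteTimeout: 5 * time.Second, // geth keeps the default 30s; shortened here for the demo
		Handler:      withWriteDeadline(5*time.Second, rpcHandler),
	}
	log.Fatal(srv.ListenAndServe())
}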

I'm not sure what happens if the method handler and the timeout path both write at the same time.

Fixes #21430

s1na · Aug 01 '22

I implemented a different approach, which @fjl and I roughly discussed:

The RPC handler now runs the method in a goroutine. It waits for that goroutine to finish OR for the context deadline to expire, in which case it returns an error. The method keeps running in the background until it finishes (a possible leak). It is therefore important that long-running methods respect the context deadline; I made sure this happens in getLogs, which received most of the tickets for this issue.
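For illustration, here is that shape as a standalone sketch; callWithDeadline, errTimeout, and the inline slow method are made-up names for this example, not the PR's actual code:

package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// errTimeout stands in for the JSON-RPC timeout error added in the PR.
var errTimeout = errors.New("request timed out")

// callWithDeadline runs fn on a goroutine and returns either its result or
// errTimeout once ctx expires.
func callWithDeadline(ctx context.Context, fn func(context.Context) (string, error)) (string, error) {
	type result struct {
		val string
		err error
	}
	ch := make(chan result, 1) // buffered so the background goroutine can always send and exit
	go func() {
		val, err := fn(ctx)
		ch <- result{val, err}
	}()
	select {
	case r := <-ch:
		return r.val, r.err
	case <-ctx.Done():
		// fn keeps running in the background; this is the possible leak
		// mentioned above, and why handlers like getLogs must watch ctx.
		return "", errTimeout
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	_, err := callWithDeadline(ctx, func(_ context.Context) (string, error) {
		time.Sleep(2 * time.Second) // a slow method that ignores its context
		return "done", nil
	})
	fmt.Println(err) // prints "request timed out" after ~1s
}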

The timeout is set in the outer layer (node/rpcstack). Alternatively, we could pass it through to the handler and let it set the timeout.

Note: the error type I added feels awkward alongside the official JSON-RPC errors.

s1na · Aug 09 '22

This seems to work well, with a test method like this:

diff --git a/internal/ethapi/api.go b/internal/ethapi/api.go
index 90322033b9..b90ac5048d 100644
--- a/internal/ethapi/api.go
+++ b/internal/ethapi/api.go
@@ -2035,3 +2035,10 @@ func toHexSlice(b [][]byte) []string {
 	}
 	return r
 }
+
+func (api *DebugAPI) TimeMeOut(ctx context.Context, seconds uint64) (string, error) {
+	log.Info("TimeMeOut sleeping...", "seconds", seconds)
+	time.Sleep(time.Second * time.Duration(seconds))
+	log.Info("TimeMeOut waking up!")
+	return "Oll korrekt!", nil
+}

The 25s call works fine, but the 35s call times out:

[user@work go-ethereum]$ curl --data '{"method":"debug_timeMeOut","params":[25],"id":1,"jsonrpc":"2.0"}' -H "Content-Type: application/json" -X POST localhost:8545
{"jsonrpc":"2.0","id":1,"result":"Oll korrekt!"}
[user@work go-ethereum]$ curl --data '{"method":"debug_timeMeOut","params":[35],"id":1,"jsonrpc":"2.0"}' -H "Content-Type: application/json" -X POST localhost:8545
{"jsonrpc":"2.0","id":1,"error":{"code":408,"message":"request timed out"}}

And the timeout triggers after 30s:

INFO [08-30|18:47:55.105] TimeMeOut sleeping...                    seconds=35
WARN [08-30|18:48:25.056] Served debug_timeMeOut                   conn=127.0.0.1:44346 reqid=1 duration=29.950591423s err="request timed out"
INFO [08-30|18:48:30.106] TimeMeOut waking up! 

holiman · Aug 30 '22

What I don't like about this PR is that the RPC handler keeps running in case of a timeout, because it now executes the RPC call on a background goroutine.

I agree. If you have a DoS RPC request that you use to crash some remote node (or Infura), then this feature would make it even easier to cause a DoS: you could just fire sequentially and forget, and get a multithreaded impact.

holiman · Aug 31 '22

TODO triage discussion: is this worth doing, or is it dead in the water due to the inherent limitations of this approach?

holiman · Nov 05 '22

Alternative approach:

When handling an RPC call, launch a timer using time.AfterFunc(timeout, callback), where the callback delivers the error response if the response has not been sent yet. This can be done using a 1-buffered channel, for example:

responseSlot := make(chan struct{}, 1)
responseSlot <- struct{}{}

responded := make(chan struct{})

timeoutResponseTimer := time.AfterFunc(timeout, func() {
    select {
    case <-responseSlot:
        // The timeout occurred and the method has not responded yet.
        // Send the timeout error response.
        close(responded)
    case <-responded:
        // The method responded.
    }
})

response := runMethod()
timeoutResponseTimer.Stop()
select {
case <-responseSlot:
    // Send the response.
    close(responded)
case <-responded:
    // The timeout error response was already sent.
}

fjl · Nov 15 '22

Actually, it can also be done with fewer channels using sync.Once:

var respondOnce sync.Once
timeoutResponseTimer := time.AfterFunc(timeout, func() {
    respondOnce.Do(func() {
        // send timeout error
    })
})

response := runMethod()
timeoutResponseTimer.Stop()
respondOnce.Do(func() {
    // send response
})
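
For illustration, here is that pattern wired into a plain net/http handler as a self-contained program; slowMethod, the one-second timeout, and the 408 status are placeholders rather than the PR's actual code:

package main

import (
	"fmt"
	"log"
	"net/http"
	"sync"
	"time"
)

func slowMethod() string {
	time.Sleep(2 * time.Second) // pretend RPC work that ignores any deadline
	return "ok"
}

func handle(w http.ResponseWriter, r *http.Request) {
	var respondOnce sync.Once

	// If the method is still running when the timer fires, the timeout error
	// wins; the later respondOnce.Do in the normal path becomes a no-op.
	timeoutResponseTimer := time.AfterFunc(1*time.Second, func() {
		respondOnce.Do(func() {
			http.Error(w, "request timed out", http.StatusRequestTimeout)
		})
	})

	response := slowMethod()
	timeoutResponseTimer.Stop()
	respondOnce.Do(func() {
		fmt.Fprintln(w, response)
	})
}

func main() {
	http.HandleFunc("/", handle)
	log.Fatal(http.ListenAndServe("127.0.0.1:8080", nil))
}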

fjl · Nov 15 '22

Felix and I tried this approach. Some notes:

  • Now an error is always returned when the timeout hits, without launching unnecessary goroutines 👍
  • Timeouts are now also applied to batch requests, which we had missed before
  • There's a small issue we have to investigate: when the timeout happens, curl gives a warning: curl: (18) transfer closed with outstanding read data remaining
  • Batch timeout handling got out of hand with a mutex; we'll probably try channels instead (a rough sketch of the mutex flavour follows this list)
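
To make that last point concrete, here is a rough self-contained sketch of the mutex flavour (batchItem, runBatch, and slowCall are made-up names, not geth's rpc types): every slot has to be settled exactly once, by either the worker or the timeout callback, and the bookkeeping around that is what became unwieldy.

package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

type batchItem struct {
	method string
	result string
	err    error
	done   bool
}

func runBatch(items []*batchItem, timeout time.Duration) {
	var mu sync.Mutex

	// When the timeout fires, settle every slot that has no answer yet.
	timer := time.AfterFunc(timeout, func() {
		mu.Lock()
		defer mu.Unlock()
		for _, it := range items {
			if !it.done {
				it.err = errors.New("request timed out")
				it.done = true
			}
		}
	})
	defer timer.Stop()

	for _, it := range items {
		res := slowCall(it.method) // stand-in for executing the RPC method
		mu.Lock()
		if !it.done { // the timer may have settled this slot already
			it.result = res
			it.done = true
		}
		mu.Unlock()
	}
}

func slowCall(method string) string {
	time.Sleep(700 * time.Millisecond)
	return "ok"
}

func main() {
	items := []*batchItem{{method: "eth_chainId"}, {method: "eth_blockNumber"}, {method: "eth_getLogs"}}
	runBatch(items, time.Second)
	for _, it := range items {
		// With a 1s budget, only the first call finishes; the rest time out.
		fmt.Println(it.method, it.result, it.err)
	}
}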

s1na · Nov 16 '22

This PR closes #26127 (probably)

holiman · Nov 17 '22

For some reason Go's http client doesn't play nice with how we're handling timeouts. When using ethclient to invoke a method that times out, I get Post "http://127.0.0.1:8545": EOF instead of the timeout response. cURL and JS clients work fine.

s1na · Nov 21 '22

Aha, I found out why one works and the other doesn't. If the request has Accept-Encoding: gzip (the default for the Go client), it stalls and the connection is cut. Going to look into the reason.

Edit: ah, it might be because we set the Content-Length after encoding to JSON.

s1na · Nov 23 '22

Content-Length was wrong for gzip-compressed responses, but after fixing that the problem is still not gone. Requests with Accept-Encoding: gzip that time out still hang. In this case no TCP packet for the response leaves the server at all; it's dropped on the floor somewhere in the post-processing of the response.

s1na · Nov 23 '22

It works for me now, with this test code:

//go:build ignore
// +build ignore

package main

import (
	"fmt"

	"github.com/ethereum/go-ethereum/rpc"
)

func main() {
	client, _ := rpc.Dial("http://127.0.0.1:8545")

	fmt.Println("regular call")
	var result string
	err := client.Call(&result, "eth_blockNumber")
	if err != nil {
		fmt.Println("err:", err)
	} else {
		fmt.Println("result:", result)
	}

	fmt.Println("batch call")
	batch := []rpc.BatchElem{
		{
			Method: "eth_chainId",
			Result: new(string),
		},
		{
			Method: "eth_blockNumber",
			Result: new(string),
		},
	}
	err = client.BatchCall(batch)
	if err != nil {
		fmt.Println("err:", err)
	} else {
		for _, elem := range batch {
			if elem.Error != nil {
				fmt.Println(elem.Method+" error:", elem.Error)
			} else {
				fmt.Println(elem.Method+" result:", elem.Result)
			}
		}
	}
}

With the PR I get:

>> go run fjldev/write-timeout.go 
regular call
err: request timed out
batch call
eth_chainId result: 0x1400018c090
eth_blockNumber error: request timed out

On master branch, I get:

~/d/e/s/g/e/go-ethereum >> go run fjldev/write-timeout.go 
regular call
err: Post "http://127.0.0.1:8545": EOF
batch call
err: Post "http://127.0.0.1:8545": EOF

Note this is with the following diff applied:

diff --git a/internal/ethapi/api.go b/internal/ethapi/api.go
index ea0cbfe867..9a3abf120d 100644
--- a/internal/ethapi/api.go
+++ b/internal/ethapi/api.go
@@ -622,6 +622,8 @@ func (api *BlockChainAPI) ChainId() *hexutil.Big {
 
 // BlockNumber returns the block number of the chain head.
 func (s *BlockChainAPI) BlockNumber() hexutil.Uint64 {
+	time.Sleep(3 * time.Second)
+
 	header, _ := s.b.HeaderByNumber(context.Background(), rpc.LatestBlockNumber) // latest header should always be available
 	return hexutil.Uint64(header.Number.Uint64())
 }
diff --git a/rpc/http.go b/rpc/http.go
index 0ba6588f99..2ed5176a39 100644
--- a/rpc/http.go
+++ b/rpc/http.go
@@ -114,7 +114,7 @@ type HTTPTimeouts struct {
 var DefaultHTTPTimeouts = HTTPTimeouts{
 	ReadTimeout:       30 * time.Second,
 	ReadHeaderTimeout: 30 * time.Second,
-	WriteTimeout:      30 * time.Second,
+	WriteTimeout:      1 * time.Second,
 	IdleTimeout:       120 * time.Second,
 }

fjl · Nov 23 '22

There is still one unresolved issue with this PR: the server outputs an incomplete chunked-transfer-encoded body in the case of a timeout when gzip is enabled. Apparently the Go HTTP client doesn't care, but curl warns about it:

curl: (18) transfer closed with outstanding read data remaining

In the linked issue #26127, the Python HTTP client raised an exception about it. The problem with implementing this correctly is that we'd need to compress the entire response to get the Content-Length. For now we skipped doing this because we don't want to buffer large compressed responses in memory; it is much better to just write them out chunked.
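
As an illustration of that trade-off (a hedged sketch, not rpc/http.go), buffering the compressed body in memory is what would make an accurate Content-Length possible:

package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strconv"
)

// writeCompressed sketches the "compress the entire response first" option:
// the gzipped body is buffered so its exact size is known before writing.
func writeCompressed(w http.ResponseWriter, body []byte) error {
	var buf bytes.Buffer
	gz := gzip.NewWriter(&buf)
	if _, err := gz.Write(body); err != nil {
		return err
	}
	if err := gz.Close(); err != nil { // flush the gzip trailer before measuring
		return err
	}
	w.Header().Set("Content-Encoding", "gzip")
	w.Header().Set("Content-Length", strconv.Itoa(buf.Len())) // compressed size, not len(body)
	_, err := w.Write(buf.Bytes())
	return err
}

func main() {
	rec := httptest.NewRecorder()
	resp := []byte(`{"jsonrpc":"2.0","id":1,"error":{"code":408,"message":"request timed out"}}`)
	if err := writeCompressed(rec, resp); err != nil {
		panic(err)
	}
	fmt.Println(rec.Header().Get("Content-Length")) // matches the gzipped body size
}

Declaring a Content-Length up front like this also keeps net/http from falling back to chunked transfer encoding for that response.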

Another way of handling this could be disabling compression for error responses somehow.

fjl · Nov 23 '22

We ended up finding a way to disable chunked transfer encoding, so this is now finally working!

fjl · Nov 24 '22

Using the tests from https://github.com/ethereum/go-ethereum/issues/21430#issuecomment-853695381:

[user@work go-ethereum]$ curl --data '{"method":"debug_timeMeOut","params":[25],"id":1,"jsonrpc":"2.0"}' -H "Content-Type: application/json" -X POST localhost:8545
{"jsonrpc":"2.0","id":1,"result":"Oll korrekt!"}
[user@work go-ethereum]$ curl --data '{"method":"debug_timeMeOut","params":[30],"id":1,"jsonrpc":"2.0"}' -H "Content-Type: application/json" -X POST localhost:8545
{"jsonrpc":"2.0","id":1,"error":{"code":408,"message":"request timed out"}}

LGTM!

holiman · Nov 29 '22

I'm going to revert #26116 in this PR since the HTTP timeout cuts the connection on GraphQL anyway. Unfortunately, there's no error message in this case either:

❯ curl -X POST -H "Content-Type: application/json" --data '{"query":"query{block() { number }}"}' --output - http://127.0.0.1:8545/graphql
curl: (52) Empty reply from server

Edit: no, it's good to keep that, since it will cancel the GraphQL query processing. I will only bring the timeout down to 30s to be in line with the HTTP write timeout.

s1na · Nov 29 '22

GraphQL also returns an error now. Note that it's a plain-text error with status 500 Internal Server Error, similar to how we did it for the JSON marshalling error:

❯ curl -X POST -H "Content-Type: application/json" --data '{"query":"query{block() { number }}"}' -v http://127.0.0.1:8545/graphql
Note: Unnecessary use of -X or --request, POST is already inferred.
*   Trying 127.0.0.1:8545...
* Connected to 127.0.0.1 (127.0.0.1) port 8545 (#0)
> POST /graphql HTTP/1.1
> Host: 127.0.0.1:8545
> User-Agent: curl/7.79.1
> Accept: */*
> Content-Type: application/json
> Content-Length: 37
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 500 Internal Server Error
< Content-Length: 17
< Date: Tue, 29 Nov 2022 14:08:54 GMT
< Content-Type: text/plain; charset=utf-8
<
* Connection #0 to host 127.0.0.1 left intact
request timed out

s1na · Nov 29 '22

Can't we return a GraphQL error response?

fjl · Nov 30 '22

I have decided to remove the changes in eth/filters and eth/tracers from this PR. We can submit them in a separate change.

fjl · Dec 06 '22

RPC method handler changes resubmitted in https://github.com/ethereum/go-ethereum/pull/26320

fjl · Dec 06 '22