go-ethereum
node: rpc write timeout work-around
This is a work-around for https://github.com/golang/go/issues/47229: trigger the timeout a bit before the stdlib closes the connection, so users see a proper error message. Note I still kept the WriteTimeout of the http.Server.
I'm not sure what happens if the method handler and the timeout error response are written at the same time.
Fixes #21430
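Roughly, the work-around amounts to something like the sketch below (the wrapper name and the 100ms margin are illustrative assumptions, not the actual node/rpcstack code):

package rpcstacksketch // hypothetical package name, for illustration only

import (
	"context"
	"net/http"
	"time"
)

// withRPCTimeout gives the handler a context deadline slightly shorter than
// the http.Server WriteTimeout, so the timeout error response can still be
// written out before the stdlib closes the connection.
func withRPCTimeout(next http.Handler, writeTimeout time.Duration) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// The 100ms safety margin is an assumption for this sketch, not the
		// value used in the PR.
		ctx, cancel := context.WithTimeout(r.Context(), writeTimeout-100*time.Millisecond)
		defer cancel()
		next.ServeHTTP(w, r.WithContext(ctx))
	})
}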
I implemented a different approach that we roughly discussed with @fjl: the rpc handler now runs the method in a goroutine. It waits for that to finish OR for the context deadline to expire, in which case it returns an error. The method keeps running in the background until it stops (a possible leak), so it is important that long-running methods respect the context deadline. I made sure this happens in getLogs, which accounts for most of the reports of this issue.
The timeout is being set in the outer layer (node/rpcstack). Alternatively, we could pass it through to the handler and let it set the timeout.
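In sketch form, the handler-side change looks roughly like this (runWithDeadline and its signature are illustrative, not the actual rpc package code):

package rpcsketch // hypothetical package name, for illustration only

import (
	"context"
	"errors"
)

// runWithDeadline runs fn on a background goroutine and waits for either its
// result or the context deadline. On timeout, fn keeps running in the
// background (the possible leak mentioned above), so long-running methods
// must watch ctx themselves.
func runWithDeadline(ctx context.Context, fn func() (interface{}, error)) (interface{}, error) {
	type result struct {
		val interface{}
		err error
	}
	resCh := make(chan result, 1) // buffered so the goroutine can exit after a timeout
	go func() {
		val, err := fn()
		resCh <- result{val, err}
	}()
	select {
	case r := <-resCh:
		return r.val, r.err
	case <-ctx.Done():
		return nil, errors.New("request timed out")
	}
}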
Note: the error type I added feels awkward among the official JSON-RPC error codes.
This seems to work well. With a test method like this:
diff --git a/internal/ethapi/api.go b/internal/ethapi/api.go
index 90322033b9..b90ac5048d 100644
--- a/internal/ethapi/api.go
+++ b/internal/ethapi/api.go
@@ -2035,3 +2035,10 @@ func toHexSlice(b [][]byte) []string {
 	}
 	return r
 }
+
+func (api *DebugAPI) TimeMeOut(ctx context.Context, seconds uint64) (string, error) {
+	log.Info("TimeMeOut sleeping...", "seconds", seconds)
+	time.Sleep(time.Second * time.Duration(seconds))
+	log.Info("TimeMeOut waking up!")
+	return "Oll korrekt!", nil
+}
The 25s call works fine, but the 35s call times out:
[user@work go-ethereum]$ curl --data '{"method":"debug_timeMeOut","params":[25],"id":1,"jsonrpc":"2.0"}' -H "Content-Type: application/json" -X POST localhost:8545
{"jsonrpc":"2.0","id":1,"result":"Oll korrekt!"}
[user@work go-ethereum]$ curl --data '{"method":"debug_timeMeOut","params":[35],"id":1,"jsonrpc":"2.0"}' -H "Content-Type: application/json" -X POST localhost:8545
{"jsonrpc":"2.0","id":1,"error":{"code":408,"message":"request timed out"}}
And the timeout triggers after 30s:
INFO [08-30|18:47:55.105] TimeMeOut sleeping... seconds=35
WARN [08-30|18:48:25.056] Served debug_timeMeOut conn=127.0.0.1:44346 reqid=1 duration=29.950591423s err="request timed out"
INFO [08-30|18:48:30.106] TimeMeOut waking up!
What I don't like about this PR is that the RPC handler will keep running in case of a timeout, because it executes the RPC call on a background goroutine now.
I agree about this. If you have a DoS RPC request that you use to crash some remote node (or Infura), then this feature would make it even easier to cause a DoS: you could just sequentially fire and forget, and get a multithreaded impact.
TODO Triage discussion: is this worth doing, or is it dead in the water due to the inherent limitations of this approach?
Alternative approach:
When handling an RPC call, launch a timer using time.AfterFunc(timeout, callback), where the callback delivers the error response if the response has not been sent yet. This can be done using a 1-buffered channel, for example:
responseSlot := make(chan struct{}, 1)
responseSlot <- struct{}{}
responded := make(chan struct{})
timeoutResponseTimer := time.AfterFunc(timeout, func() {
	select {
	case <-responseSlot:
		// The timeout occurred and the method has not responded yet.
		// Send the timeout error response.
		close(responded)
	case <-responded:
		// The method responded.
	}
})
response := runMethod()
timeoutResponseTimer.Stop()
select {
case <-responseSlot:
	// Send the response.
	close(responded)
case <-responded:
	// The timeout error response was already sent.
}
Actually, it can also be done with fewer channels using sync.Once:
var respondOnce sync.Once
timeoutResponseTimer := time.AfterFunc(timeout, func() {
	respondOnce.Do(func() {
		// send timeout error
	})
})
response := runMethod()
timeoutResponseTimer.Stop()
respondOnce.Do(func() {
	// send response
})
We tried this approach with Felix. Some notes:
- An error is now always returned when the timeout hits, without launching unnecessary goroutines 👍
- Timeouts are now also applied to batch requests, which we had missed before
- There's a small issue we have to investigate: when a timeout happens, curl gives a warning
curl: (18) transfer closed with outstanding read data remaining
- Batch timeout handling got out of hand with a mutex; channels are probably worth trying (rough sketch after this list)
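For illustration, a channel-based version could look roughly like the sketch below (collectBatch, indexedResult, and the setup around them are made up for this sketch and are not the PR code; whatever executes the batch calls would feed resCh):

package batchsketch // hypothetical package name, for illustration only

import (
	"context"
	"encoding/json"
)

type indexedResult struct {
	index int
	msg   json.RawMessage
}

// collectBatch gathers one result per call from resCh, or stops early when
// the context deadline expires and fills the remaining slots with a timeout
// error, so no mutex is needed around the response slice.
func collectBatch(ctx context.Context, n int, resCh <-chan indexedResult) []json.RawMessage {
	timeoutErr := json.RawMessage(`{"jsonrpc":"2.0","id":null,"error":{"code":408,"message":"request timed out"}}`)
	out := make([]json.RawMessage, n)
	for received := 0; received < n; received++ {
		select {
		case r := <-resCh:
			out[r.index] = r.msg
		case <-ctx.Done():
			for i := range out {
				if out[i] == nil {
					out[i] = timeoutErr
				}
			}
			return out
		}
	}
	return out
}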
This PR closes #26127 (probably)
For some reason Go's http client doesn't play nice with how we're handling timeouts. When using ethclient to invoke a method that times out, I get Post "http://127.0.0.1:8545": EOF instead of the timeout response. cURL and JS clients work fine.
Aha, I found out why one is working and the other is not. If the request has Accept-Encoding: gzip (the default for the Go client), then it stalls and the connection is cut. Going to look into the reason.
Edit: it might be because we set the Content-Length after encoding to JSON.
Content-Length was wrong for gzip-compressed responses, but fixing that did not make the problem go away. Requests with Accept-Encoding: gzip that time out still hang. In this case no TCP packet for the response leaves the server at all; it's dropped on the floor somewhere in the post-processing of the response.
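For context, a consistent Content-Length for a gzipped body has to be measured after compression, e.g. by compressing into a buffer first. A sketch of that option (writeGzipped is an illustrative name, and this is not necessarily the fix applied here):

package gzipsketch // hypothetical package name, for illustration only

import (
	"bytes"
	"compress/gzip"
	"net/http"
	"strconv"
)

// writeGzipped compresses the body into a buffer first, so Content-Length can
// be set to the compressed size instead of the length of the JSON encoding.
func writeGzipped(w http.ResponseWriter, body []byte) error {
	var buf bytes.Buffer
	gz := gzip.NewWriter(&buf)
	if _, err := gz.Write(body); err != nil {
		return err
	}
	if err := gz.Close(); err != nil { // flush the gzip footer before measuring
		return err
	}
	w.Header().Set("Content-Encoding", "gzip")
	w.Header().Set("Content-Length", strconv.Itoa(buf.Len()))
	_, err := w.Write(buf.Bytes())
	return err
}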
It works for me now, with this test code:
//go:build ignore
// +build ignore

package main

import (
	"fmt"

	"github.com/ethereum/go-ethereum/rpc"
)

func main() {
	client, _ := rpc.Dial("http://127.0.0.1:8545")

	fmt.Println("regular call")
	var result string
	err := client.Call(&result, "eth_blockNumber")
	if err != nil {
		fmt.Println("err:", err)
	} else {
		fmt.Println("result:", result)
	}

	fmt.Println("batch call")
	batch := []rpc.BatchElem{
		{
			Method: "eth_chainId",
			Result: new(string),
		},
		{
			Method: "eth_blockNumber",
			Result: new(string),
		},
	}
	err = client.BatchCall(batch)
	if err != nil {
		fmt.Println("err:", err)
	} else {
		for _, elem := range batch {
			if elem.Error != nil {
				fmt.Println(elem.Method+" error:", elem.Error)
			} else {
				fmt.Println(elem.Method+" result:", elem.Result)
			}
		}
	}
}
With the PR I get:
>> go run fjldev/write-timeout.go
regular call
err: request timed out
batch call
eth_chainId result: 0x1400018c090
eth_blockNumber error: request timed out
On master branch, I get:
~/d/e/s/g/e/go-ethereum >> go run fjldev/write-timeout.go
regular call
err: Post "http://127.0.0.1:8545": EOF
batch call
err: Post "http://127.0.0.1:8545": EOF
Note this is with the following diff applied:
diff --git a/internal/ethapi/api.go b/internal/ethapi/api.go
index ea0cbfe867..9a3abf120d 100644
--- a/internal/ethapi/api.go
+++ b/internal/ethapi/api.go
@@ -622,6 +622,8 @@ func (api *BlockChainAPI) ChainId() *hexutil.Big {
 // BlockNumber returns the block number of the chain head.
 func (s *BlockChainAPI) BlockNumber() hexutil.Uint64 {
+	time.Sleep(3 * time.Second)
+
 	header, _ := s.b.HeaderByNumber(context.Background(), rpc.LatestBlockNumber) // latest header should always be available
 	return hexutil.Uint64(header.Number.Uint64())
 }
diff --git a/rpc/http.go b/rpc/http.go
index 0ba6588f99..2ed5176a39 100644
--- a/rpc/http.go
+++ b/rpc/http.go
@@ -114,7 +114,7 @@ type HTTPTimeouts struct {
 var DefaultHTTPTimeouts = HTTPTimeouts{
 	ReadTimeout:       30 * time.Second,
 	ReadHeaderTimeout: 30 * time.Second,
-	WriteTimeout:      30 * time.Second,
+	WriteTimeout:      1 * time.Second,
 	IdleTimeout:       120 * time.Second,
 }
There is still one unresolved issue with this PR: the server outputs an incomplete chunked transfer encoding in case of a timeout when gzip is enabled. Apparently the Go HTTP client doesn't care, but curl warns about it:
curl: (18) transfer closed with outstanding read data remaining
In the linked issue #26127, the Python HTTP client raised an exception about it. The problem with implementing it correctly is that we'd need to compress the entire response to get the Content-Length. For now, we skipped doing this because we don't want to buffer large compressed responses in memory. It is much better to just write them out chunked.
Another way of handling this could be disabling compression for error responses somehow.
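That alternative could look roughly like this (a sketch only; writeResponse and isError are made-up names, not the rpc package API):

package compressionsketch // hypothetical package name, for illustration only

import (
	"compress/gzip"
	"net/http"
	"strconv"
)

// writeResponse skips compression for error responses so they can carry an
// exact Content-Length, and streams regular responses through gzip, which
// makes them go out with chunked transfer encoding.
func writeResponse(w http.ResponseWriter, body []byte, isError bool) error {
	if isError {
		w.Header().Set("Content-Length", strconv.Itoa(len(body)))
		_, err := w.Write(body)
		return err
	}
	w.Header().Set("Content-Encoding", "gzip")
	gz := gzip.NewWriter(w)
	defer gz.Close()
	_, err := gz.Write(body)
	return err
}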
We ended up finding a way to disable chunked transfer encoding, so this is now finally working!
Using the tests from https://github.com/ethereum/go-ethereum/issues/21430#issuecomment-853695381:
[user@work go-ethereum]$ curl --data '{"method":"debug_timeMeOut","params":[25],"id":1,"jsonrpc":"2.0"}' -H "Content-Type: application/json" -X POST localhost:8545
{"jsonrpc":"2.0","id":1,"result":"Oll korrekt!"}
[user@work go-ethereum]$ curl --data '{"method":"debug_timeMeOut","params":[30],"id":1,"jsonrpc":"2.0"}' -H "Content-Type: application/json" -X POST localhost:8545
{"jsonrpc":"2.0","id":1,"error":{"code":408,"message":"request timed out"}}
LGTM!
I'm going to revert #26116 in this PR, since the HTTP timeout cuts the connection on GraphQL anyway. Unfortunately there's no error message in this case either:
❯ curl -X POST -H "Content-Type: application/json" --data '{"query":"query{block() { number }}"}' --output - http://127.0.0.1:8545/graphql
curl: (52) Empty reply from server
Edit: no, it's good to keep that since it will cancel the GraphQL query processing. I will only bring the timeout down to 30s to be in line with the HTTP write timeout.
GraphQL also returns an error now. Note it's a plain-text error with a 500 Internal Server Error status, similar to how we did it for JSON marshalling errors:
❯ curl -X POST -H "Content-Type: application/json" --data '{"query":"query{block() { number }}"}' -v http://127.0.0.1:8545/graphql
Note: Unnecessary use of -X or --request, POST is already inferred.
* Trying 127.0.0.1:8545...
* Connected to 127.0.0.1 (127.0.0.1) port 8545 (#0)
> POST /graphql HTTP/1.1
> Host: 127.0.0.1:8545
> User-Agent: curl/7.79.1
> Accept: */*
> Content-Type: application/json
> Content-Length: 37
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 500 Internal Server Error
< Content-Length: 17
< Date: Tue, 29 Nov 2022 14:08:54 GMT
< Content-Type: text/plain; charset=utf-8
<
* Connection #0 to host 127.0.0.1 left intact
request timed out
Can't we return a GraphQL error response?
I have decided to remove changes in eth/filters and eth/tracers from the PR. We can submit them in a separate change.
RPC method handler changes resubmitted in https://github.com/ethereum/go-ethereum/pull/26320