cronos
Can't connect to ws server in v0.7.0
Env
cronos: 0.6.5 and 0.7.0
Issue
Can't subscribe to newHead and newTxs:
SubscribeNewHeads / newTxs from golang
dial tcp <my_ip>:8546: connect: connection reset by peer ws://<my_ip>:8546
Subscribe from wscat
wscat -c ws://<my_ip>:8546
error: connect ECONNRESET <my_ip>:8546
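Since both clients fail at the TCP level, before any WebSocket handshake, a plain TCP probe may help separate "port not accepting connections" from "WebSocket layer misbehaving". The following is a minimal sketch using only the Go standard library; `probe`, `classify`, `listenLocal` and `defaultTimeout` are hypothetical names made up for this illustration, and the errno matching assumes a Unix-like system. The demo probes a throwaway local listener rather than a real node.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"syscall"
	"time"
)

// defaultTimeout is an arbitrary illustrative value.
const defaultTimeout = 2 * time.Second

// probe opens and immediately closes a TCP connection to addr.
// A nil result only means the port accepts connections; it does not
// prove that a WebSocket handshake would succeed.
func probe(addr string, timeout time.Duration) error {
	conn, err := net.DialTimeout("tcp", addr, timeout)
	if err != nil {
		return err
	}
	return conn.Close()
}

// classify turns a dial error into a short label for logging.
func classify(err error) string {
	switch {
	case err == nil:
		return "ok"
	case errors.Is(err, syscall.ECONNRESET):
		return "reset by peer"
	case errors.Is(err, syscall.ECONNREFUSED):
		return "refused"
	default:
		return "other: " + err.Error()
	}
}

// listenLocal starts a throwaway listener on a free local port so the
// demo does not depend on a real node. It returns the address and a stop func.
func listenLocal() (string, func(), error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return "", nil, err
	}
	return ln.Addr().String(), func() { ln.Close() }, nil
}

func main() {
	addr, stop, err := listenLocal()
	if err != nil {
		panic(err)
	}
	fmt.Println("listener up:  ", classify(probe(addr, defaultTimeout)))
	stop()
	fmt.Println("listener down:", classify(probe(addr, defaultTimeout)))
}
```

To check a node, `probe` would be called with the node's own host and ws port in place of the local listener address.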
Behavior in v0.6.5
You can connect to ws after restarting, but it works for only 5-10 minutes; then you get the error and need another restart.
Behavior in v0.7.0
Today I upgraded to v0.7.0. The longest working stretch I have seen is about 1 hour; within that hour I could subscribe to newHead and newTx, and I tested it many times, so I thought it had been fixed in v0.7.0.
But when I tried again just now, I couldn't connect to the ws server anymore. Even restarting cronosd doesn't help.
The issue is still there.
@yihuang Please help to check, thx.
What do you think the problem is? I am willing to dig into the issue, so please share your findings.
My findings
In tendermint/state/txindex/indexer_service.go
an unbuffered channel is used for the block header and tx subscriptions, which may block the publisher:
blockHeadersSub, err := is.eventBus.SubscribeUnbuffered(
	context.Background(),
	subscriber,
	types.EventQueryNewBlockHeader)
if err != nil {
	return err
}
txsSub, err := is.eventBus.SubscribeUnbuffered(context.Background(), subscriber, types.EventQueryTx)
if err != nil {
	return err
}
In tendermint/libs/pubsub/pubsub.go
I did observe the send of an event message getting blocked; execution stops at the line marked with -->.
func (state *state) send(msg interface{}, events map[string][]string) error {
	for qStr, clientSubscriptions := range state.subscriptions {
		q := state.queries[qStr].q
		match, err := q.Matches(events)
		if err != nil {
			return fmt.Errorf("failed to match against query %s: %w", q.String(), err)
		}
		if match {
			for clientID, subscription := range clientSubscriptions {
				if cap(subscription.out) == 0 {
					// block on unbuffered channel
-->					subscription.out <- NewMessage(msg, events)
				} else {
					// don't block on buffered channels
					select {
					case subscription.out <- NewMessage(msg, events):
					default:
						state.remove(clientID, qStr, ErrOutOfCapacity)
					}
				}
			}
		}
	}
	return nil
}
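The two branches above can be reproduced in isolation. This is a minimal standard-library sketch, not Tendermint code: `deliver`, `stalls` and `demoWait` are hypothetical names for this illustration. It shows that a send on an unbuffered channel with no ready receiver does not return, while the buffered path returns immediately either way.

```go
package main

import (
	"fmt"
	"time"
)

// demoWait is an arbitrary illustrative timeout.
const demoWait = 100 * time.Millisecond

// deliver mirrors the two branches of send: block on an unbuffered
// channel, or try a non-blocking send on a buffered one.
// It reports whether the message was handed over.
func deliver(out chan string, msg string) bool {
	if cap(out) == 0 {
		out <- msg // blocks until a receiver is ready
		return true
	}
	select {
	case out <- msg:
		return true
	default:
		// Buffer full. The real code removes the subscription at this
		// point; this sketch only reports the failure.
		return false
	}
}

// stalls reports whether deliver fails to return within wait.
// A stalled call leaves its goroutine parked, which is acceptable
// only in a short-lived demo.
func stalls(out chan string, wait time.Duration) bool {
	done := make(chan struct{})
	go func() {
		deliver(out, "event")
		close(done)
	}()
	select {
	case <-done:
		return false
	case <-time.After(wait):
		return true
	}
}

func main() {
	fmt.Println("unbuffered, no reader, stalls: ", stalls(make(chan string), demoWait))
	fmt.Println("buffered(1), no reader, stalls:", stalls(make(chan string, 1), demoWait))
}
```

Because the pubsub loop calls the blocking branch inline, one subscriber that stops reading would hold up delivery to every other subscriber until it reads again.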
Solution
I changed the blockHeadersSub, err := is.eventBus.SubscribeUnbuffered
and txsSub, err := is.eventBus.SubscribeUnbuffered
calls to use buffered channels. So far so good; let me keep observing for a while.
Three days have passed and it still works. @yihuang
Awesome, so the issue is a deadlock on the unbuffered channel? Can you open a PR to tendermint directly?
I don't know how to effectively reproduce the issue and am not sure if there are other side effects.
Hi @huahuayu, I think blocking on the unbuffered channel
was designed for the indexer service in Tendermint: it guarantees that every event is processed by the indexer. If the indexer is under heavy I/O load, it will certainly block the pubsub module temporarily.
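The trade-off described here can be sketched with plain channels. This is a simplified model, not Tendermint code: `publish`, `run` and `busyFor` are hypothetical names, the sleep stands in for an indexer busy with I/O, and the buffered branch only counts drops, whereas Tendermint cancels the whole subscription on overflow.

```go
package main

import (
	"fmt"
	"time"
)

// busyFor models how long the consumer is occupied before it reads.
const busyFor = 50 * time.Millisecond

// publish sends n events to out and returns how many were dropped.
// Unbuffered: blocking send, nothing is dropped.
// Buffered: non-blocking send, events are dropped while the buffer is full.
func publish(out chan int, n int) (dropped int) {
	for i := 0; i < n; i++ {
		if cap(out) == 0 {
			out <- i
			continue
		}
		select {
		case out <- i:
		default:
			dropped++
		}
	}
	close(out)
	return dropped
}

// run publishes n events to a consumer that only starts reading after
// it has been busy for a while, and reports what arrived and what was lost.
func run(capacity, n int, busy time.Duration) (received, dropped int) {
	out := make(chan int, capacity)
	result := make(chan int)
	go func() { result <- publish(out, n) }()

	time.Sleep(busy) // the consumer is busy and not reading yet
	for range out {
		received++
	}
	return received, <-result
}

func main() {
	r, d := run(0, 5, busyFor)
	fmt.Printf("unbuffered:  received=%d dropped=%d\n", r, d)
	r, d = run(2, 5, busyFor)
	fmt.Printf("buffered(2): received=%d dropped=%d\n", r, d)
}
```

In this model the unbuffered publisher waits out the busy period and loses nothing, while the buffered publisher finishes right away but loses whatever did not fit in the buffer.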
What are your experimental_websocket_write_buffer_size and experimental_subscription_buffer_size in config.toml? They shouldn't be 0; with non-zero values you will get a buffered channel subscription.
Do you need to use the indexer service from the node? Maybe you can set it to null and see if this issue still happens.
I think the ws server issue will eventually be fixed by this solution: https://github.com/crypto-org-chain/cronos/issues/665