Devnet public light node issues after being offline for a while

Open coesensbert opened this issue 2 years ago • 10 comments

From Thursday evening (01/12) until Friday morning, the public devnet light node was offline for about 16 hours due to datacenter networking issues. It kept running but was unreachable for anyone or anything.

Since it came back online, there have been a large number of open TCP connections to the node.

    root@tfchain-dev-pub-light01:~# lsof -n -p 70767 | wc -l
    18934

We saw this before on mainnet with the gridproxy issues, but this is different: these connections come from many different IPs. Lee extracted a list of the IPs with their number of open connections:

     74 2600:1700:c1e0:1150::2c
    176 2600:1700:c1e0:1150::49
    561 2600:1700:c1e0:1150:7c7a:cfff:fe76:d51f
    559 2600:1700:c1e0:1150:d448:2dff:fe8e:2eb8
    207 2a02:1802:5e:16:27f5:1282:21b5:a356
     39 2a02:1802:5e:16:7cea:e670:64e6:acac
    211 2a02:1802:5e:16:b618:3cc8:73f0:a901
    202 2a02:1802:5e:16:b91:530f:7225:ad91
    204 2a10:b600:0:9:23d1:b915:c8db:8cd1
    215 2a10:b600:0:9:2db1:6689:c235:53ac
    273 2a10:b600:0:9:34fd:b37e:92d8:1b8d
    209 2a10:b600:0:9:4981:33c:ef40:d05a
    211 2a10:b600:0:9:5f1f:2754:3cd0:d098
    208 2a10:b600:0:9:77ca:b3c0:459e:cb5e
    209 2a10:b600:0:9:8f18:d624:13c6:d404
    208 2a10:b600:0:9:b08e:cf24:2af1:198f
    210 2a10:b600:0:9:c6e7:862b:891c:7d15
    210 2a10:b600:0:9:d407:accb:cad9:9add
    210 2a10:b600:0:9:db91:142f:a8ba:5a1d
    208 2a10:b600:0:9:ddc5:f3de:e0c2:a865
    207 2a10:b600:0:9:f404:31b8:30a7:8416
    208 2a10:b600:0:9:f7b1:6368:f63b:6971
      1 2a10:b600:0:be77:5213:ad41:aba2:8d38
      1 2a10:b600:0:be77:6550:4bf9:3ffe:fdac
    762 2a10:b600:1:0:1459:4bff:fe15:966a
    758 2a10:b600:1:0:149b:c2ff:fe41:d0b9
    748 2a10:b600:1:0:40f9:b5ff:fe38:6188
    760 2a10:b600:1:0:415:f1ff:fe0f:c9b1
    738 2a10:b600:1:0:4ce9:edff:fe24:39c1
    755 2a10:b600:1:0:60c9:b1ff:fe70:7e32
    755 2a10:b600:1:0:8aa:abff:fe74:6ff5
    751 2a10:b600:1:0:a832:1ff:fe2d:93c
    757 2a10:b600:1:0:b40b:faff:fedb:7a7c
    761 2a10:b600:1:0:b868:4ff:fe50:ccf5
    757 2a10:b600:1:0:bc70:e3ff:fe9a:18b7
    758 2a10:b600:1:0:ccc1:81ff:fef4:c9ff
    755 2a10:b600:1:0:d0e4:afff:feb0:7599
    760 2a10:b600:1:0:d462:73ff:fef0:e6dd
    679 2a10:b600:1:0:d8fd:3fff:fee5:f485
    761 2a10:b600:1:0:f823:1ff:fe0c:66f2
    766 2a10:b600:1:0:f82b:68ff:fea7:f306
    762 2a10:b600:1:0:f833:d4ff:feb4:cf7d
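The command Lee used is not shown here. As a rough sketch, a per-IP count like the one above could be produced from `lsof` output along these lines. The function name `count_remote_ips` is made up for illustration, and the parsing assumes the usual nine-column `lsof` layout, with the connection in the NAME column as `local->remote` and numeric ports (`-P`):

```shell
#!/bin/sh
# Hypothetical helper: reads `lsof -n -P -i TCP` output on stdin and
# prints "count remote_ip" for every remote address with a connection.
count_remote_ips() {
    awk '
        # Only lines whose NAME column (field 9) is local->remote;
        # this skips the header and LISTEN sockets.
        $9 ~ /->/ {
            split($9, ends, "->")
            remote = ends[2]
            sub(/:[0-9]+$/, "", remote)   # drop the numeric port
            sub(/^\[/, "", remote)        # drop IPv6 brackets
            sub(/\]$/, "", remote)
            print remote
        }
    ' | sort | uniq -c
}

# Usage (70767 is the node PID from the lsof call earlier in this issue):
#   lsof -n -P -a -i TCP -p 70767 | count_remote_ips
```

Unlike the plain `lsof -n -p 70767 | wc -l` total, which counts every open file of the process plus the header line, this only counts TCP sockets that have a remote end.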

At that time there were, of course, lots of connectivity errors towards tfchain.dev.grid.tf: https://mon.grid.tf/explore?orgId=1&left=%5B%221669914000000%22,%221669915859000%22,%22Loki%22,%7B%22refId%22:%22A%22,%22expr%22:%22%7Bnetwork%3D%5C%22development%5C%22%7D%22%7D%5D

Once the node was online again the errors stopped, but the number of connections rose dramatically: https://mon.grid.tf/explore?orgId=1&left=%5B%221669971600000%22,%221669975200000%22,%22Loki%22,%7B%22refId%22:%22A%22,%22expr%22:%22%7Bnetwork%3D%5C%22development%5C%22%7D%22%7D%5D

Since the light node only keeps the last 1000 blocks, that might be the reason for the current situation. Is ZOS maybe trying to fetch blocks that are already gone from the light node?

coesensbert avatar Dec 05 '22 12:12 coesensbert