Node stopped synchronizing after updating to the new docker image
Hi @maebeam, I'm back with the server having stopped synchronising.
After yesterday's update of the docker image to "19c446547510f1e0b83d56611e732e3fa6a0b32d" the server stopped syncing; the last post I can see is from 12 hours ago. It has also started ignoring super_admins: at the moment I can see only new posts but not server stats (I was able to for some time yesterday).
Docker compose file:
version: "3.7"
services:
backend:
container_name: backend
image: docker.io/bitclout/backend:19c446547510f1e0b83d56611e732e3fa6a0b32d
command: run
volumes:
- db:/db
- ./:/bitclout/run
ports:
- 17001:17001
- 17000:17000
env_file:
- dev.env
expose:
- "17001"
- "17000"
frontend:
container_name: frontend
image: docker.io/bitclout/frontend:23d22a586e70b2f6700f01ab4feabe98e53ea991
ports:
- 8080:8080
volumes:
- ./:/app
env_file:
- dev.env
expose:
- "8080"
nginx:
container_name: nginx
image: nginx:latest
command: "/bin/sh -c 'while :; do sleep 6h & wait $${!}; nginx -s reload; done & nginx -g \"daemon off;\"'"
volumes:
- ./nginx.dev:/etc/nginx/nginx.conf
- ./data/certbot/conf:/etc/letsencrypt
- ./data/certbot/www:/var/www/certbot
depends_on:
- backend
- frontend
ports:
- 80:80
- 443:443
certbot:
image: certbot/certbot
entrypoint: "/bin/sh -c 'trap exit TERM; while :; do certbot renew; sleep 12h & wait $${!}; done;'"
volumes:
- ./data/certbot/conf:/etc/letsencrypt
- ./data/certbot/www:/var/www/certbot
volumes:
db:
Full logs attached: full.log
@MirekR Is there anything in dmesg like running out of memory, too many open files, or running out of disk space?
Those are the main reasons I have seen nodes crash or stop syncing.
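For reference, a quick way to check for those conditions on the host looks roughly like this (standard Linux tooling, nothing node-specific; adjust to wherever your volumes live):

# Check the kernel log for OOM kills
dmesg -T | grep -i -E 'out of memory|oom-killer'
# Check free disk space on the volume backing /db
df -h
# Check the open-file limit for the current shell
ulimit -n
# Rough per-container memory/CPU usage
docker stats --no-stream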
Ah, looking at your logs, you did not clear the data in the /db volume. As per the changelog on backend, you need to clear it and resync.
Please reopen this if wiping the /db volume does not resolve it.
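With the compose file above, the chain data lives in the named volume "db", so wiping it looks roughly like this (the real volume name is prefixed with your compose project name, so check docker volume ls first - treat this as a sketch):

# Stop the stack, remove the chain data volume, start again to resync
docker-compose down
docker volume ls                 # find the actual name, e.g. <project>_db
docker volume rm <project>_db
docker-compose up -d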
I did clear the db and I am still stuck - should we re-open?
Sure
We shouldn't need to wipe the /db volume every time a change comes in; it kills the global feed as well.
@MirekR We haven't had to - before NFTs, mine had not had to resync for 2 months.
Also, the main issue with resync is not so much the blocks - they happen relatively quickly.
It's the TxIndex, which is getting replaced by a new DB soon that should make it all much faster and easier.
Performance of re-sync is one question; losing the node's global feed is another, and from the end-user perspective a more serious issue.
You can keep the global feed in 2 ways:
1 - don't delete the whole db - e.g. make sure you keep the one in /db/badgerdb/globalstate (rough sketch after the config below), or
2 - use the config option to load global state from a central node:
# The IP:PORT or DOMAIN:PORT corresponding to a node that can be used to
# set/get global state. When this is not provided, global state is set/fetched
# from a local DB. Global state is used to manage things like user data, e.g.
# emails, that should not be duplicated across multiple nodes.
GLOBAL_STATE_REMOTE_NODE=
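For option 1, a sketch of keeping global state while wiping the rest of the chain data - the volume name and the badger directory layout here are assumptions, so verify against your own /db contents before deleting anything:

# Stop the backend, then remove everything under /db/badgerdb except globalstate
docker-compose stop backend
docker run --rm -v <project>_db:/db alpine \
  sh -c 'cd /db/badgerdb && for d in *; do if [ "$d" != "globalstate" ]; then rm -rf "$d"; fi; done'
docker-compose up -d backend

For option 2, the setting goes into dev.env, e.g. (hypothetical host):

GLOBAL_STATE_REMOTE_NODE=node.example.com:17001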
It's the TxIndex, which is getting replaced by a new DB soon that should make it all much faster and easier.
Do I read this correctly as "we'll need to re-sync again soon and lose global again"?
@MirekR I'm sure the upgrade can be done in such a way that you keep global state.
And note my reply above about ways to not lose global state.
We're in the process of moving the backing store to Postgres, which will make these updates much less painful and more efficient. Global state migration tools will be provided.
@maebeam may I ask where we are with this?
And... has anyone considered a postgres docker image for a faster sync - at least at some set block height?
I was having this reorg issue as recently as last weekend (9/4/21) - https://github.com/bitclout/core/issues/98
Postgres is in beta but not ready to replace badger yet. What errors are you seeing? I haven’t seen any other reports of this issue recently
Sorry for the delay... It's taking a long time to get anywhere - after a complete rebuild and resync.
I get this often - but it's understandable... and at least still moving forward.
E0912 07:27:06.327239 1 peer.go:142] AddBitCloutMessage: Not enqueueing message GET_BLOCKS because peer is disconnecting
I0912 07:27:06.327286 1 server.go:1131] Server._handleBlock: Received block ( 11278 / 59719 ) from Peer [ Remote Address: 35.232.92.5:17000, DISCONNECTED PeerID=2 ]
It's been 7 hours and still only at 11k - but like I said it's moving.
Plus it turns out I was getting the "reorg" error loop at around 9427 - so at least we are beyond that:
I0906 21:14:55.712786 1 server.go:1131] Server._handleBlock: Received block ( 9427 / 58162 ) from Peer [ Remote Address: 35.232.92.5:17000, DISCONNECTED PeerID=20 ]
E0906 21:14:55.733306 1 server.go:1125] Server._handleBlock: Encountered an error processing block <Header: < 9427, 00000000009c5230c9fbce0c6369633d751a445f9bc35d3448390821ae7eb2dd, 0 >, Signer Key: NONE>. Disconnecting from peer [ Remote Address: 35.232.92.5:17000, DISCONNECTED PeerID=20 ]: Error while processing block: : ProcessBlock: Problem fetching block (< TstampSecs: 1616574709, Height: 9313, Hash: 00000000003f6fa554fb4d68eddf9b5c55809bf419c7c458ead7625d4e759d2f, ParentHash 00000000001a0d631b5a863482e1857f5d96aeda65250f2486824e73b57af068, Status: HEADER_VALIDATED | BLOCK_PROCESSED | BLOCK_STORED, CumWork: 8311532279091880>) during attach in reorg: Key not found
I'll keep it going and see what develops
Do you have a fast and reliable internet connection? I've only seen this type of behavior on low bandwidth / spotty internet.
I have a fast connection, but also this is running on Google Cloud Platform:
e2-standard-8 (8 vCPUs, 32 GB memory), Ubuntu 20.04, with a static IP
Following up:
- started totally new on a new server instance
- read-only set to true + admin and super-admin keys set (similar to before)
- the sync speed was improved - got to the 10k mark within 2 hours
- ran into the reorg issue at 18k - after a reboot (during the sync, of course); it just kept looping over a few blocks with failure... disconnecting...
- completely started over using:
docker-compose -f file.yml down --remove-orphans --volumes; docker image prune -a; ./run.sh -d
- left it alone and now it's fully synced - in less than 12 hours
Very weird. You may have faster syncing luck using larger SSD volumes. IOPS are determined by disk size on GCP
lol. I thought 12 was fast, but even that is a barrier tbf.
This latest setup was on a 250 GB SSD - but I hadn't considered the IOPS rate. Good point
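For rough context: GCP SSD persistent disks provision IOPS in proportion to size (on the order of 30 IOPS per GB per their docs at the time), so a 250 GB disk gets roughly 7,500 IOPS while 500 GB gets roughly 15,000. If resizing is an option, it looks roughly like this (disk name, zone, and device are placeholders):

# Grow the persistent disk backing the VM
gcloud compute disks resize my-node-disk --size=500GB --zone=us-central1-a
# Then grow the partition and filesystem inside the VM if they don't auto-expand,
# e.g. for ext4 on /dev/sda1:
sudo growpart /dev/sda 1
sudo resize2fs /dev/sda1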
One day when nodes can go from 0 to sync in 5 minutes, devs will be able to focus on building a business instead of maintaining a node. And I'm sure us early adopters will profit by extension :smiley_cat: