
Node stopped synchronizing after updating to the new Docker image

[Open] MirekR opened this issue 4 years ago • 20 comments

Hi @maebeam, I'm back; the server has stopped synchronising again.

After updating the Docker image yesterday to "19c446547510f1e0b83d56611e732e3fa6a0b32d", the server stopped syncing; the last post I can see is from 12h ago. It has also started ignoring super_admins: at the moment I can see only new posts but not the server stats (I was able to for some time yesterday).

Docker compose file:

version: "3.7"
services:
  backend:
    container_name: backend
    image: docker.io/bitclout/backend:19c446547510f1e0b83d56611e732e3fa6a0b32d
    command: run
    volumes:
    - db:/db
    - ./:/bitclout/run  
    ports:
    - 17001:17001
    - 17000:17000
    env_file:
    - dev.env
    expose:
    - "17001"
    - "17000"
  frontend:
    container_name: frontend 
    image: docker.io/bitclout/frontend:23d22a586e70b2f6700f01ab4feabe98e53ea991
    ports:
    - 8080:8080
    volumes:
    - ./:/app
    env_file:
    - dev.env
    expose:
    - "8080"
  nginx: 
    container_name: nginx
    image: nginx:latest
    command: "/bin/sh -c 'while :; do sleep 6h & wait $${!}; nginx -s reload; done & nginx -g \"daemon off;\"'"
    volumes:
      - ./nginx.dev:/etc/nginx/nginx.conf
      - ./data/certbot/conf:/etc/letsencrypt
      - ./data/certbot/www:/var/www/certbot
    depends_on: 
      - backend
      - frontend
    ports:
      - 80:80
      - 443:443
  certbot:
    image: certbot/certbot
    entrypoint: "/bin/sh -c 'trap exit TERM; while :; do certbot renew; sleep 12h & wait $${!}; done;'"
    volumes:
      - ./data/certbot/conf:/etc/letsencrypt
      - ./data/certbot/www:/var/www/certbot
volumes:
  db:

Full logs attached. full.log

MirekR · Jul 30 '21 10:07

@MirekR is there anything in dmesg, such as running out of memory, too many open files, or low disk space?

Those are the main reasons I have seen nodes crash or stop syncing.
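
For reference, a quick way to run those checks on the host might look roughly like this (the backend container name comes from the compose file above; adjust the /var/lib/docker path if Docker's data dir lives elsewhere):

# look for OOM kills and file-descriptor exhaustion
dmesg -T | grep -iE 'out of memory|oom|too many open files'

# check free space on the filesystem holding the docker volumes (incl. /db)
df -h /var/lib/docker

# check the open-file limit and current container resource usage
ulimit -n
docker stats --no-stream backend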

tijno · Jul 30 '21 14:07

Ah, looking at your logs, you did not clear the data in the /db volume. As per the changelog on backend, you need to clear it and resync.
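
If it helps, one way to do a clean wipe with the compose file above is roughly this (the exact volume name depends on your compose project directory, so confirm it with docker volume ls first):

# stop the stack, remove the named db volume, then resync from scratch
docker-compose down
docker volume ls | grep db
docker volume rm <project>_db
docker-compose up -d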

tijno · Jul 30 '21 15:07

Please reopen this if wiping the /db volume does not resolve it.

maebeam · Jul 30 '21 17:07

I did clear the db and I am stuck, should we re-open?

marnimelrose · Jul 30 '21 17:07

Sure

maebeam · Jul 30 '21 17:07

We shouldn't need to wipe the /db volume every time a change comes in; it kills the global feed as well.

MirekR · Aug 02 '21 08:08

@MirekR We haven't had to; before the NFT update, my node had not needed a resync for 2 months.

Also, the main cost of a resync is not so much the blocks; they sync relatively quickly.

It's the TxIndex, which is getting replaced by a new DB soon; that should make it all much faster and easier.

tijno · Aug 02 '21 15:08

Performance of the resync is one question; losing the node's global feed is another, and from the end-user perspective a more serious issue.

MirekR · Aug 02 '21 16:08

You can keep the global feed in 2 ways:

1 - don't delete the whole db, e.g. make sure you keep the one in /db/badgerdb/globalstate, or

2 - use the config option to load global state from a central node (see the sketch after this snippet):

# The IP:PORT or DOMAIN:PORT corresponding to a node that can be used to
# set/get global state. When this is not provided, global state is set/fetched
# from a local DB. Global state is used to manage things like user data, e.g.
# emails, that should not be duplicated across multiple nodes.
GLOBAL_STATE_REMOTE_NODE=
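
A rough sketch of both options, assuming the container name and paths from the compose file earlier in this thread (node.example.com:17001 is just a placeholder, use whatever node you trust):

# option 1: back up global state while the backend container is still running,
# wipe the rest of /db, resync, then copy globalstate back into place
docker cp backend:/db/badgerdb/globalstate ./globalstate.bak

# option 2: in dev.env, fetch global state from a remote node instead of the local DB
GLOBAL_STATE_REMOTE_NODE=node.example.com:17001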

tijno · Aug 02 '21 16:08

It's the TxIndex, which is getting replaced by a new DB soon; that should make it all much faster and easier.

Do I read this correctly as "we'll need to resync again soon and lose global state again"?

MirekR · Aug 02 '21 16:08

@MirekR I'm sure the upgrade can be done in such a way that you keep global state.

And note my reply above about ways to not lose global state.

tijno · Aug 02 '21 16:08

We're in the process of moving the backing store to Postgres, which will make these updates much less painful and more efficient. Global state migration tools will be provided.

maebeam · Aug 02 '21 16:08

@maebeam may I ask where we are with this?

And... has anyone considered a Postgres Docker image for a faster sync, at least up to some set block height?

I was having this reorg issue as recently as last weekend (9/4/21) - https://github.com/bitclout/core/issues/98

addanus · Sep 11 '21 03:09

Postgres is in beta but not ready to replace Badger yet. What errors are you seeing? I haven't seen any other reports of this issue recently.

maebeam · Sep 11 '21 22:09

Sorry for the delay... It's taking a long time to get anywhere after a complete rebuild and resync.

I get this often, but it's understandable... and at least it's still moving forward.

E0912 07:27:06.327239 1 peer.go:142] AddBitCloutMessage: Not enqueueing message GET_BLOCKS because peer is disconnecting

I0912 07:27:06.327286 1 server.go:1131] Server._handleBlock: Received block ( 11278 / 59719 ) from Peer [ Remote Address: 35.232.92.5:17000, DISCONNECTED PeerID=2 ]

It's been 7 hours and still only at 11k - but like I said it's moving.

Plus it turns out I was getting the "reorg" error loop at around 9427 - so at least we are beyond that:

I0906 21:14:55.712786 1 server.go:1131] Server._handleBlock: Received block ( 9427 / 58162 ) from Peer [ Remote Address: 35.232.92.5:17000, DISCONNECTED PeerID=20 ]

E0906 21:14:55.733306 1 server.go:1125] Server._handleBlock: Encountered an error processing block <Header: < 9427, 00000000009c5230c9fbce0c6369633d751a445f9bc35d3448390821ae7eb2dd, 0 >, Signer Key: NONE>. Disconnecting from peer [ Remote Address: 35.232.92.5:17000, DISCONNECTED PeerID=20 ]: Error while processing block: : ProcessBlock: Problem fetching block (< TstampSecs: 1616574709, Height: 9313, Hash: 00000000003f6fa554fb4d68eddf9b5c55809bf419c7c458ead7625d4e759d2f, ParentHash 00000000001a0d631b5a863482e1857f5d96aeda65250f2486824e73b57af068, Status: HEADER_VALIDATED | BLOCK_PROCESSED | BLOCK_STORED, CumWork: 8311532279091880>) during attach in reorg: Key not found

I'll keep it going and see what develops

addanus · Sep 12 '21 07:09

Do you have a fast and reliable internet connection? I've only seen this type of behavior on low bandwidth / spotty internet.

maebeam · Sep 12 '21 08:09

I have a fast connection, but also this is running on Google Cloud Platform:

e2-standard-8 (8 vCPUs, 32 GB memory), Ubuntu 20.04, with a static IP.

addanus · Sep 12 '21 08:09

Following up:

  1. Started totally new on a new server instance.
  2. Read-only set to true + admin and super-admin keys set (similar to before).
  3. The sync speed was improved; got to the 10k mark within 2 hours.
  4. Ran into the reorg issue at 18k, after a reboot (during the sync of course). It just kept looping over a few blocks with failure... disconnecting...
  5. Completely started over using: docker-compose -f file.yml down --remove-orphans --volumes; docker image prune -a; ./run.sh -d
  6. Left it alone and now fully synced in less than 12 hours.

addanus · Sep 13 '21 16:09

Very weird. You may have faster syncing luck using larger SSD volumes; IOPS are determined by disk size on GCP.
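
For anyone hitting the same wall, growing the disk on GCP is roughly this (disk name, size, zone, and device names below are placeholders; the filesystem also has to be grown on the VM afterwards):

# resize the persistent disk backing the VM
gcloud compute disks list
gcloud compute disks resize my-node-disk --size=500GB --zone=us-central1-a

# then, on the VM (Ubuntu 20.04 with an ext4 root on /dev/sda1)
sudo growpart /dev/sda 1
sudo resize2fs /dev/sda1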

maebeam · Sep 14 '21 13:09

lol. I thought 12 hours was fast, but even that is a barrier, tbf.

This latest setup was on a 250 GB SSD, but I hadn't considered the IOPS rate. Good point.

One day, when nodes can go from zero to synced in 5 minutes, devs will be able to focus on building a business instead of maintaining a node. And I'm sure we early adopters will profit by extension :smiley_cat:

addanus · Sep 14 '21 14:09