
"Database compacting, degraded performance database" for 2 days already

Open fer662 opened this issue 4 years ago • 38 comments

My node was already fully synced and working fine; it randomly started doing this and it isn't finishing or showing any progress.

Geth version: Version: 1.9.25-stable

OS version: Ubuntu 20.04.1 LTS

Parameters: geth --cache 4096 --http --http.api "txpool,personal,db, eth, net, web3" --ws --ws.api "web3, eth" --maxpeers=15 --datadir /mnt/geth --datadir.ancient /mnt/freezer --http.corsdomain * --ws.origins * --rpcvhosts=* --port 2600

/mnt/geth is an SSD, /mnt/freezer is an HDD

Any help is appreciated. I need to get my node working.

fer662 avatar Dec 22 '20 16:12 fer662

I saw a bunch of people in forums with this issue, and they ended up deleting chaindata and resyncing, which, with my connection, would take quite a while. I'm trying the --nocompaction flag now, will report back.

fer662 avatar Dec 23 '20 14:12 fer662

What kind of IO performance do you have on your SSD, and what filesystem are you running on it? It would also be great if you could provide some logs. Please also do not use the --nocompaction flag.

ligi avatar Jan 07 '21 09:01 ligi

This issue has been automatically closed because there has been no response to our request for more information from the original author. With only the information that is currently in the issue, we don't have enough information to take action. Please reach out if you have more relevant information or answers to our questions so that we can investigate further.

no-response[bot] avatar Jan 22 '21 18:01 no-response[bot]

hey @fer662! what happened here? did you ever fix the problem?

TennisBowling avatar Feb 13 '21 00:02 TennisBowling

Sadly I had to resync from scratch. There was no way to resolve it.

fer662 avatar Feb 13 '21 00:02 fer662

have you finished syncing it?

TennisBowling avatar Feb 13 '21 00:02 TennisBowling

My database was compacting for 10m, then it continued with its regular stuff.

TennisBowling avatar Feb 13 '21 00:02 TennisBowling

Yes, it took a reasonable amount of time that I cannot remember exactly and has been running fine since. The hardware is the same, and the issue didn't stem from it for sure.

fer662 avatar Feb 13 '21 00:02 fer662

I recommend closing this issue.

TennisBowling avatar Feb 20 '21 04:02 TennisBowling

Happened to me when I upgraded to Geth 1.10.1 from 1.9.25 on Ropsten. The 1.9.25 node was fully synced at the time of the operation.

I first stopped the running 1.9.25 instance. Took a GCP volume snapshot. Created a new container with a new disk created using the snapshot I just took and started the 1.10.1 instance.

At launch, Geth 1.10.1 took around 10m even before initializing the IPC file. While doing this, htop showed heavy disk usage, around 160Mb/s (I don't recall exactly if it was more on the read or the write side).

After ~10m, the node started the IPC, connected to the network and synced around 10 blocks, then entered the database compacting stage. This ran for around 5h until I gave up.

Here is the leveldb.stats output on the 1.9.25 instance:

 Level |   Tables   |    Size(MB)   |    Time(sec)  |    Read(MB)   |   Write(MB)
-------+------------+---------------+---------------+---------------+---------------
   1   |         50 |      99.27759 |      35.00957 |     814.65889 |     771.45813
   2   |        647 |     999.75740 |      46.79872 |    1487.34292 |    1485.75268
   3   |       4844 |    9097.79491 |     325.22367 |    5466.02774 |    4459.89624
   4   |        539 |     400.50786 |       0.00000 |       0.00000 |       0.00000
   5   |      68207 |  136884.92513 |       0.00000 |       0.00000 |       0.00000
-------+------------+---------------+---------------+---------------+---------------
 Total |      74287 |  147482.26288 |     407.03196 |    7768.02954 |    6717.10706

And here is the same output on 1.10.1 after running for 5 hours (in fact, after having force-killed it):

 Level |   Tables   |    Size(MB)   |    Time(sec)  |    Read(MB)   |   Write(MB)
-------+------------+---------------+---------------+---------------+---------------
   0   |      74352 |  148558.54873 |       0.00000 |       0.00000 |       0.00000
-------+------------+---------------+---------------+---------------+---------------
 Total |      74352 |  148558.54873 |       0.00000 |       0.00000 |       0.00000

It seems it was trying to compact that gigantic number of L0 tables. The LOG file in the chaindata folder seems to agree:

cat LOG | grep com
23:06:01.461484 table@compaction L0·74352 -> L1·0 S·145GiB Q·1243574076

And to get some estimate of the time it would have taken:

cat LOG | grep com
23:06:01.461484 table@compaction L0·74352 -> L1·0 S·145GiB Q·1243574076
23:08:28.748009 table@build created L1@8383020 N·9976 S·2MiB "\x00\x00\x00..'\xc6P,v268503960":"\x00\x01\x10..\x86t\xd9,v591541816"
...
04:27:32.131617 table@build created L1@8384256 N·9834 S·2MiB

So from my (limited) understanding of leveldb, it compacted 1236 tables (8384256 - 8383020) in ~5h, so ~247 tables/h, which I imagine means roughly 12 days in total (74352 / 247 / 24 ~= 12). The compaction throughput is really, really low, around 150Kb/s to 300Kb/s in both read and write. A single CPU core was maxed out at 100% the whole time.
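For reference, here is a quick reproduction of that back-of-the-envelope estimate in Go (the constants are just the figures quoted from the LOG excerpt above; the result lands in the same 12-13 day ballpark, give or take depending on where you cut the window):

package main

import (
	"fmt"
	"time"
)

func main() {
	// Figures taken from the LOG excerpt above.
	const totalL0Tables = 74352             // from "table@compaction L0·74352 -> L1·0"
	built := 8384256 - 8383020              // L1 tables created between the two timestamps
	elapsed := 5*time.Hour + 19*time.Minute // 23:08:28 -> 04:27:32

	rate := float64(built) / elapsed.Hours() // tables per hour
	etaDays := float64(totalL0Tables) / rate / 24

	fmt.Printf("%d tables in %s => %.0f tables/h, ~%.0f days to work through %d L0 tables\n",
		built, elapsed, rate, etaDays, totalL0Tables)
}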

Now, why I ended up with that many L0 tables, I have no idea. Not sure what happened to cause this. I'll redo a round of restore + restart to see if I can avoid the switch to all L0 tables.

I upgraded our other nodes (4 in total) in place (i.e. without a volume snapshot + new disk) without problems, so I'm not sure what part of the volume snapshot process caused this.

maoueh avatar Mar 10 '21 04:03 maoueh

Another try at restoring from a snapshot led to similar results. This time I stopped the Geth process early, then ran geth db stats, which took around ~8 minutes before returning a result.

I monitored disk usage more closely this time; the bulk of the I/O was reads at around 360MB/s (which probably tops out the cloud provider's disk throughput), with some infrequent write operations but with high peaks.

The problem is probably on our side though, I think; something must have changed. I'll report back if I find out what the problem is.

maoueh avatar Mar 10 '21 05:03 maoueh

I just realized that this particular node on 1.9.25 has no L0 tables at all; could that be what causes leveldb to move everything back to L0 when I restore the volume?

The node from which I created the volume (1.9.25):

 Level |   Tables   |    Size(MB)   |    Time(sec)  |    Read(MB)   |   Write(MB)
-------+------------+---------------+---------------+---------------+---------------
   1   |         50 |      99.27759 |      35.00957 |     814.65889 |     771.45813
   2   |        647 |     999.75740 |      46.79872 |    1487.34292 |    1485.75268
   3   |       4844 |    9097.79491 |     325.22367 |    5466.02774 |    4459.89624
   4   |        539 |     400.50786 |       0.00000 |       0.00000 |       0.00000
   5   |      68207 |  136884.92513 |       0.00000 |       0.00000 |       0.00000
-------+------------+---------------+---------------+---------------+---------------
 Total |      74287 |  147482.26288 |     407.03196 |    7768.02954 |    6717.10706

Another node we have that is now migrated to 1.10.1 (from 1.9.25, in-place update):

> debug.chaindbProperty("")
Compactions
 Level |   Tables   |    Size(MB)   |    Time(sec)  |    Read(MB)   |   Write(MB)
-------+------------+---------------+---------------+---------------+---------------
   0   |          2 |     579.18027 |     150.50175 |       0.00000 |    6482.17941
   1   |         52 |      98.23994 |     216.76490 |    9124.62485 |    9057.91622
   2   |        515 |     998.03997 |     232.92075 |    9384.97616 |    9234.87690
   3   |       5041 |    9998.09568 |     507.32804 |   20616.17042 |   19961.03069
   4   |       4345 |    3344.61014 |      85.07847 |    2852.64561 |    2853.14748
   5   |      68214 |  136918.26574 |       9.72853 |    1030.72084 |    1021.91198
-------+------------+---------------+---------------+---------------+---------------
 Total |      78169 |  151936.43174 |    1202.32244 |   43009.13788 |   48611.06267
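For anyone who wants to pull these numbers without a running console: as far as I can tell, the per-level breakdown above comes from LevelDB's "leveldb.stats" property, which is what geth db stats and debug.chaindbProperty("") surface. A minimal sketch using goleveldb (the LevelDB implementation geth builds on); the path is hypothetical, and the node must be stopped first since LevelDB holds an exclusive lock:

package main

import (
	"fmt"
	"log"

	"github.com/syndtr/goleveldb/leveldb"
	"github.com/syndtr/goleveldb/leveldb/opt"
)

func main() {
	// Hypothetical path; point it at your own chaindata directory.
	path := "/mnt/geth/geth/chaindata"

	// Read-only open avoids modifying the database; note that it can
	// still fail if the database needs journal recovery.
	db, err := leveldb.OpenFile(path, &opt.Options{ReadOnly: true})
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Same property that produces the per-level table/size breakdown above.
	stats, err := db.GetProperty("leveldb.stats")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(stats)
}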

maoueh avatar Mar 10 '21 16:03 maoueh

Same problem here, after a power outage that stopped my node for 4 days. Until then the node had been working without problems.

So most people had to resync the whole chaindata. Unreliable. Seems a joke.

manelio avatar Mar 20 '21 22:03 manelio

This happened to me while syncing an archive node, about two days (1.2 TB) in. I just let it run for a while, and it resumed its normal behavior after ~5 minutes.

zeroXbrock avatar May 07 '21 02:05 zeroXbrock

I started to see this issue after upgrading my geth version to:

Version: 1.10.4-stable
Git Commit: aa637fd38a379db6da98df0d520fb1c5139a18ce
Architecture: amd64
Go Version: go1.16.4
Operating System: linux
GOPATH=
GOROOT=go

With my earlier version, whose version number I did not record, I managed to sync geth just fine.

I start with a fresh data folder. After a while (~1h) into the snap sync, geth stops syncing and only does database compaction.

Here is a log extract:

INFO [06-22|19:23:23.632] Imported new block headers               count=2048 elapsed=354.813ms number=2,837,698  hash=426fb5..cbbc99 age=4y6mo3w
INFO [06-22|19:23:24.001] Imported new block headers               count=2048 elapsed=359.921ms number=2,839,746  hash=097d57..0381e9 age=4y6mo3w
INFO [06-22|19:23:24.582] Imported new block receipts              count=2048 elapsed=1.018s    number=2,795,628  hash=182155..b305b6 age=4y7mo3d   size=4.13MiB
WARN [06-22|19:23:26.231] Database compacting, degraded performance database=/root/.ethereum/geth/chaindata
WARN [06-22|19:23:28.438] Dropping unsynced node during sync       id=b3f5796c6727a343 conn=inbound addr=86.125.31.39:60186    type=ethereumjs-devp2p/v4...
ERROR[06-22|19:24:10.244] Snapshot extension registration failed   peer=3af25a30 err="peer connected on snap without compatible eth support"
WARN [06-22|19:24:15.699] Dropping unsynced node during sync       id=9b986db754dae7bf conn=inbound addr=15.161.212.85:54304   type=Nethermind/v1.10.73-...
WARN [06-22|19:24:18.716] Dropping unsynced node during sync       id=82b68c45c1649e07 conn=dyndial addr=45.77.49.160:30305    type=Geth/v1.10.2-stable-...
WARN [06-22|19:24:26.242] Database compacting, degraded performance database=/root/.ethereum/geth/chaindata
WARN [06-22|19:24:39.669] Dropping unsynced node during sync       id=d0ad027d764435b3 conn=inbound addr=139.99.9.58:59418     type=Geth/v1.9.24-stable-...
WARN [06-22|19:25:26.254] Database compacting, degraded performance database=/root/.ethereum/geth/chaindata
WARN [06-22|19:25:27.664] Dropping unsynced node during sync       id=9571cfd7011a9334 conn=dyndial addr=45.32.166.77:30305    type=Geth/v1.10.2-stable-...
WARN [06-22|19:25:28.378] Dropping unsynced node during sync       id=c3ce42a8b5ccb91f conn=dyndial addr=143.198.114.251:30303 type=Geth/v1.10.1-stable-...
WARN [06-22|19:26:26.263] Database compacting, degraded performance database=/root/.ethereum/geth/chaindat

Geth never seems to resume syncing after hitting the Database compacting issue; it stays stuck compacting the database forever.

I tried to explicitly disable the database compacting with the command line:

geth --http --cache 8000 --ipcpath /root/geth.ipc --gcmode "archive" --syncmode snap --nocompaction

... but it still does the database compacting, and the --nocompaction flag seems to have no effect.
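As far as I can tell, the compaction behind this warning is LevelDB's own background compaction, which is scheduled internally by the library and cannot be switched off; a flag like --nocompaction presumably only skips the explicit, one-shot compaction geth runs after an import. For illustration, a minimal goleveldb sketch of that explicit kind of compaction (the path is hypothetical and the node must be stopped first):

package main

import (
	"log"

	"github.com/syndtr/goleveldb/leveldb"
	"github.com/syndtr/goleveldb/leveldb/util"
)

func main() {
	// Hypothetical chaindata path; adjust for your own setup.
	db, err := leveldb.OpenFile("/root/.ethereum/geth/chaindata", nil)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Explicitly compact the whole key range. This is the one-shot kind of
	// compaction that can be skipped; the automatic background compactions
	// that trigger the "degraded performance" warning run regardless.
	// On a ~150GB chaindata this can take a very long time.
	if err := db.CompactRange(util.Range{}); err != nil {
		log.Fatal(err)
	}
}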

The result is that I cannot get geth synced. I am not sure if this is related to geth version upgrade or something else.

The server is running with NVMe disk directly attached to it, so it should have plenty of IO power and lack of IO should not be an issue.

If I abort the sync with CTRL+C and restart, geth seems to sync just fine until it hits the first Database compacting and ends up in the same compaction loop.

miohtama avatar Jun 22 '21 17:06 miohtama

Even after issuing a shutdown with CTRL+C, the database keeps compacting with no end in sight.

^CINFO [06-22|20:00:58.344] Got interrupt, shutting down...
INFO [06-22|20:00:58.347] HTTP server stopped                      endpoint=127.0.0.1:8545
INFO [06-22|20:00:58.347] IPC endpoint closed                      url=/root/geth.ipc
WARN [06-22|20:00:58.348] Rewinding blockchain                     target=3,980,454
WARN [06-22|20:01:06.332] Checkpoint challenge timed out, dropping id=dce3679d582caf1b conn=inbound addr=79.142.129.218:34916  type=erigon/v2021.06.4/li...
WARN [06-22|20:01:09.887] Database compacting, degraded performance database=/root/.ethereum/geth/chaindata
WARN [06-22|20:02:09.897] Database compacting, degraded performance database=/root/.ethereum/geth/chaindata
WARN [06-22|20:03:09.908] Database compacting, degraded performance database=/root/.ethereum/geth/chaindata

miohtama avatar Jun 22 '21 18:06 miohtama

I downgraded to Geth 1.10.1 and was then able to sync fine. The database compacting still happens, but it does not grind the process to a halt. Maybe some database compaction settings have been made more aggressive recently?

miohtama avatar Jun 23 '21 07:06 miohtama

Does anyone know why this happens? I am trying to sync a new node (1.10.4); the sync goes very well, but then grinds to a halt for ~25 minutes. It goes completely AWOL to do this compacting. I'm scared to run my validator on a node that just stops for 25 minutes.

ryny24 avatar Jul 03 '21 13:07 ryny24

Hi there. Facing the same issue here, using geth 1.10.5-stable-33ca98ec

Since --nocompaction does not seem to be working, is there any other feedback/recommendation? Thanks!

Jul 16 08:34:42 baremetal01 geth[2830]: WARN [07-16|08:34:42.850] Database compacting, degraded performance database=/var/lib/goethereum/geth/chaindata
Jul 16 08:35:42 baremetal01 geth[2830]: WARN [07-16|08:35:42.862] Database compacting, degraded performance database=/var/lib/goethereum/geth/chaindata
Jul 16 08:36:42 baremetal01 geth[2830]: WARN [07-16|08:36:42.872] Database compacting, degraded performance database=/var/lib/goethereum/geth/chaindata
Jul 16 08:37:42 baremetal01 geth[2830]: WARN [07-16|08:37:42.880] Database compacting, degraded performance database=/var/lib/goethereum/geth/chaindata

fturriaf avatar Jul 16 '21 13:07 fturriaf

note: current information on this issue as of 2021-11-16 is https://github.com/ethereum/go-ethereum/issues/22056#issuecomment-970486232

This issue is tagged 'performance' but it brings down an entire node.

This happened to me (long-term compacting, nothing else) after a power failure. [edit] It looks like it could be exacerbated by failure to shut down properly. My ethereum folder is on a RAID 6 with only 4/5 disks present; the filesystem is ext4 and is mounted with journal_data. Replaying the journal took a very long time on powerup. My system is a Raspberry Pi running Ubuntu amd64 with only 3.3G of userspace RAM and 8G of microSD swap.

I tried downgrading from 1.10.4 to 1.10.1 and this did not resolve the issue.

My node was still syncing when the power failure happened. It was doing a normal sync, after previously trying it for a day or two doing a fast sync. It was running into occasional crashes and I had it on a scripted loop to auto-reboot.

xloem avatar Jul 17 '21 02:07 xloem

Well, there was no interruption on my system. It is on battery backup in a data center. Geth was running with no issues when this happened.

ryny24 avatar Jul 17 '21 13:07 ryny24

I am now getting this reliably, even after wiping chaindata. What's considered an acceptable time to wait to get through database compaction? Here's a trace log of a run: https://gateway.ipfs.io/ipfs/QmUGSGz9QPo2fcU6zSpZ2EsPzo7Zo3PLM4jMoZ9HTQDGH7 (just pinned on my pi atm, so download if needed) edit: ping me if you want me to bring this back up, it's on my ubuntu annexpi's ipfs snap

xloem avatar Jul 17 '21 15:07 xloem

I'm on Geth 1.10.6, syncing for the first time, and getting this Database compacting, degraded performance message as well on a regular basis. It seems to resolve itself after a while and then start again. In fact, this can easily be visualized from a system metrics dashboard:

[Screenshot: Geth database compaction visible in the system metrics dashboard]

CPU and network usage both drop during compaction, which runs for about 5 minutes. This of course also greatly slows down the beacon chain sync time. I do have an Infura fallback, but I'm not sure how fast that kicks in.

I have a 6-core Xeon, but the SSD is a Crucial P2 with no DRAM, so not the fastest breed. Could this be the cause? I didn't observe this behavior on testnet.

PerfectDark99 avatar Jul 28 '21 17:07 PerfectDark99

My raspberry pi is working after being left alone for a few days. I have not seen the issue since. I personally wonder if it relates to network behavior, as I've had other p2p issues, but I have no expertise or experience here.

xloem avatar Jul 29 '21 11:07 xloem

i'm on a pi as well and experiencing this issue...did we find any definitive solution?

oslfmt avatar Aug 21 '21 00:08 oslfmt

i'm on a pi as well and experiencing this issue...did we find any definitive solution?

I haven't fully synced yet due to other issues, but if you can keep your pi online this long, giving it up to a week to resolve on its own seems like it would be a helpful litmus test.

xloem avatar Aug 21 '21 06:08 xloem

I kept my pi on for about 7 hours before just rebooting. I didn't wait a week or longer because I was pretty sure the process had ground to a halt. I used nohup and & to run the process in the background and exit my ssh connection without it stopping. I was monitoring it with htop, and there was no process running related to geth and barely any processor power being consumed, so I concluded it had probably stopped. Restarted, and now it seems to be humming along again.

oslfmt avatar Aug 21 '21 16:08 oslfmt

Seeing the same issue even after a rebuild...

RobertClarke avatar Oct 05 '21 13:10 RobertClarke

I have this problem after a sudden power cut. It persists even after removing the data directory and starting fresh. My hypothesis is that the SSD filesystem somehow got corrupted by experiencing a power loss during writing.

TimoSci avatar Nov 09 '21 23:11 TimoSci

I am running an eth2 node with geth.

This has happened to me as well since upgrading to the geth version that had corruption issues (1.10.10 to 1.10.11). It has persisted for over a month, often throwing ext4-fs errors. I have run SMARTmon and fsck from a live CD on the drives and resynced chaindata from scratch (several times), and I still cannot get geth to run without causing my node to hang.

I am able to sync with Alchemy in the meantime, but would love any help to get back to having my local geth operational.

calamenium avatar Nov 16 '21 15:11 calamenium