
Corrupted db file when the vm got turned off because of an overload

Open bharathramh92 opened this issue 6 years ago • 13 comments

OS: macOS host with a RHEL VM

The db file got corrupted when macOS restarted by itself while my program was running in the RHEL VM. The check output follows.

$ bolt check tmp.db
page 0: multiple references
page 0: invalid type: unknown<00>
panic: invalid page type: 0: 0

goroutine 5 [running]:
panic(0x4e4120, 0xc420010610)
        /usr/lib/golang/src/runtime/panic.go:500 +0x1a1
github.com/boltdb/bolt.(*Cursor).search(0xc42003eba8, 0x7f50350f20f0, 0xa, 0xa, 0x1bb69)
        /opt/pindrop/include/go/src/github.com/boltdb/bolt/cursor.go:256 +0x429
github.com/boltdb/bolt.(*Cursor).seek(0xc42003eba8, 0x7f50350f20f0, 0xa, 0xa, 0x0, 0x0, 0x4f77a0, 0xc42000a3f0, 0x2, 0x2, ...)
        /opt/pindrop/include/go/src/github.com/boltdb/bolt/cursor.go:159 +0xb1
github.com/boltdb/bolt.(*Bucket).Bucket(0xc420078018, 0x7f50350f20f0, 0xa, 0xa, 0x0)
        /opt/pindrop/include/go/src/github.com/boltdb/bolt/bucket.go:112 +0x108
github.com/boltdb/bolt.(*Tx).checkBucket.func2(0x7f50350f20f0, 0xa, 0xa, 0x7f50350f20fa, 0x66, 0x66, 0x66, 0x0)
        /opt/pindrop/include/go/src/github.com/boltdb/bolt/tx.go:449 +0x70
github.com/boltdb/bolt.(*Bucket).ForEach(0xc420078018, 0xc42003ecc0, 0x0, 0xc42003ecf0)
        /opt/pindrop/include/go/src/github.com/boltdb/bolt/bucket.go:390 +0xff
github.com/boltdb/bolt.(*Tx).checkBucket(0xc420078000, 0xc420078018, 0xc42003eea0, 0xc42003eed0, 0xc4200540c0)
        /opt/pindrop/include/go/src/github.com/boltdb/bolt/tx.go:453 +0x135
github.com/boltdb/bolt.(*Tx).check(0xc420078000, 0xc4200540c0)
        /opt/pindrop/include/go/src/github.com/boltdb/bolt/tx.go:404 +0x5f7
created by github.com/boltdb/bolt.(*Tx).Check
        /opt/pindrop/include/go/src/github.com/boltdb/bolt/tx.go:379 +0x67

Is there any way to repair the db file? I checked https://github.com/boltdb/bolt/issues/348, and my version (ee30b748bcfbd74ec1d8439ae8fd4f9123a5c94e) is newer than that.

Note that I could not reproduce the corruption by manually powering off the virtual machine from macOS.

bharathramh92 avatar Jun 14 '18 17:06 bharathramh92

Has anyone else run into this issue?

bharathramh92 avatar Jun 18 '18 13:06 bharathramh92

Can any maintainer check this?

bharathramh92 avatar Jul 24 '18 00:07 bharathramh92

ping?

vtolstov avatar Aug 17 '18 17:08 vtolstov

is this repo actively maintained?

bharathramh92 avatar Aug 17 '18 17:08 bharathramh92

I don't know, but it's still the main fork that I know of. The original repository was archived because it was considered complete and they didn't want to weigh it down with extra features.

Virtual machines are tricky. You didn't say what you ran the VM in, but VirtualBox for example ignores flush requests by default, which Bolt (and every other database) depends on to ensure that writes occur in the correct order. That's not a problem if it's shut down normally, but a forced shutdown outside of the VM software's control can lead to partial, out-of-order writes which lead to corruption.

dtfinch avatar Aug 17 '18 19:08 dtfinch

I have the same problem, and it happened on Windows XP. I use this library in a released project, and it happened yesterday. I wasn't running it in a VM and didn't power off the system. I just used the Put function to save some data; now the bucket can be read but cannot be written.

liqingsanjin avatar Aug 18 '18 00:08 liqingsanjin

@dtfinch It was a Red Hat OS in the VM. In that case, what would happen on an accidental power failure?

bharathramh92 avatar Aug 18 '18 03:08 bharathramh92

@liqingsanjin It just got corrupted for no reason?

"I just used the put function to save some info and the bucket can be read and cannot be written." Can you explain how it was done?

bharathramh92 avatar Aug 18 '18 03:08 bharathramh92

@bharathramh92 Sorry, I don't know how it happened. I deployed my program about a month ago on 600+ computers running Windows 7 and Windows XP, and there was no problem until yesterday. From my program's log files, I saw that when it tried to write to a bucket, it panicked with the same error as yours, although the bucket can still be read. I tried restarting the program and Windows, but the bucket still can't be written. Here is my log output:

time="2018-08-17T22:04:16+08:00" level=error msg="invalid page type: 0: 0"

liqingsanjin avatar Aug 18 '18 11:08 liqingsanjin

@liqingsanjin that is so strange. I never had that issue.

bharathramh92 avatar Aug 21 '18 01:08 bharathramh92

I have the same problem.

xiusin avatar Sep 15 '18 09:09 xiusin

I saw a similar problem:

invalid page type: 0: 0
  File "go.etcd.io/[email protected]/cursor.go", line 250, in go.etcd.io/bbolt.(*Cursor).search
  File "go.etcd.io/[email protected]/cursor.go", line 159, in go.etcd.io/bbolt.(*Cursor).seek
  File "go.etcd.io/[email protected]/bucket.go", line 105, in go.etcd.io/bbolt.(*Bucket).Bucket
  File "go.etcd.io/[email protected]/tx.go", line 101, in go.etcd.io/bbolt.(*Tx).Bucket

This message comes from:

https://github.com/etcd-io/bbolt/blob/4b8b43e23cceca257d3d2958882dec02d9b16c69/cursor.go#L249-L250

So p.id == 0 and also p.flags == 0. If this is truly page 0, it should have metaPageFlag set in flags; regardless, flags == 0 is not one of the valid values:

https://github.com/etcd-io/bbolt/blob/4b8b43e23cceca257d3d2958882dec02d9b16c69/cmd/bbolt/main.go#L1841-L1846

Unfortunately I don't have access to the db file that caused the issue in my case, but if someone else does, I would suggest looking at the backup meta on the second meta page (page ID 1) to see if it's correct.

tmm1 avatar Oct 27 '22 23:10 tmm1

The page was somehow reset; in other words, all of the page's contents are zero values. FYI: https://github.com/etcd-io/bbolt/pull/520

ahrtr avatar Jun 01 '23 09:06 ahrtr