unreachable unfreed
After doing 10 million Puts into one bucket of an empty database (keys are nanosecond timestamps, values are 8 zero bytes, in batches of 100 thousand), execution finishes fine: no errors at all. (I'm using https://github.com/boltdb/coalescer with a 1 ms interval.)
Stats just after the Puts are done:
(bolt.Stats) {
FreePageN: (int) 0,
PendingPageN: (int) 66317,
FreeAlloc: (int) 271634432,
FreelistInuse: (int) 530552,
TxN: (int) 0,
OpenTxN: (int) 0,
TxStats: (bolt.TxStats) {
PageCount: (int) 4279034,
PageAlloc: (int) 17560088576,
CursorCount: (int) 10000204,
NodeCount: (int) 4164741,
NodeDeref: (int) 17049,
Rebalance: (int) 0,
RebalanceTime: (time.Duration) 0,
Split: (int) 114188,
Spill: (int) 4278932,
SpillTime: (time.Duration) 1m0.497335167s,
Write: (int) 4279136,
WriteTime: (time.Duration) 2m3.890396496s
}
}
cmd bolt stats
Aggregate statistics for 1 buckets
Page count statistics
Number of logical branch pages: 1022
Number of physical branch overflow pages: 0
Number of logical leaf pages: 113170
Number of physical leaf overflow pages: 0
Tree statistics
Number of keys/value pairs: 10000000
Number of levels in B+tree: 4
Page size utilization
Bytes allocated for physical branch pages: 4186112
Bytes actually used for branch data: 2756936 (65%)
Bytes allocated for physical leaf pages: 463544320
Bytes actually used for leaf data: 321810720 (69%)
Bucket statistics
Total number of buckets: 1
Total number on inlined buckets: 0 (0%)
Bytes used for inlined buckets: 0 (0%)
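(As a sanity check on those figures: 1,022 branch pages × 4,096 bytes = 4,186,112 bytes and 113,170 leaf pages × 4,096 bytes = 463,544,320 bytes, so the allocation numbers are consistent with the page counts and the default 4 KB page size; 2,756,936 / 4,186,112 ≈ 65% and 321,810,720 / 463,544,320 ≈ 69% match the reported utilization.)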
cmd bolt check reports:
page 179173: unreachable unfreed
1 errors found
cmd bolt pages | grep "179173":
179173 leaf 2
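If I'm reading the bolt pages columns right as ID, TYPE, and ITEMS, the orphaned page is a leaf holding 2 items.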
db file: https://www.dropbox.com/s/b7fcmb9h050uzpr/dbx.tst.tar.gz?dl=0
(I'm using the master branch at b8dbe1101d74c30e817227a99716591e3e6e6bc4.) Any ideas?
How to reproduce the error:
cmd bolt bench (with sequential keys, bolt check returns 'OK'):
bolt bench --write-mode 'rnd' --count '10000000' --batch-size '100000' --work
bolt check /tmp/bolt-bench-xxxxxxxx
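(The --work flag keeps the benchmark's temporary database file on disk after the run, which is what lets bolt check be pointed at it afterwards.)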
Code:

package bolttest

import (
	"encoding/binary"
	"fmt"
	"testing"
	"time"

	"github.com/boltdb/bolt"
	"github.com/davecgh/go-spew/spew"
)

func TestBolt0(t *testing.T) {
	db := open()
	bucket := []byte("bucket")
	if err := db.Update(func(tx *bolt.Tx) error {
		_, err := tx.CreateBucket(bucket)
		return err
	}); err != nil {
		t.Fatalf("update failed: %s", err)
	}

	value := make([]byte, 8) // 8 zero bytes
	var pkey []byte
	// 100 batches of 100,000 Puts = 10 million records.
	for i := 0; i < 100; i++ {
		fmt.Println("FLUSH N°", i)
		if err := db.Update(func(tx *bolt.Tx) error {
			for j := 0; j < 100000; j++ {
				pkey = NewKey()
				if err := tx.Bucket(bucket).Put(pkey, value); err != nil {
					return err
				}
			}
			return nil
		}); err != nil {
			t.Fatal(err)
		}
	}
	spew.Dump(db.Stats(), db.Path())
	spew.Dump(db)
	db.Close()
}

func open() *bolt.DB {
	db, err := bolt.Open("/tmp/dbx.tst", 0600, nil)
	if err != nil {
		panic("open: " + err.Error())
	}
	return db
}

// NewKey returns the current time in nanoseconds as an 8-byte
// little-endian key.
func NewKey() (key []byte) {
	key = make([]byte, 8)
	binary.LittleEndian.PutUint64(key, uint64(time.Now().UnixNano()))
	return
}
Hints:
When doing two separate write runs of 5 million Puts each, in batches of 80 thousand or more, against the same database:
the 1st run:
has no execution errors
cmd bolt check returns 'OK'.
the 2nd run:
has no execution errors
cmd bolt check returns one error: 'page XXXXXX: unreachable unfreed'
General observations after trying different configurations:
bolt check always returns exactly one unreachable-unfreed page error.
Even with StrictMode set to true, no error or panic shows up during execution.
The error appears when using batches of ~80 thousand or higher.
The error appears only when using non-sequential keys.
The error appears only when writing a large number of records (~10 million key/value pairs and over).
I'm guessing it's a capacity issue with the freelist: maybe at the commit stage, when the freelist is written to disk, it cuts off the last page id? A sketch of what I mean follows.
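To make the guess concrete, here is a minimal runnable sketch of the freelist-page layout as I understand it from reading the source (this is my paraphrase, not bolt's actual code, so treat the details as assumptions): the page header stores the element count in a uint16, so a list of 0xFFFF (65,535) or more page ids needs an overflow convention where the real count goes into the first data slot. An off-by-one anywhere in that path would lose exactly one page id, and note that PendingPageN in the stats above is 66,317, just over the uint16 limit.

package main

import "fmt"

// pgid mirrors bolt's page id type.
type pgid uint64

// serializeFreelist sketches the overflow convention (an assumption
// about the on-disk format, for illustration only): when the id count
// fits in a uint16 it is stored in the page header; otherwise a 0xFFFF
// sentinel is written and the real count goes into the first element
// of the data area.
func serializeFreelist(ids []pgid) (headerCount uint16, data []pgid) {
	if len(ids) < 0xFFFF {
		return uint16(len(ids)), ids
	}
	data = make([]pgid, 0, len(ids)+1)
	data = append(data, pgid(len(ids))) // real count in the first slot
	data = append(data, ids...)         // a slicing bug here would drop an id
	return 0xFFFF, data
}

func main() {
	ids := make([]pgid, 66317) // mirrors PendingPageN from the stats above
	headerCount, data := serializeFreelist(ids)
	fmt.Println(headerCount, len(data)) // prints: 65535 66318
}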
Update
Even when the cmd tool reports that the database has one error, the stored keys and values all seem intact, and no errors occur when reading them back. Why? Code:
package bolttest

import (
	"encoding/binary"
	"fmt"
	"io/ioutil"
	"os"
	"testing"
	"time"

	"github.com/boltdb/bolt"
	"github.com/davecgh/go-spew/spew"
)

func TestBolt0(t *testing.T) {
	db := open()
	db.StrictMode = true // the error appears even without StrictMode
	bucket := []byte("bucket")
	if err := db.Update(func(tx *bolt.Tx) error {
		_, err := tx.CreateBucket(bucket)
		return err
	}); err != nil {
		t.Fatalf("update failed: %s", err)
	}

	fmt.Println("WRITING")
	lessTime := uint64(time.Now().UnixNano())
	var valueCount uint64
	var keySum uint64
	operations := 10000000
	batchSize := 100000
	for i := 0; i < operations/batchSize; i++ {
		if err := db.Update(func(tx *bolt.Tx) error {
			for j := 0; j < batchSize; j++ {
				value := make([]byte, 8)
				valueCount++
				pkeyuint, pkey := NewKey(lessTime)
				keySum += pkeyuint
				binary.PutUvarint(value, valueCount)
				if err := tx.Bucket(bucket).Put(pkey, value); err != nil {
					return err
				}
			}
			return nil
		}); err != nil {
			t.Fatal(err)
		}
		fmt.Println("FLUSH N°", i)
	}
	fmt.Println("keysum:", keySum)
	spew.Dump(db.Stats())
	spew.Dump(db)

	// Reopen the file and verify every key and value by summing them back up.
	dbfile := db.Path()
	db.Close()
	time.Sleep(3 * time.Second)
	db, err := bolt.Open(dbfile, 0600, nil)
	if err != nil {
		panic("open: " + err.Error())
	}
	var vsumtotal uint64
	var ksumtotal uint64
	fmt.Println("READING")
	db.View(func(tx *bolt.Tx) error {
		c := tx.Bucket(bucket).Cursor()
		for k, v := c.First(); k != nil; k, v = c.Next() {
			val, n := binary.Uvarint(v)
			if n <= 0 {
				t.Fatal("invalid uvarint value")
			}
			vsumtotal += val
			key, n := binary.Uvarint(k)
			if n <= 0 {
				t.Fatal("invalid uvarint key")
			}
			ksumtotal += key
		}
		return nil
	})

	// Values are 1..operations, so their sum should be n(n+1)/2.
	valueSum := uint64(operations) * uint64(operations+1) / 2
	if vsumtotal != valueSum {
		t.Error("value sum mismatch: expected", valueSum, "got", vsumtotal)
	}
	if ksumtotal != keySum {
		t.Error("key sum mismatch: expected", keySum, "got", ksumtotal)
	}
	fmt.Println(vsumtotal)
	spew.Dump(db.Stats())
	spew.Dump(db)

	// Run the same consistency check the CLI tool runs.
	if err := db.View(func(tx *bolt.Tx) error {
		var count int
		for cerr := range tx.Check() {
			fmt.Println(cerr)
			count++
		}
		if count > 0 {
			t.Fatalf("%d errors found", count)
		}
		fmt.Println("OK")
		return nil
	}); err != nil {
		t.Fatal(err)
	}
	//closedb(db)
}

// NewKey returns the nanoseconds elapsed since `less`, both as a uint64
// and uvarint-encoded into an 8-byte key. The keys are non-sequential in
// byte order, since uvarint encoding does not preserve numeric ordering.
// (It also assumes successive calls never see the same nanosecond value;
// a duplicate key would silently overwrite an earlier record.)
func NewKey(less uint64) (keyUint uint64, keyBytes []byte) {
	keyBytes = make([]byte, 8)
	keyUint = uint64(time.Now().UnixNano()) - less
	binary.PutUvarint(keyBytes, keyUint)
	return
}

func tempfile() string {
	f, _ := ioutil.TempFile("", "coalescer-")
	path := f.Name()
	f.Close()
	os.Remove(path)
	return path
}

func open() *bolt.DB {
	db, err := bolt.Open(tempfile(), 0600, nil)
	if err != nil {
		panic("open: " + err.Error())
	}
	return db
}

func closedb(db *bolt.DB) {
	if db == nil {
		return
	}
	path := db.Path()
	db.Close()
	if path != "" {
		os.Remove(path)
	}
}
Is this the same issue as https://groups.google.com/d/msg/influxdb/eL-ijjcQnNg/MDFQtiZxSfkJ ? (ref https://github.com/boltdb/bolt/issues/301)
Do you see the same issue if you don't use the coalescer?
Yes, it happens without the coalescer too. I just updated the first comment with more info, please check it out. @mkobetic @snormore @tv42, have you experienced the same problem?
Hi,
Is there any progress here? We have a large (16 GB) file and eventually run into the same issue. Our error output:
Issue opening disk cache: page 3408082: unreachable unfreed
Iteration error: page 3408082: unreachable unfreed
Inside our Godeps:
{
"ImportPath": "github.com/boltdb/bolt",
"Comment": "v1.0-79-g2c04100",
"Rev": "2c04100eb9793f2b8541d243494e2909d2112325"
},
@cep21 Thanks for pinging me on this issue. It got lost. Sorry about that. I'm looking into it now.
My production servers consistently return "page ***: unreachable unfreed" after running for a few days, whenever I run a check at the beginning of a restart. My workload is millions of small key/value pairs inside a single, mostly read-heavy bucket. Is this an innocuous error? I never get errors while the database is running, just when restarting and doing a fresh Check().
@cep21 It's not a great error, but it's likely not hurting anything if you're not seeing any other errors. It means the database is missing a reference to that page: it's probably getting dropped from the freelist for some reason, so you simply can't reclaim it. I can reproduce the issue with 5.6M random inserts in 100K batches:
$ bolt bench -write-mode rnd --count 5600000 -batch-size 100000 -work
It takes 3m for each run so narrowing down the exact code path that it's occurring on takes some time. I'm looking into it.
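For context on what the check is reporting, here is roughly the classification it performs, paraphrased into a standalone sketch (the function and parameter names are mine, not bolt's API): every page below the high-water mark must be either reachable by walking the bucket trees from the meta page or present on the freelist; a page in neither set is reported as unreachable unfreed.

package main

import "fmt"

// findUnreachableUnfreed sketches the invariant the consistency check
// enforces (a paraphrase, not bolt's actual code): reachable is built by
// walking every bucket's B+tree from the meta page, freed comes from the
// freelist, and any page id belonging to neither set has been leaked.
func findUnreachableUnfreed(highWaterMark uint64, reachable, freed map[uint64]bool) []string {
	var errs []string
	for id := uint64(0); id < highWaterMark; id++ {
		if !reachable[id] && !freed[id] {
			errs = append(errs, fmt.Sprintf("page %d: unreachable unfreed", id))
		}
	}
	return errs
}

func main() {
	// Page 3 is neither referenced by a bucket nor on the freelist.
	reachable := map[uint64]bool{0: true, 1: true, 2: true}
	freed := map[uint64]bool{4: true}
	fmt.Println(findUnreachableUnfreed(5, reachable, freed)) // [page 3: unreachable unfreed]
}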
Cool. As an aside, we've tested our use of bolt on Go 1.5beta1 and consistently see benchcmp results of -20% to -50% on ns/op. I haven't dug into why, but bolt appears much faster on 1.5.
That's great to hear. Bolt allocates a fair number of objects for the in-memory representation before it commits to disk, so it could just be the new GC.
@cep21 Go 1.5 sets GOMAXPROCS to the number of cores instead of 1. I'd guess that's what "improved" the performance so much.
I ran the 1.4.2 and 1.5 tests with the same number of cores: 8 (which is correct for my test machine)
Just adding myself here, as I've seen similar behavior with the current bolt version. I can try testing my workload against 1.5, but I don't actually see a branch for it. Where does it live?
The result I'm observing from this bug is unnatural growth in the size of the on-disk cache, causing increased CPU use and page throttling when doing reads. This is probably a reasonably important issue for long-running servers.
@cep21 Have you tried to find the root cause?
I haven't spent a large amount of time on it. My current workaround is to rebuild the database entirely on a weekly cadence, roughly as sketched below.
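For anyone else stuck on this, a minimal sketch of such a rebuild, assuming a single level of buckets (nested buckets would need recursion): copy every key/value pair into a fresh file, which leaves any leaked pages behind in the old one. The paths in main are hypothetical placeholders.

package main

import (
	"log"

	"github.com/boltdb/bolt"
)

// rebuild copies every top-level bucket's key/value pairs from the
// database at srcPath into a brand-new file at dstPath. Leaked pages in
// the source are simply left behind. Nested buckets are not handled in
// this sketch.
func rebuild(srcPath, dstPath string) error {
	src, err := bolt.Open(srcPath, 0600, nil)
	if err != nil {
		return err
	}
	defer src.Close()

	dst, err := bolt.Open(dstPath, 0600, nil)
	if err != nil {
		return err
	}
	defer dst.Close()

	return src.View(func(stx *bolt.Tx) error {
		return stx.ForEach(func(name []byte, sb *bolt.Bucket) error {
			// One write transaction per source bucket.
			return dst.Update(func(dtx *bolt.Tx) error {
				dbkt, err := dtx.CreateBucketIfNotExists(name)
				if err != nil {
					return err
				}
				return sb.ForEach(func(k, v []byte) error {
					return dbkt.Put(k, v)
				})
			})
		})
	})
}

func main() {
	// Hypothetical paths; point these at a real database and a fresh file.
	if err := rebuild("/tmp/dbx.tst", "/tmp/dbx.rebuilt"); err != nil {
		log.Fatal(err)
	}
}

One write transaction per bucket keeps the sketch simple, but for buckets with millions of keys you'd want to chunk the copy into smaller transactions to bound memory use.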