badger icon indicating copy to clipboard operation
badger copied to clipboard

[BUG]: Panic after power fail

Open dgallion1 opened this issue 3 years ago • 27 comments

What version of Badger are you using?

v3.2103.2

What version of Go are you using?

go version go1.19.4 windows/amd64

Have you tried reproducing the issue with the latest release?

None

What is the hardware spec (RAM, CPU, OS)?

OS Name Microsoft Windows 11 Pro Version 10.0.22621 Build 22621 Other OS Description Not Available OS Manufacturer Microsoft Corporation System Name AMD2 System Manufacturer ASUS System Model System Product Name System Type x64-based PC System SKU SKU Processor AMD Ryzen 9 7950X 16-Core Processor, 4501 Mhz, 16 Core(s), 32 Logical Processor(s) BIOS Version/Date American Megatrends Inc. 0705, 10/5/2022 SMBIOS Version 3.5 Embedded Controller Version 255.255 BIOS Mode UEFI BaseBoard Manufacturer ASUSTeK COMPUTER INC. BaseBoard Product ROG CROSSHAIR X670E EXTREME BaseBoard Version Rev 1.xx Platform Role Desktop Secure Boot State Off PCR7 Configuration Elevation Required to View Windows Directory C:\Windows System Directory C:\Windows\system32 Boot Device \Device\HarddiskVolume1 Locale United States Hardware Abstraction Layer Version = "10.0.22621.819" User Name AMD2\dgall Time Zone Eastern Standard Time Installed Physical Memory (RAM) 128 GB Total Physical Memory 127 GB Available Physical Memory 108 GB Total Virtual Memory 145 GB Available Virtual Memory 119 GB Page File Space 18.0 GB Page File C:\pagefile.sys Kernel DMA Protection Off Virtualization-based security Running Virtualization-based security Required Security Properties Virtualization-based security Available Security Properties Base Virtualization Support, DMA Protection, UEFI Code Readonly, SMM Security Mitigations 1.0, Mode Based Execution Control Virtualization-based security Services Configured Hypervisor enforced Code Integrity Virtualization-based security Services Running Credential Guard, Hypervisor enforced Code Integrity Windows Defender Application Control policy Enforced Windows Defender Application Control user mode policy Off Device Encryption Support Elevation Required to View A hypervisor has been detected. Features required for Hyper-V will not be displayed.

What steps will reproduce the bug?

Power removed.

Expected behavior and actual result.

After a power failure, running Windows 11.

panic: runtime error: slice bounds out of range [-16:] [recovered] panic: == Recovering from initIndex crash == File Info: [ID: 1157, Size: 8545749, Zeros: 8545749] isEnrypted: true checksumLen: 0 checksum: indexLen: 0 index: [] == Recovered ==

goroutine 33 [running]: github.com/dgraph-io/badger/v3/table.(*Table).initBiggestAndSmallest.func1.1() C:/Users/dgall/go/pkg/mod/github.com/dgraph-io/badger/[email protected]/table/table.go:351 +0xa8 github.com/dgraph-io/badger/v3/table.(*Table).initBiggestAndSmallest.func1() C:/Users/dgall/go/pkg/mod/github.com/dgraph-io/badger/[email protected]/table/table.go:397 +0xc2 panic({0x7ff7462d5cc0, 0xc00059e018}) C:/Program Files/Go/src/runtime/panic.go:884 +0x212 github.com/dgraph-io/badger/v3/table.(*Table).decrypt(0xc00087c030?, {0x22decfa65cd?, 0xc000159970?, 0x7ff745b5e2ce?}, 0xcd?) C:/Users/dgall/go/pkg/mod/github.com/dgraph-io/badger/[email protected]/table/table.go:753 +0x16b github.com/dgraph-io/badger/v3/table.(*Table).readTableIndex(0xc00021ee40) C:/Users/dgall/go/pkg/mod/github.com/dgraph-io/badger/[email protected]/table/table.go:702 +0x5c github.com/dgraph-io/badger/v3/table.(*Table).initIndex(0xc00021ee40) C:/Users/dgall/go/pkg/mod/github.com/dgraph-io/badger/[email protected]/table/table.go:462 +0x19d github.com/dgraph-io/badger/v3/table.(*Table).initBiggestAndSmallest(0xc00021ee40) C:/Users/dgall/go/pkg/mod/github.com/dgraph-io/badger/[email protected]/table/table.go:401 +0x7f github.com/dgraph-io/badger/v3/table.OpenTable(0xc000142260, {0x0, 0x1, 0x200000, 0x0, 0x0, 0x3f847ae147ae147b, 0x1000, 0xc000852200, 0x1, ...}) C:/Users/dgall/go/pkg/mod/github.com/dgraph-io/badger/[email protected]/table/table.go:308 +0x272 github.com/dgraph-io/badger/v3.newLevelsController.func1({0xc0004a60c0, 0x13}, {0x0?, 0x0?, 0x0?}) C:/Users/dgall/go/pkg/mod/github.com/dgraph-io/badger/[email protected]/levels.go:150 +0x1f9 created by github.com/dgraph-io/badger/v3.newLevelsController C:/Users/dgall/go/pkg/mod/github.com/dgraph-io/badger/[email protected]/levels.go:129 +0x585

Additional information

On Windows. Our production DB is dead so often that we take a full backup every 5min. Then attempt to recover and restore if needed. Or keys are lost, and we don't know that. Can't capture the error with recover(). How is that possible?

dgallion1 avatar Feb 23 '23 14:02 dgallion1

seems related to #1625 #1626 and DGRAPH-2841 on JIRA board.

It seems like the whole file has 0s in it and no data. We do not support windows officially yet, but I am curious why it would crash every 5 min. It is crashing with this issue every 5 min?

mangalaman93 avatar Mar 18 '23 08:03 mangalaman93

also related to #1797

mangalaman93 avatar Mar 18 '23 08:03 mangalaman93

It's not crashing every 5min. We take a complete backup every 5min and keep them in min/hour/day folders. So after a crash, we can recover. Sometimes the DB is just corrupted, and we can't detect that. So 5min backups start being overwritten with insufficient data.

Thinking about always backing up during shutdown. Then using the backup when starting back up.

I don't know what to do about the silent loss of keys.

dgallion1 avatar Mar 18 '23 14:03 dgallion1

If you could share your last backup and the data after immediate crash (If it is not confidential) and along with logs, that would give up good amount of information to debug this.

As I said, we currently do not run our tests on windows, I am not sure if this will change in future. mmap could be acting different on windows leading to corruption of the db. I do see quite a few open issues on windows related to this, maybe that pushes us to at least ensure our tests pass on windows. Thanks for information in any case.

mangalaman93 avatar Mar 18 '23 16:03 mangalaman93

I appreciate your interest. I'll try to gather more data for you. Confidential data is an issue.

dgallion1 avatar Mar 18 '23 20:03 dgallion1

It's not crashing every 5min. We take a complete backup every 5min and keep them in min/hour/day folders.

How do you do this?

Since it's ACID compliant, it should work. But only if you take an atomic filesystem snapshot. Copying the database just with rsync during it's being used certainly is not the way to go.

shufps avatar Mar 19 '23 06:03 shufps

A task reads all keys/values and writes them to a file.

dgallion1 avatar Mar 19 '23 12:03 dgallion1

panic: runtime error: slice bounds out of range [-16:] [recovered] panic: == Recovering from initIndex crash == File Info: [ID: 1193, Size: 5060869, Zeros: 5060869] isEnrypted: true checksumLen: 0 checksum: indexLen: 0 index: [] == Recovered ==

goroutine 58 [running]: github.com/dgraph-io/badger/v3/table.(*Table).initBiggestAndSmallest.func1.1() C:/Users/dgall/go/pkg/mod/github.com/dgraph-io/badger/[email protected]/table/table.go:351 +0xa8 github.com/dgraph-io/badger/v3/table.(*Table).initBiggestAndSmallest.func1() C:/Users/dgall/go/pkg/mod/github.com/dgraph-io/badger/[email protected]/table/table.go:397 +0xc2 panic({0x7ff62bd59980, 0xc00012a330}) C:/Program Files/Go/src/runtime/panic.go:884 +0x213 github.com/dgraph-io/badger/v3/table.(*Table).decrypt(0xc000612110?, {0x1a2e75638fd?, 0xc0001cb970?, 0x7ff62b5d02ce?}, 0xfd?) C:/Users/dgall/go/pkg/mod/github.com/dgraph-io/badger/[email protected]/table/table.go:753 +0x16b github.com/dgraph-io/badger/v3/table.(*Table).readTableIndex(0xc00059c0c0) C:/Users/dgall/go/pkg/mod/github.com/dgraph-io/badger/[email protected]/table/table.go:702 +0x5c github.com/dgraph-io/badger/v3/table.(*Table).initIndex(0xc00059c0c0) C:/Users/dgall/go/pkg/mod/github.com/dgraph-io/badger/[email protected]/table/table.go:462 +0x19d github.com/dgraph-io/badger/v3/table.(*Table).initBiggestAndSmallest(0xc00059c0c0) C:/Users/dgall/go/pkg/mod/github.com/dgraph-io/badger/[email protected]/table/table.go:401 +0x7f github.com/dgraph-io/badger/v3/table.OpenTable(0xc000236000, {0x0, 0x1, 0x200000, 0x0, 0x0, 0x3f847ae147ae147b, 0x1000, 0xc00016a780, 0x1, ...}) C:/Users/dgall/go/pkg/mod/github.com/dgraph-io/badger/[email protected]/table/table.go:308 +0x272 github.com/dgraph-io/badger/v3.newLevelsController.func1({0xc00012a2a0, 0x13}, {0x0?, 0x0?, 0x0?}) C:/Users/dgall/go/pkg/mod/github.com/dgraph-io/badger/[email protected]/levels.go:150 +0x1f9 created by github.com/dgraph-io/badger/v3.newLevelsController C:/Users/dgall/go/pkg/mod/github.com/dgraph-io/badger/[email protected]/levels.go:129 +0x585

dgallion1 avatar Mar 30 '23 13:03 dgallion1

A task reads all keys/values and writes them to a file.

That's not a clean and consistent approach. If you have to do without shutdown, you have to do it atomically on filesystem level and rely on ACID conformity and recovery.

shufps avatar Apr 28 '23 16:04 shufps

It can't be shut down every 5min to do a backup. A Windows update or power failure has been known to crash the DB. A hot backup is the best I know of. Fortunately, the design replicates data from other Badger DBs. So much of the DB heals itself. Looking at pushing the rest of it into something like Postgres. Going full Rube Goldberg here, then restoring critical data from Postgres at startup.

dgallion1 avatar Apr 28 '23 19:04 dgallion1

It can't be shut down every 5min to do a backup.

Do an atomic file-system snapshot. They are provided by e.g. zfs, btrfs. Or if you have the money storage systems that support them like from Netapp (but also others ofc because this is standard)

This should allow you to do hot-backups with a consistent state because, as said, it's an atomic operation.

A Windows update or power failure has been known to crash the DB.

Oh Windows ....... Hmm, this is a problem.

shufps avatar May 01 '23 05:05 shufps

Badger provides snapshot isolation. That means, if you use a txn to read data during backup, you will get a consistent snapshot of the data. You don't need to worry about anything in that case and your snapshot is a consistent view of the data at a given point in time.

mangalaman93 avatar May 03 '23 10:05 mangalaman93

This is indeed a problem, but the main issue is that the exception cannot be caught through err or recover, which will result in continuous crashes of the program and make subsequent processing methods unachievable.

package main

import (
        "log"

        "github.com/dgraph-io/badger/v4"
)

func main() {
        // can not recover
        defer func() {
                err := recover()
                if err != nil {
                        log.Fatal("catch panic: ", err)
                }
        }()

        opt := badger.DefaultOptions("../db")
        opt.WithMemTableSize(1 << 20)
        opt.WithValueThreshold(1 << 10)
        opt.WithValueLogFileSize(1 << 20)
        opt.WithLoggingLevel(badger.DEBUG)

        db, err := badger.Open(opt)
        // no err
        if err != nil {
                log.Fatal("have err: ", err)
        }
        db.Close()
}
panic: runtime error: index out of range [3] with length 0 [recovered]
        panic: 
== Recovering from initIndex crash ==
File Info: [ID: 44, Size: 680, Zeros: 680]
isEnrypted: false checksumLen: 0 checksum:  indexLen: 0 index: [] 
== Recovered ==


goroutine 16 [running]:
github.com/dgraph-io/badger/v4/table.(*Table).initBiggestAndSmallest.func1.1()
        /home/muk/codes/lang/go/pkg/mod/github.com/dgraph-io/badger/[email protected]/table/table.go:353 +0xa8
github.com/dgraph-io/badger/v4/table.(*Table).initBiggestAndSmallest.func1()
        /home/muk/codes/lang/go/pkg/mod/github.com/dgraph-io/badger/[email protected]/table/table.go:399 +0xc2
panic({0x867b00, 0xc00001c468})
        /home/muk/.g/go/src/runtime/panic.go:884 +0x213
github.com/google/flatbuffers/go.GetInt32(...)
        /home/muk/codes/lang/go/pkg/mod/github.com/google/[email protected]/go/encode.go:85
github.com/google/flatbuffers/go.GetUOffsetT(...)
        /home/muk/codes/lang/go/pkg/mod/github.com/google/[email protected]/go/encode.go:121
github.com/dgraph-io/badger/v4/fb.GetRootAsTableIndex(...)
        /home/muk/codes/lang/go/pkg/mod/github.com/dgraph-io/badger/[email protected]/fb/TableIndex.go:14
github.com/dgraph-io/badger/v4/table.(*Table).readTableIndex(0xc000258000)
        /home/muk/codes/lang/go/pkg/mod/github.com/dgraph-io/badger/[email protected]/table/table.go:707 +0x15e
github.com/dgraph-io/badger/v4/table.(*Table).initIndex(0xc000258000)
        /home/muk/codes/lang/go/pkg/mod/github.com/dgraph-io/badger/[email protected]/table/table.go:464 +0x19d
github.com/dgraph-io/badger/v4/table.(*Table).initBiggestAndSmallest(0xc000258000)
        /home/muk/codes/lang/go/pkg/mod/github.com/dgraph-io/badger/[email protected]/table/table.go:403 +0x7f
github.com/dgraph-io/badger/v4/table.OpenTable(0xc000140140, {0x0, 0x1, 0x200000, 0x0, 0x0, 0x3f847ae147ae147b, 0x1000, 0x0, 0x1, ...})
        /home/muk/codes/lang/go/pkg/mod/github.com/dgraph-io/badger/[email protected]/table/table.go:309 +0x278
github.com/dgraph-io/badger/v4.newLevelsController.func1({0xc0001320f0, 0x10}, {0x0?, 0x0?, 0x0?})
        /home/muk/codes/lang/go/pkg/mod/github.com/dgraph-io/badger/[email protected]/levels.go:150 +0x1f9
created by github.com/dgraph-io/badger/v4.newLevelsController
        /home/muk/codes/lang/go/pkg/mod/github.com/dgraph-io/badger/[email protected]/levels.go:129 +0x585
exit status 2

gemone avatar May 31 '23 06:05 gemone

Why would that be? You should be able to recover from panic using the recover() function. I don't have a windows machine, but do you have any insights?

mangalaman93 avatar May 31 '23 12:05 mangalaman93

The example that Gemone gave isn't crashing for me. I can't find a case where recover() won't catch the error. This could be what I saw: "In Go 1.16, the recover() function was made more robust by handling panics that occur in deferred functions. Previously, panics in deferred functions would cause the program to crash."

Now use this "github.com/gorpher/gowin32/session_notifications" to close the DB before a restart.

dgallion1 avatar Jun 01 '23 21:06 dgallion1

Why would that be? You should be able to recover from panic using the recover() function. I don't have a windows machine, but do you have any insights?

@mangalaman93 This issue is not related to the system, but rather caused by the power outage that resulted in database file corruption. This issue would occur on any system. The inability to recover() may be due to an internal recover() call within Open(), or a panic occurring within a goroutine.

https://github.com/dgraph-io/badger/blob/f9b9e4d0553849dc518613a6eea20787704d4f12/table/table.go#L353

The example that Gemone gave isn't crashing for me. I can't find a case where recover() won't catch the error. This could be what I saw: "In Go 1.16, the recover() function was made more robust by handling panics that occur in deferred functions. Previously, panics in deferred functions would cause the program to crash."

Now use this "github.com/gorpher/gowin32/session_notifications" to close the DB before a restart.

@dgallion1 I am using a program created with a service that automatically restarts in case of a crash. However, whenever the DB file is corrupted, it results in a panic that cannot recover(), which causes the service to keep restarting.

Go version: 1.20.3 Badger: 4.1.0

gemone avatar Jun 02 '23 05:06 gemone

If you could help us reproduce it on linux, I will be happy to look into this on higher priority.

mangalaman93 avatar Jun 05 '23 23:06 mangalaman93

I have the DB that caused this and will look into it more.

badger 2023/06/22 07:56:21 INFO: All 6 tables opened in 212ms badger 2023/06/22 07:56:21 INFO: Discard stats nextEmptySlot: 0 badger 2023/06/22 07:56:21 INFO: Set nextTxnTs to 1687428125010547400 badger 2023/06/22 07:56:21 INFO: Deleting empty file: .//keyValue\000276.vlog badger 2023/06/22 07:56:21 INFO: badger 2023/06/22 07:56:21 INFO: Level: 0. 0 B Size. 0 B Max. badger 2023/06/22 07:56:21 INFO: Level: 1. 0 B Size. 10 MiB Max. badger 2023/06/22 07:56:21 INFO: Level: 2. 0 B Size. 10 MiB Max. badger 2023/06/22 07:56:21 INFO: Level: 3. 0 B Size. 10 MiB Max. badger 2023/06/22 07:56:21 INFO: Level: 4. 0 B Size. 22 MiB Max. badger 2023/06/22 07:56:21 INFO: Level: 5. 240 MiB Size. 215 MiB Max. badger 2023/06/22 07:56:21 INFO: Level: 6. 2.1 GiB Size. 2.1 GiB Max. badger 2023/06/22 07:56:21 INFO: Attempting to compact with {level:5 score:1.71 adjusted:0 dropPrefixes:[] t:{baseLevel:0 targetSz:[] fileSz:[]}} badger 2023/06/22 07:56:21 WARNING: While running doCompact with {level:5 score:1.71 adjusted:0 dropPrefixes:[] t:{baseLevel:0 targetSz:[] fileSz:[]}}. Error: Unable to fill tables badger 2023/06/22 07:56:21 WARNING: While running doCompact with {level:5 score:1.71 adjusted:0 dropPrefixes:[] t:{baseLevel:0 targetSz:[] fileSz:[]}}. Error: Unable to fill tables badger 2023/06/22 07:56:21 WARNING: While running doCompact with {level:5 score:1.71 adjusted:0 dropPrefixes:[] t:{baseLevel:0 targetSz:[] fileSz:[]}}. Error: Unable to fill tables panic: Allocator can not allocate more than 64 buffers

goroutine 190 [running]: github.com/dgraph-io/ristretto/z.(*Allocator).addBufferAt(0xc00075df20?, 0x7ff789f87cc5?, 0xd123022ac0?) C:/Users/SamPlace/go/pkg/mod/github.com/dgraph-io/[email protected]/z/allocator.go:251 +0x1d9 github.com/dgraph-io/ristretto/z.(*Allocator).Allocate(0xc000494e00, 0x2298) C:/Users/SamPlace/go/pkg/mod/github.com/dgraph-io/[email protected]/z/allocator.go:302 +0x159 github.com/dgraph-io/badger/v3/table.(*Builder).compressData(0x0?, {0xd1006761f6, 0x1d8c, 0x9e0a}) C:/Users/SamPlace/go/pkg/mod/github.com/dgraph-io/badger/[email protected]/table/builder.go:519 +0x99 github.com/dgraph-io/badger/v3/table.(*Builder).handleBlock(0xc0001b82d0) C:/Users/SamPlace/go/pkg/mod/github.com/dgraph-io/badger/[email protected]/table/builder.go:169 +0xfc created by github.com/dgraph-io/badger/v3/table.NewTableBuilder C:/Users/SamPlace/go/pkg/mod/github.com/dgraph-io/badger/[email protected]/table/builder.go:155 +0x28f

dgallion1 avatar Jun 22 '23 13:06 dgallion1

This code crashes and isn't caught. Same thing in Badger.

func main() {
	defer RecoverStackTrace()
	go func() {
		panic("death")
	}()
	select {}
}

I was able to catch the crash here. builder.go

func (b *Builder) handleBlock() {
	defer RecoverStackTrace()  <--- Catch it here now.
	defer b.wg.Done()

	doCompress := b.opts.Compression != options.None
	for item := range b.blockChan {

dgallion1 avatar Jun 22 '23 14:06 dgallion1

Is there any way to resolve this, this is becoming frequent in my case.

Akhilesh53 avatar Sep 20 '23 15:09 Akhilesh53

Is there any way to resolve this, this is becoming frequent in my case.

@Akhilesh53 My current approach is to store the cached state in the registry. If the Badger does not start properly, I will delete the damaged files and restart the Badger. For reference only.

From the source code, it seems that this error was not passed for processing, which makes it difficult for me to handle the error properly.

gemone avatar Sep 21 '23 01:09 gemone

But why badger is creating corrupted files and cannot read the same itself? I am these three errors continuously, mentioning below.


badger 2023/09/21 12:38:37 ERROR: Got error: While opening memtable: /data/dedupe/data/table_50/Pentagram/00067.mem error: while opening file: /data/dedupe/data/table_50/Pentagram/00067.mem error: open /data/dedupe/data/table_50/Pentagram/00067.mem: no such file or directory
unable to open: /data/dedupe/data/table_50/Pentagram/00067.mem
github.com/dgraph-io/ristretto/z.OpenMmapFile
    /data/badger_broadband/badger-cluster/vendor/github.com/dgraph-io/ristretto/z/file.go:80
github.com/dgraph-io/badger/v4.(*logFile).open
    /data/badger_broadband/badger-cluster/vendor/github.com/dgraph-io/badger/v4/memtable.go:554
github.com/dgraph-io/badger/v4.(*DB).openMemTable
    /data/badger_broadband/badger-cluster/vendor/github.com/dgraph-io/badger/v4/memtable.go:130
github.com/dgraph-io/badger/v4.(*DB).newMemTable
    /data/badger_broadband/badger-cluster/vendor/github.com/dgraph-io/badger/v4/memtable.go:151
github.com/dgraph-io/badger/v4.(*DB).ensureRoomForWrite
    /data/badger_broadband/badger-cluster/vendor/github.com/dgraph-io/badger/v4/db.go:995
github.com/dgraph-io/badger/v4.(*DB).writeRequests
    /data/badger_broadband/badger-cluster/vendor/github.com/dgraph-io/badger/v4/db.go:829
github.com/dgraph-io/badger/v4.(*DB).doWrites.func1
    /data/badger_broadband/badger-cluster/vendor/github.com/dgraph-io/badger/v4/db.go:884
runtime.goexit
    /usr/local/go/src/runtime/asm_amd64.s:1598 for id: 67
badger 2023/09/21 12:38:37 ERROR: writeRequests: writeRequests err: cannot create new mem table error: newMemTable error: While opening memtable: /data/dedupe/data/table_50/Pentagram/00067.mem error: while opening file: /data/dedupe/data/table_50/Pentagram/00067.mem error: open /data/dedupe/data/table_50/Pentagram/00067.mem: no such file or directory
unable to open: /data/dedupe/data/table_50/Pentagram/00067.mem
github.com/dgraph-io/ristretto/z.OpenMmapFile
    /data/badger_broadband/badger-cluster/vendor/github.com/dgraph-io/ristretto/z/file.go:80
github.com/dgraph-io/badger/v4.(*logFile).open
    /data/badger_broadband/badger-cluster/vendor/github.com/dgraph-io/badger/v4/memtable.go:554
github.com/dgraph-io/badger/v4.(*DB).openMemTable
    /data/badger_broadband/badger-cluster/vendor/github.com/dgraph-io/badger/v4/memtable.go:130
github.com/dgraph-io/badger/v4.(*DB).newMemTable
    /data/badger_broadband/badger-cluster/vendor/github.com/dgraph-io/badger/v4/memtable.go:151
github.com/dgraph-io/badger/v4.(*DB).ensureRoomForWrite
    /data/badger_broadband/badger-cluster/vendor/github.com/dgraph-io/badger/v4/db.go:995
github.com/dgraph-io/badger/v4.(*DB).writeRequests
    /data/badger_broadband/badger-cluster/vendor/github.com/dgraph-io/badger/v4/db.go:829
github.com/dgraph-io/badger/v4.(*DB).doWrites.func1
    /data/badger_broadband/badger-cluster/vendor/github.com/dgraph-io/badger/v4/db.go:884
runtime.goexit
    /usr/local/go/src/runtime/asm_amd64.s:1598
badger 2023/09/21 12:38:38 ERROR: Failure while flushing memtable to disk: error while creating table err: while creating table: /data/dedupe/data/table_50/Pentagram/000133.sst error: open /data/dedupe/data/table_50/Pentagram/000133.sst: no such file or directory
unable to open: /data/dedupe/data/table_50/Pentagram/000133.sst
github.com/dgraph-io/ristretto/z.OpenMmapFile
    /data/badger_broadband/badger-cluster/vendor/github.com/dgraph-io/ristretto/z/file.go:80
github.com/dgraph-io/badger/v4/table.CreateTable
    /data/badger_broadband/badger-cluster/vendor/github.com/dgraph-io/badger/v4/table/table.go:259
github.com/dgraph-io/badger/v4.(*DB).handleFlushTask
    /data/badger_broadband/badger-cluster/vendor/github.com/dgraph-io/badger/v4/db.go:1061
github.com/dgraph-io/badger/v4.(*DB).flushMemtable
    /data/badger_broadband/badger-cluster/vendor/github.com/dgraph-io/badger/v4/db.go:1083
github.com/dgraph-io/badger/v4.Open.func5
    /data/badger_broadband/badger-cluster/vendor/github.com/dgraph-io/badger/v4/db.go:354
runtime.goexit
    /usr/local/go/src/runtime/asm_amd64.s:1598. Retrying...
total time: cluster time for 50 took 13m33.967768745s

panic: cannot create new mem table error: newMemTable error: While opening memtable: /data/dedupe/data/table_50/Pentagram/00067.mem error: while opening file: /data/dedupe/data/table_50/Pentagram/00067.mem error: open /data/dedupe/data/table_50/Pentagram/00067.mem: no such file or directory
unable to open: /data/dedupe/data/table_50/Pentagram/00067.mem
github.com/dgraph-io/ristretto/z.OpenMmapFile
    /data/badger_broadband/badger-cluster/vendor/github.com/dgraph-io/ristretto/z/file.go:80
github.com/dgraph-io/badger/v4.(*logFile).open
    /data/badger_broadband/badger-cluster/vendor/github.com/dgraph-io/badger/v4/memtable.go:554
github.com/dgraph-io/badger/v4.(*DB).openMemTable
    /data/badger_broadband/badger-cluster/vendor/github.com/dgraph-io/badger/v4/memtable.go:130
github.com/dgraph-io/badger/v4.(*DB).newMemTable
    /data/badger_broadband/badger-cluster/vendor/github.com/dgraph-io/badger/v4/memtable.go:151
github.com/dgraph-io/badger/v4.(*DB).ensureRoomForWrite
    /data/badger_broadband/badger-cluster/vendor/github.com/dgraph-io/badger/v4/db.go:995
github.com/dgraph-io/badger/v4.(*DB).writeRequests
    /data/badger_broadband/badger-cluster/vendor/github.com/dgraph-io/badger/v4/db.go:829
github.com/dgraph-io/badger/v4.(*DB).doWrites.func1
    /data/badger_broadband/badger-cluster/vendor/github.com/dgraph-io/badger/v4/db.go:884
runtime.goexit
    /usr/local/go/src/runtime/asm_amd64.s:1598


goroutine 3487 [running]:
cluster/badger_cluster.(*Cluster).AddMemberAddress(0xc490f28df0, {​​​​​​​​0xbf7d98, 0xc000194000}​​​​​​​​​​​​​​​, 0x32?, 0x32, {​​​​​​​​​​​​​​​0xc48e35f800, 0x2f, 0x7f521b3908f0?}​​​​​​​​​​​​​​​)
    /data/badger_broadband/badger-cluster/badger_cluster/cluster.go:69 +0x276
main.makeCluster({​​​​​​​​​​​​​​​0xbf7d98, 0xc000194000}​​​​​​​​​​​​​​​, 0x32)
    /data/badger_broadband/badger-cluster/app/create_cluster.go:99 +0x3ea
main.worker(...)
    /data/badger_broadband/badger-cluster/app/create_cluster.go:59
main.main.func1()
    /data/badger_broadband/badger-cluster/app/create_cluster.go:45 +0x8e
created by main.main
    /data/badger_broadband/badger-cluster/app/create_cluster.go:42 +0x7e
  1. In Bytes() function , we are getting negative offset.

negative offset :: -1147080149 EOF negative offset :: -1147080153 negative offset :: -5116465645 EOF negative offset :: -5116465645 EOF negative offset :: -1147080149 EOF negative offset :: -1147080153 negative offset :: -5116465645 EOF

panic: runtime error: slice bounds out of range [-4169358568:] [recovered]
    panic: runtime error: slice bounds out of range [-4169358568:]
    panic: 
== Recovering from initIndex crash ==
File Info: [ID: 95, Size: 69121215, Zeros: 0]
isEnrypted: false checksumLen: 4238479779 
== Recovered ==

goroutine 6932842 [running]:
github.com/dgraph-io```
/badger/v4/table.(*Table).initBiggestAndSmallest.func1.1()
    /data/badger_broadband/badger_grpc/vendor/github.com/dgraph-io/badger/v4/table/table.go:353 +0xa8
panic({​​​​​​​​​​​​​​​0xba9420, 0xc4324db2d8}​​​​​​​​​​​​​​​)
    /usr/local/go/src/runtime/panic.go:884 +0x212
github.com/dgraph-io/ristretto/z.(*MmapFile).Bytes(...)
    /data/badger_broadband/badger_grpc/vendor/github.com/dgraph-io/ristretto/z/file.go:116
github.com/dgraph-io/badger/v4/table.(*Table).read(...)
    /data/badger_broadband/badger_grpc/vendor/github.com/dgraph-io/badger/v4/table/table.go:420
github.com/dgraph-io/badger/v4/table.(*Table).readNoFail(0xd0a740?, 0xc4a7b68810?, 0xbfe74d?)
    /data/badger_broadband/badger_grpc/vendor/github.com/dgraph-io/badger/v4/table/table.go:424 +0xd0
github.com/dgraph-io/badger/v4/table.(*Table).initBiggestAndSmallest.func1()
    /data/badger_broadband/badger_grpc/vendor/github.com/dgraph-io/badger/v4/table/table.go:383 +0x2ff
panic({​​​​​​​​​​​​​​​0xba9420, 0xc4324db2c0}​​​​​​​​​​​​​​​)
    /usr/local/go/src/runtime/panic.go:884 +0x212
github.com/dgraph-io/ristretto/z.(*MmapFile).Bytes(...)
    /data/badger_broadband/badger_grpc/vendor/github.com/dgraph-io/ristretto/z/file.go:116
github.com/dgraph-io/badger/v4/table.(*Table).read(...)
    /data/badger_broadband/badger_grpc/vendor/github.com/dgraph-io/badger/v4/table/table.go:420
github.com/dgraph-io/badger/v4/table.(*Table).readNoFail(0x6?, 0x0?, 0x9?)
    /data/badger_broadband/badger_grpc/vendor/github.com/dgraph-io/badger/v4/table/table.go:424 +0xd0
github.com/dgraph-io/badger/v4/table.(*Table).initIndex(0xc47e194180)
    /data/badger_broadband/badger_grpc/vendor/github.com/dgraph-io/badger/v4/table/table.go:445 +0x109
github.com/dgraph-io/badger/v4/table.(*Table).initBiggestAndSmallest(0xc47e194180)
    /data/badger_broadband/badger_grpc/vendor/github.com/dgraph-io/badger/v4/table/table.go:403 +0x7f
github.com/dgraph-io/badger/v4/table.OpenTable(0xc31e8f4160, {​​​​​​​​​​​​​​​0x0, 0x1, 0x1000000, 0xf33333, 0x0, 0x3f847ae147ae147b, 0x1000, 0x0, 0x1, ...}​​​​​​​​​​​​​​​)
    /data/badger_broadband/badger_grpc/vendor/github.com/dgraph-io/badger/v4/table/table.go:309 +0x278
github.com/dgraph-io/badger/v4/table.CreateTable({​​​​​​​​​​​​​​​0xc167146b80, 0x33}​​​​​​​​​​​​​​​, 0xccef5a4090)
    /data/badger_broadband/badger_grpc/vendor/github.com/dgraph-io/badger/v4/table/table.go:273 +0x305
github.com/dgraph-io/badger/v4.(*levelsController).subcompact.func4(0xccef5a4090, 0xc025c8b6e0?)
    /data/badger_broadband/badger_grpc/vendor/github.com/dgraph-io/badger/v4/levels.go:846 +0x11b
created by github.com/dgraph-io/badger/v4.(*levelsController).subcompact
    /data/badger_broadband/badger_grpc/vendor/github.com/dgraph-io/badger/v4/levels.go:836 +0x707
 

panic: runtime error: invalid memory address or nil pointer dereference
    panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x941247]


goroutine 2551 [running]:
github.com/dgraph-io/badger/v4/table.(*Table).offsetsLength(...)
    /data/badger_broadband/badger_grpc/vendor/github.com/dgraph-io/badger/v4/table/table.go:133
github.com/dgraph-io/badger/v4/table.(*Table).DecrRef(0xc67b5ca9c0)
    /data/badger_broadband/badger_grpc/vendor/github.com/dgraph-io/badger/v4/table/table.go:169 +0x87
github.com/dgraph-io/badger/v4.decrRefs(...)
    /data/badger_broadband/badger_grpc/vendor/github.com/dgraph-io/badger/v4/level_handler.go:178
github.com/dgraph-io/badger/v4.(*levelsController).compactBuildTables.func6()
    /data/badger_broadband/badger_grpc/vendor/github.com/dgraph-io/badger/v4/levels.go:955 +0x56
github.com/dgraph-io/badger/v4.(*levelsController).runCompactDef.func1()
    /data/badger_broadband/badger_grpc/vendor/github.com/dgraph-io/badger/v4/levels.go:1431 +0x25
panic({​​​​​​​​​​​0xb11c60, 0x11587a0}​​​​​​​​​​​)
    /usr/local/go/src/runtime/panic.go:884 +0x213
github.com/dgraph-io/badger/v4/fb.(*TableIndex).StaleDataSize(0x0)
    /data/badger_broadband/badger_grpc/vendor/github.com/dgraph-io/badger/v4/fb/TableIndex.go:132 +0x1d
github.com/dgraph-io/badger/v4/table.(*Table).StaleDataSize(0x0?)
    /data/badger_broadband/badger_grpc/vendor/github.com/dgraph-io/badger/v4/table/table.go:667 +0x1e
github.com/dgraph-io/badger/v4.(*levelHandler).addSize(...)
    /data/badger_broadband/badger_grpc/vendor/github.com/dgraph-io/badger/v4/level_handler.go:214
github.com/dgraph-io/badger/v4.(*levelHandler).replaceTables(0xc242f49140, {​​​​​​​​​​​0xc48d01a248, 0x1, 0x4208767?}​​​​​​​​​​​, {​​​​​​​​​​​0xc6fd7f5a98, 0x1, 0xc48d02b800?}​​​​​​​​​​​)
    /data/badger_broadband/badger_grpc/vendor/github.com/dgraph-io/badger/v4/level_handler.go:137 +0x2b3
github.com/dgraph-io/badger/v4.(*levelsController).runCompactDef(_, _, _, {​​​​​​​​​​​0xc48bea9250, 0x1, {​​​​​​​​​​​0x4, {​​​​​​​​​​​0xc48d02b7c0, 0x7, 0x7}​​​​​​​​​​​, {​​​​​​​​​​​0xc48d02b800, ...}​​​​​​​​​​​}​​​​​​​​​​​, ...}​​​​​​​​​​​)
    /data/badger_broadband/badger_grpc/vendor/github.com/dgraph-io/badger/v4/levels.go:1460 +0x43b
github.com/dgraph-io/badger/v4.(*levelsController).doCompact(0xc116de2a80, 0x1, {​​​​​​​​​​​0x5, 0x3ffd1ba93016f62f, 0x3ffd1ba93016f62f, {​​​​​​​​​​​0x0, 0x0, 0x0}​​​​​​​​​​​, {​​​​​​​​​​​0x4, {​​​​​​​​​​​0xc48d02b7c0, ...}​​​​​​​​​​​, ...}​​​​​​​​​​​}​​​​​​​​​​​)
    /data/badger_broadband/badger_grpc/vendor/github.com/dgraph-io/badger/v4/levels.go:1544 +0x538
github.com/dgraph-io/badger/v4.(*levelsController).runCompactor.func2({​​​​​​​​​​​0x5, 0x3ffd1ba93016f62f, 0x3ffd1ba93016f62f, {​​​​​​​​​​​0x0, 0x0, 0x0}​​​​​​​​​​​, {​​​​​​​​​​​0x4, {​​​​​​​​​​​0xc48d02b7c0, 0x7, 0x7}​​​​​​​​​​​, ...}​​​​​​​​​​​}​​​​​​​​​​​)
    /data/badger_broadband/badger_grpc/vendor/github.com/dgraph-io/badger/v4/levels.go:465 +0x78
github.com/dgraph-io/badger/v4.(*levelsController).runCompactor.func3()
    /data/badger_broadband/badger_grpc/vendor/github.com/dgraph-io/badger/v4/levels.go:488 +0x158
github.com/dgraph-io/badger/v4.(*levelsController).runCompactor(0xc116de2a80, 0x1, 0xc2026ffef0)
    /data/badger_broadband/badger_grpc/vendor/github.com/dgraph-io/badger/v4/levels.go:517 +0x382
created by github.com/dgraph-io/badger/v4.(*levelsController).startCompact
    /data/badger_broadband/badger_grpc/vendor/github.com/dgraph-io/badger/v4/levels.go:354 +0x4e
 

Akhilesh53 avatar Sep 21 '23 08:09 Akhilesh53

I also encountered this on v3.2103.2 while writing unit tests on macOS that simulate a power loss and corruption on an embedded device. The following app causes a panic.

I think this could be handled better by adding bounds checking to table.(*Table).initIndex(). If we check readPos for a negative value before calling table.(*Table).readNoFail(), then we could return an error that would bubble up to Open(). Then the caller could handle the error instead of causing an unrecoverable panic.

https://github.com/dgraph-io/badger/blob/v3.2103.2/table/table.go#L430

func main() {
	dirpath := "/tmp/badger-issue-1883"

	// Clean environment
	err := os.RemoveAll(dirpath)
	if err != nil {
		log.Fatal(err)
	}

	// Open new database
	db, err := badger.Open(badger.DefaultOptions(dirpath))
	if err != nil {
		log.Fatal(err)
	}

	// Write to database
	err = db.Update(func(txn *badger.Txn) error {
		e := badger.NewEntry([]byte("key"), []byte("value"))
		if err := txn.SetEntry(e); err != nil {
			return err
		}
		return nil
	})
	if err != nil {
		log.Fatal(err)
	}

	// Close database
	db.Close()

	// Corrupt sst file
	filepath := path.Join(dirpath, "000001.sst")
	os.WriteFile(filepath, []byte("blah"), 0644)

	// Reopen database
	db, err = badger.Open(badger.DefaultOptions(dirpath))
	if err != nil {
		log.Fatal(err)
	}
}
badger 2023/09/26 16:00:47 INFO: All 0 tables opened in 0s
badger 2023/09/26 16:00:47 INFO: Discard stats nextEmptySlot: 0
badger 2023/09/26 16:00:47 INFO: Set nextTxnTs to 0
badger 2023/09/26 16:00:47 INFO: Lifetime L0 stalled for: 0s
badger 2023/09/26 16:00:47 INFO: 
Level 0 [ ]: NumTables: 01. Size: 171 B of 0 B. Score: 0.00->0.00 StaleData: 0 B Target FileSize: 64 MiB
Level 1 [ ]: NumTables: 00. Size: 0 B of 10 MiB. Score: 0.00->0.00 StaleData: 0 B Target FileSize: 2.0 MiB
Level 2 [ ]: NumTables: 00. Size: 0 B of 10 MiB. Score: 0.00->0.00 StaleData: 0 B Target FileSize: 2.0 MiB
Level 3 [ ]: NumTables: 00. Size: 0 B of 10 MiB. Score: 0.00->0.00 StaleData: 0 B Target FileSize: 2.0 MiB
Level 4 [ ]: NumTables: 00. Size: 0 B of 10 MiB. Score: 0.00->0.00 StaleData: 0 B Target FileSize: 2.0 MiB
Level 5 [ ]: NumTables: 00. Size: 0 B of 10 MiB. Score: 0.00->0.00 StaleData: 0 B Target FileSize: 2.0 MiB
Level 6 [B]: NumTables: 00. Size: 0 B of 10 MiB. Score: 0.00->0.00 StaleData: 0 B Target FileSize: 2.0 MiB
Level Done
badger 2023/09/26 16:00:47 INFO: All 1 tables opened in 0s
badger 2023/09/26 16:00:47 INFO: Discard stats nextEmptySlot: 0
panic: runtime error: slice bounds out of range [-1651269992:] [recovered]
	panic: runtime error: slice bounds out of range [-1651269992:]
	panic: 
== Recovering from initIndex crash ==
File Info: [ID: 1, Size: 4, Zeros: 0]
isEnrypted: false checksumLen: 1651269992 
== Recovered ==


goroutine 36 [running]:
github.com/dgraph-io/badger/v3/table.(*Table).initBiggestAndSmallest.func1.1()
	/Users/user/go/pkg/mod/github.com/dgraph-io/badger/[email protected]/table/table.go:351 +0x9c
panic({0x104ecb380, 0x1400002e180})
	/opt/homebrew/Cellar/go/1.20.7/libexec/src/runtime/panic.go:884 +0x204
github.com/dgraph-io/ristretto/z.(*MmapFile).Bytes(...)
	/Users/user/go/pkg/mod/github.com/dgraph-io/[email protected]/z/file.go:116
github.com/dgraph-io/badger/v3/table.(*Table).read(...)
	/Users/user/go/pkg/mod/github.com/dgraph-io/badger/[email protected]/table/table.go:418
github.com/dgraph-io/badger/v3/table.(*Table).readNoFail(0x104f005a8?, 0x140001e5d40?, 0x104d579df?)
	/Users/user/go/pkg/mod/github.com/dgraph-io/badger/[email protected]/table/table.go:422 +0xe0
github.com/dgraph-io/badger/v3/table.(*Table).initBiggestAndSmallest.func1()
	/Users/user/go/pkg/mod/github.com/dgraph-io/badger/[email protected]/table/table.go:381 +0x27c
panic({0x104ecb380, 0x1400002e168})
	/opt/homebrew/Cellar/go/1.20.7/libexec/src/runtime/panic.go:884 +0x204
github.com/dgraph-io/ristretto/z.(*MmapFile).Bytes(...)
	/Users/user/go/pkg/mod/github.com/dgraph-io/[email protected]/z/file.go:116
github.com/dgraph-io/badger/v3/table.(*Table).read(...)
	/Users/user/go/pkg/mod/github.com/dgraph-io/badger/[email protected]/table/table.go:418
github.com/dgraph-io/badger/v3/table.(*Table).readNoFail(0x1400009d948?, 0x10497f140?, 0x1400009d948?)
	/Users/user/go/pkg/mod/github.com/dgraph-io/badger/[email protected]/table/table.go:422 +0xe0
github.com/dgraph-io/badger/v3/table.(*Table).initIndex(0x1400015a240)
	/Users/user/go/pkg/mod/github.com/dgraph-io/badger/[email protected]/table/table.go:443 +0xf0
github.com/dgraph-io/badger/v3/table.(*Table).initBiggestAndSmallest(0x1400015a240)
	/Users/user/go/pkg/mod/github.com/dgraph-io/badger/[email protected]/table/table.go:401 +0x5c
github.com/dgraph-io/badger/v3/table.OpenTable(0x140000cc140, {0x0, 0x1, 0x200000, 0x0, 0x0, 0x3f847ae147ae147b, 0x1000, 0x0, 0x1, ...})
	/Users/user/go/pkg/mod/github.com/dgraph-io/badger/[email protected]/table/table.go:308 +0x248
github.com/dgraph-io/badger/v3.newLevelsController.func1({0x140002ee150, 0x21}, {0x0?, 0x0?, 0x0?})
	/Users/user/go/pkg/mod/github.com/dgraph-io/badger/[email protected]/levels.go:150 +0x198
created by github.com/dgraph-io/badger/v3.newLevelsController
	/Users/user/go/pkg/mod/github.com/dgraph-io/badger/[email protected]/levels.go:129 +0x488
make[1]: *** [run] Error 2
make: *** [run] Error 2

sheabot avatar Sep 26 '23 22:09 sheabot

I often encounter this problem. The database will be damaged when the Windows system is physically powered off or has a blue screen. There should be data that has not been written to the disk.

peiyangli avatar Oct 13 '23 03:10 peiyangli

Hey guys I wrote a draft PR on a badger fork for internal use on the company where I work on, and on my manual tests it fixed this issue by allowing the caller of the Open() function to handle panics in an appropriate way:

https://github.com/VinGarcia/badger/pull/1

@mangalaman93 if you think this solution looks promising I can polish it up, add some tests and create a proper PR for the official badger repo.

Let me know what you think.

Oh and I can't share the database I am using because it contains info private to my current company, but I will try to produce a broken database by repeatedly forcibly killing a Go process that is reading and writing to a sample database. I will try to get it done tonight.

~[edit]: The panics are being forwarded correctly but I found an issue so the PR is not working properly yet, I am working on it [/edit]~ it's fixed now I think.

VinGarcia avatar Dec 06 '23 14:12 VinGarcia

@mangalaman93 I just created a new PR based on badger V4:

https://github.com/dgraph-io/badger/pull/2033

VinGarcia avatar Dec 07 '23 14:12 VinGarcia

I could reproduce the panic by replacing an *.sst file with an empty *.sst. It would be great to have this fixed, so we can react when a database is corrupt

panagiotis-bitharis avatar Feb 08 '24 12:02 panagiotis-bitharis

This issue has been stale for 60 days and will be closed automatically in 7 days. Comment to keep it open.

github-actions[bot] avatar Jul 19 '24 02:07 github-actions[bot]