certificates icon indicating copy to clipboard operation
certificates copied to clipboard

[Bug]: On start up exiting with "Value log truncate required to run DB. This might result in data loss" error

Open aburgett87 opened this issue 2 years ago • 8 comments

Steps to Reproduce

With an instance of step-ca already running, start up a new instance

Your Environment

  • OS - docker (linux/amd64)
  • step-ca Version - 0.23.0

Expected Behavior

The new instance loads successfully

Actual Behavior

The new instance exists with the following logs

badger 2023/01/11 23:40:54 INFO: All 1 tables opened in 6ms
badger 2023/01/11 23:40:54 INFO: Replaying file id: 0 at offset: 291060
badger 2023/01/11 23:40:54 WARNING: Truncate Needed. File /home/step/db/000000.vlog size: 296890 Endoffset: 291060
Error opening database of Type badgerv2 with source /home/step/db: error opening Badger database: During db.vlog.open: Value log truncate required to run DB. This might result in data loss

Additional Context

This occurs when performing a blue green deployment

The instance that's running has the following logs

badger 2023/01/11 04:00:02 INFO: All 1 tables opened in 4ms
badger 2023/01/11 04:00:02 INFO: Replaying file id: 0 at offset: 4379
badger 2023/01/11 04:00:02 INFO: Replay took: 10.556853ms
2023/01/11 04:00:02 Starting Smallstep CA/0.23.0 (linux/amd64)
2023/01/11 04:00:02 Documentation: https://u.step.sm/docs/ca
2023/01/11 04:00:02 Community Discord: https://u.step.sm/discord
2023/01/11 04:00:02 Config file: /home/step/config/ca.json
2023/01/11 04:00:02 The primary server URL is **REDACTED**
2023/01/11 04:00:02 Root certificates are available at **REDACTED**
2023/01/11 04:00:02 X.509 Root Fingerprint: 6f2a88a4e76fc8e886fe5db652c6e14f6bff34e46d29262e4e13fa3d09a1543a
2023/01/11 04:00:02 Serving HTTPS on :9000 ...
time="2023-01-11T13:34:24Z" level=info certificate=**REDACTED**
time="2023-01-11T13:44:18Z" level=info certificate=**REDACTED**
time="2023-01-11T19:30:27Z" level=info certificate=**REDACTED**
time="2023-01-11T22:04:11Z" level=info duration="182.506µs" duration-ns=182506 fields.time="2023-01-11T22:04:11Z" method=GET name=ca path="/provisioners?limit=100" protocol=HTTP/2.0 referer= remote-address=10.253.6.252 request-id=cevj4mtov6f7al6khei0 size=869 status=200 user-agent="Smallstep CLI/0.23.0 (linux/amd64)" user-id=
time="2023-01-11T22:04:11Z" level=info duration="88.074µs" duration-ns=88074 fields.time="2023-01-11T22:04:11Z" method=GET name=ca path=/provisioners/WOKl_cXXYe1FL3VfpNqg72jCDrOW_jbwc1D3KxIaan0/encrypted-key protocol=HTTP/2.0 referer= remote-address=10.253.6.252 request-id=cevj4mtov6f7al6kheig size=586 status=200 user-agent="Smallstep CLI/0.23.0 (linux/amd64)" user-id=
time="2023-01-11T22:04:13Z" level=info duration="86.383µs" duration-ns=86383 fields.time="2023-01-11T22:04:13Z" method=GET name=ca path=/root/6f2a88a4e76fc8e886fe5db652c6e14f6bff34e46d29262e4e13fa3d09a1543a protocol=HTTP/2.0 referer= remote-address=10.253.6.252 request-id=cevj4ndov6f7al6khej0 size=764 status=200 user-agent="Smallstep CLI/0.23.0 (linux/amd64)" user-id=
time="2023-01-11T22:04:13Z" level=info certificate=**REDACTED**

Contributing

Vote on this issue by adding a 👍 reaction. To contribute a fix for this issue, leave a comment (and link to your pull request, if you've opened one already).

aburgett87 avatar Jan 11 '23 23:01 aburgett87

Appears to be related https://github.com/dgraph-io/badger/issues/1353

aburgett87 avatar Jan 12 '23 00:01 aburgett87

Hey @aburgett87 👋 . Thanks for opening the issue! And apologies that it's taken so long to get back to you.

A bit of context - Badger is the default database used by step-ca, but it's not the database we recommend when considering a production level deploy (https://smallstep.com/docs/step-ca/certificate-authority-server-production#high-availability). We recommend more robust databases because we don't have capacity to address and assist with database specific issues.

For this specific case, as you linked in your comment, this appears to be an issue specific to Badger DB. I think our questions is, what are your expectations from step-ca here? (Serious question - there's a number of things that step-ca could do, but we've generally tried to stay clear of assisting with DB management).

dopey avatar Feb 02 '23 21:02 dopey

Hi @dopey No worries. That's understandable and I'm planning to move to a postgres SQL database ASAP. I suppose some information or guidance what I should do to clear it would be suffice. To get the services from bouncing up and down I've removed the database, let it re-create it and everything appears fine. I don't have a clue about the ramifications of doing that.

aburgett87 avatar Feb 06 '23 04:02 aburgett87

What might help:

I have a go-Program that reads the database and lists the certificates in openssl index.txt format.

While this was read-only, and step is not coming up and does not lock the file, I have added the truncate-option (.WithTruncate(true)) to this program. Tada, this cleans up the mess. After that step will come up again.

I have not fully verified what was lost, however I have the last created certificates from two days ago. Full backup was 3 days old, so fixed.

func main() {
        db, err := badger.Open(badger.DefaultOptions(os.Args[1]).WithTruncate(true))
        if err != nil {
                panic(err)
        }
        defer db.Close()

        listCerts(db, []byte("x509_certs"))
        //      listRevoked(db, []byte("revoked_x509_certs"))
}

logopk avatar Sep 15 '23 12:09 logopk

On windows I can't use step-ca because after every restart this error comes up. CA is not useable anymore.

Marvo2011 avatar Jan 25 '24 07:01 Marvo2011

@Marvo2011 this sounds like a case of https://github.com/dgraph-io/badger/issues/744. The issue mentions truncating the DB helps, but that's not what we should make it do by default (I think?). I also see a mention of properly shutting down the process and closing the DB, so maybe that's not happening. It's likely that this shutdown needs something specific on Windows, because I'm starting/stopping CAs all the time on macOS, and I don't see the same issue. That said, it can also happen due to an unexpected stop of the CA, so we may want this for other OSs too.

We might be able to do something similar as done in https://github.com/decred/dcrdex/pull/2388, e.g.:

db, err := badger.Open(opts)
if errors.Is(err, badger.ErrTruncateNeeded) {
    // https://github.com/dgraph-io/badger/issues/744
    log.Warnf("NewFileDB badger db: %v", err)
    // Try again with value log truncation enabled.
    opts.Truncate = true
    log.Warnf("Attempting to reopen badger DB with the Truncate option set...")
    db, err = badger.Open(opts)
}

Short term solution would be to use a different DB backend, but I don't know if that's an option for you.

hslatman avatar Jan 25 '24 08:01 hslatman

@hslatman Thanks, yes seams related to Windows. I moved to a debian system here start/stop works fine. A different DB backend is also a solution which are supported?

Marvo2011 avatar Jan 25 '24 10:01 Marvo2011

@Marvo2011 you can try the MySQL and Postgres ones. Admittedly, they are a bit "heavier" to use than badger, but they shouldn't have this issue.

hslatman avatar Jan 25 '24 10:01 hslatman