ERRO[6058] stats executor errorserialQueue panicked running work: assertion failed: empty chunk returned from ChunkStore

robertabcd opened this issue 8 months ago • 15 comments

This error message is showing up occasionally

ERRO[6058] stats executor errorserialQueue panicked running work: assertion failed: empty chunk returned from ChunkStore

when running the following workload against dolt sql-server (sketched in SQL after the list):

  1. do a lot of batch inserts
  2. add, then commit
  3. checkout a different branch
  4. do a lot of batch inserts
  5. add, then commit
  6. checkout the previous branch
  7. repeat from step 1
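
A minimal SQL sketch of one loop iteration (table, columns, and branch names here are placeholders, not the script's actual schema):

-- placeholder schema; the real workload uses its own tables and branches
CALL DOLT_CHECKOUT('branchA');
INSERT INTO t (id, val) VALUES (1, 'a'), (2, 'b');  -- repeated for many batches
CALL DOLT_ADD('-A');
CALL DOLT_COMMIT('-m', 'batch inserts on branchA');
CALL DOLT_CHECKOUT('branchB');
INSERT INTO t (id, val) VALUES (3, 'c'), (4, 'd');  -- another run of batch inserts
CALL DOLT_ADD('-A');
CALL DOLT_COMMIT('-m', 'batch inserts on branchB');
-- checkout the previous branch and repeat
-- (the full repro script also calls DOLT_GC(); see the GC discussion below)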

The binary was built from commit 8855fb98e727d56c98780dc66f28e9e704841c64 with the following debug code added:

diff --git a/go/libraries/doltcore/sqle/statspro/jobqueue/serialqueue.go b/go/libraries/doltcore/sqle/statspro/jobqueue/serialqueue.go
index f0cfd62a94..6d4f7620fc 100644
--- a/go/libraries/doltcore/sqle/statspro/jobqueue/serialqueue.go
+++ b/go/libraries/doltcore/sqle/statspro/jobqueue/serialqueue.go
@@ -18,6 +18,7 @@ import (
        "context"
        "errors"
        "fmt"
+       "runtime/debug"
        "sync"
        "sync/atomic"
        "time"
@@ -429,7 +430,7 @@ func (s *SerialQueue) runRunner(ctx context.Context) {
                                var err error
                                defer func() {
                                        if r := recover(); r != nil {
-                                               err = fmt.Errorf("serialQueue panicked running work: %s", r)
+                                               err = fmt.Errorf("serialQueue panicked running work: %s\n%s", r, string(debug.Stack()))
                                        }
                                        if err != nil {
                                                s.errCb(err)

Error log with stack trace:

ERRO[6058] stats executor errorserialQueue panicked running work: assertion failed: empty chunk returned from ChunkStore
goroutine 65 [running]:
runtime/debug.Stack()
        /usr/local/go/src/runtime/debug/stack.go:26 +0x5e
github.com/dolthub/dolt/go/libraries/doltcore/sqle/statspro/jobqueue.(*SerialQueue).runRunner.func1.1()
        /dolt/go/libraries/doltcore/sqle/statspro/jobqueue/serialqueue.go:433 +0x4c
panic({0x2c48280?, 0xc04c3d5ba0?})
        /usr/local/go/src/runtime/panic.go:792 +0x132
github.com/dolthub/dolt/go/store/prolly/tree.assertTrue(...)
        /dolt/go/store/prolly/tree/node_cursor.go:634
github.com/dolthub/dolt/go/store/prolly/tree.nodeStore.Read({{0x3bf55a8, 0xc00200a078}, {{0xc0018d8f90, 0xc0018d8ff0, 0xc0018d9050, 0xc0018d90b0, 0xc0018d9110, 0xc0018d9170, 0xc0018d91d0, 0xc0018d9230, ...}}, ...}, ...)
        /dolt/go/store/prolly/tree/node_store.go:103 +0x646
github.com/dolthub/dolt/go/store/prolly/tree.fetchChild(...)
        /dolt/go/store/prolly/tree/node_cursor.go:629
github.com/dolthub/dolt/go/store/prolly/tree.newCursorFromSearchFn({0x3b59de0, 0xc001fda3c0}, {0x3be7e30, 0xc001b72000}, {{0x98, 0x181, 0x8c, 0x6, 0x0, 0x0}, ...}, ...)
        /dolt/go/store/prolly/tree/node_cursor.go:171 +0x3d9
github.com/dolthub/dolt/go/store/prolly/tree.newCursorAtOrdinal({0x3b59de0, 0xc001fda3c0}, {0x3be7e30, 0xc001b72000}, {{0x98, 0x181, 0x8c, 0x6, 0x0, 0x0}, ...}, ...)
        /dolt/go/store/prolly/tree/node_cursor.go:105 +0x185
github.com/dolthub/dolt/go/store/prolly/tree.StaticMap[...].IterOrdinalRange(0x3c13500, {0x3b59de0, 0xc001fda3c0}, 0x4e22d5, 0x4e35f4)
        /dolt/go/store/prolly/tree/map.go:380 +0x116
github.com/dolthub/dolt/go/store/prolly.Map.IterOrdinalRange(...)
        /dolt/go/store/prolly/tuple_map.go:293
github.com/dolthub/dolt/go/libraries/doltcore/sqle/statspro.(*StatsController).collectIndexNodes.func2()
        /dolt/go/libraries/doltcore/sqle/statspro/worker.go:347 +0x3be
github.com/dolthub/dolt/go/libraries/doltcore/sqle/statspro/jobqueue.(*SerialQueue).DoSyncSessionAware.func1()
        /dolt/go/libraries/doltcore/sqle/statspro/jobqueue/serialqueue.go:272 +0xd5
github.com/dolthub/dolt/go/libraries/doltcore/sqle/statspro/jobqueue.(*SerialQueue).runRunner.func1(0xc001d85f48?, {0xc042ff2690?, 0xc06c74a460?, 0x0?})
        /dolt/go/libraries/doltcore/sqle/statspro/jobqueue/serialqueue.go:439 +0x62
github.com/dolthub/dolt/go/libraries/doltcore/sqle/statspro/jobqueue.(*SerialQueue).runRunner(0xc0019ff8f0, {0x3b59a90, 0xc001fe0d70})
        /dolt/go/libraries/doltcore/sqle/statspro/jobqueue/serialqueue.go:440 +0xcd
github.com/dolthub/dolt/go/libraries/doltcore/sqle/statspro/jobqueue.(*SerialQueue).Run.func2()
        /dolt/go/libraries/doltcore/sqle/statspro/jobqueue/serialqueue.go:167 +0x4f
created by github.com/dolthub/dolt/go/libraries/doltcore/sqle/statspro/jobqueue.(*SerialQueue).Run in goroutine 63
        /dolt/go/libraries/doltcore/sqle/statspro/jobqueue/serialqueue.go:165 +0x16e

robertabcd • Apr 07 '25 04:04

Might be related to #9001, but it no longer surfaces as an unrecovered panic.

robertabcd • Apr 07 '25 04:04

@max-hoffman will look at this today. Is this on 1.51.0+? We had a major stats refactor just land.

timsehn • Apr 07 '25 14:04

I see. You built your own binary, so that should be > 1.51.0.

timsehn • Apr 07 '25 14:04

We report this error when a counting bug leads us to read past the end of the tree. Trying to find the arithmetic issue now.

max-hoffman • Apr 07 '25 16:04

@robertabcd I have a test loop that's alternating between:

  1. write a batch of inserts to branchA, commit
  2. write a batch of inserts to branchB, commit

I am having a hard time reproducing it. Are there any other details you can provide that might help me? Schema, multiple databases, volume of writes, non-standard indexes, etc. If you're comfortable instrumenting on your side, the thing I would do if I had a repro is (1) catch the panic inside IterOrdinalRange and (2) print the tree size and the start node leading to the panic. If those numbers look OK, I would move up to printing the specific database/branch/table name/index.

max-hoffman • Apr 07 '25 17:04

@max-hoffman Here is a stripped-down version of the workload. I can repro this 3 out of 3 runs on my machine. (I can't upload it here since .py attachments aren't allowed.) https://gist.github.com/robertabcd/8778a6d81bd262d73d2be0496119e669

To run:

  • Install PyMySQL version 1.1.1

Then:

$ python stats-repro.py init
$ ( cd test.doltdb/ && dolt sql-server ) &
$ python stats-repro.py run

It starts happening on my machine after ~60 minutes (on an SSD), when the logged connectionID is around 50-100.

robertabcd • Apr 08 '25 02:04

@robertabcd Thank you for the thorough repro! I am running the script, hopefully I'll start seeing the error and can root cause in a few hours.

max-hoffman • Apr 08 '25 15:04

It might take me a bit longer than an hour; each flush takes me 6 minutes, and it's adding 30 seconds every flush.

Stats can be turned off for an active server with call dolt_stats_purge(), and you can prevent a server from initializing stats with SET @@PERSIST.dolt_stats_enable = 0 in the meantime.
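
For reference, as SQL statements:

-- turn off stats for an active server
CALL dolt_stats_purge();

-- prevent stats from initializing on the next server start
SET @@PERSIST.dolt_stats_enable = 0;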

max-hoffman • Apr 08 '25 18:04

The update on this: I think this is a rare GC-stats interaction, where GC removes a chunk that it has deemed unreachable before stats tries to read it. We made some optimizations so that GC finalization doesn't block on stats completing, which can take many minutes, but this seems to be an unexpected tradeoff. I am trying to understand whether I should (1) make GC visiting checks more active or (2) add defensive code at stats read time. In the meantime, calling DOLT_GC after every commit may be more often than necessary, depending on your exact use case. We have online GC that should take care of cleanup, but it will probably run ~1000x less often.
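
For example, assuming your loop currently follows every commit with an explicit GC, the interim change is just dropping that call and relying on online GC (the commit statement here is a placeholder):

-- current loop: explicit GC after every commit
CALL DOLT_COMMIT('-m', 'batch');
CALL DOLT_GC();

-- interim: commit only, and let online GC (or a much less frequent manual GC) handle cleanup
CALL DOLT_COMMIT('-m', 'batch');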

max-hoffman • Apr 09 '25 15:04

@max-hoffman Thank you for the insights. I'll look into disabling stats and not calling GC manually. Do you know if this error has any impact on the correctness of the data outside of stats?

robertabcd • Apr 11 '25 04:04

@robertabcd This panic has no impact on data correctness. I am still working on trying to plug the hole, hopefully will have a fix soon.

max-hoffman • Apr 11 '25 14:04

@robertabcd This PR appears to stabilize the script: https://github.com/dolthub/dolt/pull/9106. I've been running it for a while; the throughput is much higher, and I have not seen panics yet. Working the change through CI now.

max-hoffman • Apr 11 '25 18:04

It's been running smoothly for a couple of hours now, though I'm still seeing a spike in latency that I can't explain:

flush done in 44.35 seconds
flush done in 43.91 seconds
flush done in 537.55 seconds
flush done in 550.21 seconds
flush done in 602.78 seconds
flush done in 618.09 seconds
flush done in 568.94 seconds
flush done in 602.48 seconds
flush done in 2029.91 seconds
flushing dolt database

max-hoffman • Apr 11 '25 20:04

The fix PR is almost ready, waiting on a colleague on vacation to sign off when he's back Wednesday.

I was running this over the weekend and some of the cycles took up to 10 hours, up from 40 seconds originally. As far as I can tell, none of that latency is from the original import or from stats (after letting GC interrupt stats). It's possible there are other import issues, but to the best of my understanding, repeated calls to call dolt_gc() without a full GC (call dolt_gc('-full')) accumulate many small tablefiles that have yet to be compacted.

I am running a full GC to see whether compacting many small tablefiles restores the original write throughput. It might take overnight to run.

max-hoffman • Apr 14 '25 21:04

I'm seeing one other notable thing related to GC. Every dolt_gc adds a new tablefile to our oldgen store. Each tablefile in oldgen is indexed and accessed separately from the others, and the indexing strategy is cheaper but slower than the active append-only journal file. A pattern where we have hundreds of oldgen tablefiles means every write has to access each of these files. In a big database, I think this means potentially hundreds of disk seeks per insert.

So when I compact hundreds of tablefiles down to 1 with call dolt_gc('-full'), the write throughput increases dramatically. The GC time also goes back to being fast for a short period of time. Disabling stats/GC doesn't noticeably affect the throughput.

This is worse on my system because I'm using an HDD, but you're going to see similar throughput throttling on an SSD. I would stick with disabling your GC calls for now, or do a call dolt_gc('-full') periodically when throughput starts suffering (full GC time also currently scales poorly with the number of tablefiles).
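
As a SQL session, that periodic compaction is just:

-- compact the accumulated tablefiles (oldgen included) into one;
-- run occasionally, e.g. when write throughput starts to degrade
CALL DOLT_GC('-full');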

max-hoffman • Apr 15 '25 21:04