Docker container crash
Hi there! I've just seen that the Docker container has crashed. Here is the error I get when I run docker logs:
{"time":"2024-06-27T03:43:28.210460272Z","level":"INFO","msg":"creating new snapshot","component":"snapshot","path":"/app/data/snapshots/snapshots/13-9206-1719459808210.tmp"}
2024-06-27T03:43:28.404Z [ERROR] raft: failed to take snapshot: error="failed to persist snapshot: arrow/ipc: tried to write record batch with different schema"
2024-06-27T03:46:14.454Z [INFO] raft: starting snapshot up to: index=9210
{"time":"2024-06-27T03:46:14.454634112Z","level":"INFO","msg":"creating new snapshot","component":"snapshot","path":"/app/data/snapshots/snapshots/13-9210-1719459974454.tmp"}
2024-06-27T03:46:14.646Z [ERROR] raft: failed to take snapshot: error="failed to persist snapshot: arrow/ipc: tried to write record batch with different schema"
2024-06-27T03:48:25.736Z [INFO] raft: starting snapshot up to: index=9212
{"time":"2024-06-27T03:48:25.736896922Z","level":"INFO","msg":"creating new snapshot","component":"snapshot","path":"/app/data/snapshots/snapshots/13-9212-1719460105736.tmp"}
2024-06-27T03:48:25.906Z [ERROR] raft: failed to take snapshot: error="failed to persist snapshot: arrow/ipc: tried to write record batch with different schema"
2024-06-27T03:52:12.032Z [INFO] raft: starting snapshot up to: index=9219
{"time":"2024-06-27T03:52:12.032933328Z","level":"INFO","msg":"creating new snapshot","component":"snapshot","path":"/app/data/snapshots/snapshots/13-9219-1719460332032.tmp"}
{"time":"2024-06-27T03:52:12.085446794Z","level":"INFO","msg":"exiting compaction loop","component":"lsm-tree"}
panic: arrow/array: field 15 has 18410 rows. want=18412
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x50 pc=0xec2544]
goroutine 105 [running]:
github.com/vinceanalytics/vince/internal/lsm.(*PartStore).Compact.func1()
/home/runner/work/vince/vince/internal/lsm/part.go:114 +0xa4
panic({0x18a4600?, 0xc01402a550?})
/opt/hostedtoolcache/go/1.22.0/x64/src/runtime/panic.go:770 +0x132
github.com/apache/arrow/go/v15/arrow/array.(*RecordBuilder).NewRecord(0xc007530180)
/home/runner/go/pkg/mod/github.com/apache/arrow/go/[email protected]/arrow/array/record.go:349 +0x28e
github.com/vinceanalytics/vince/internal/staples.(*Merger).NewRecord(0xc0075565e8?, {{0x0, 0x0, 0x0}, {0x0, 0x0, 0x0}})
/home/runner/work/vince/vince/internal/staples/arrow.go:176 +0x67
github.com/vinceanalytics/vince/internal/lsm.(*PartStore).Compact(0xc0075301c0, 0xc00d579e58)
/home/runner/work/vince/vince/internal/lsm/part.go:137 +0x145
github.com/vinceanalytics/vince/internal/lsm.(*Tree).Compact(0xc0000ea100, 0x0)
/home/runner/work/vince/vince/internal/lsm/lsm.go:268 +0xd6
github.com/vinceanalytics/vince/internal/lsm.(*Tree).Start(0xc0000ea100, {0x1d4d048, 0xc007530c80})
/home/runner/work/vince/vince/internal/lsm/lsm.go:240 +0x288
created by github.com/vinceanalytics/vince/internal/session.(*Session).Start in goroutine 1
/home/runner/work/vince/vince/internal/session/session.go:139 +0x85
{"time":"2024-06-27T11:54:59.758964586Z","level":"INFO","msg":"Opening store","component":"store","nodeId":"1","listening":"localhost:4002"}
{"time":"2024-06-27T11:54:59.773558502Z","level":"INFO","msg":"Setup log info","component":"store","firstIdxOnOpen":1,"lastIdxOnOpen":9219,"lastAppliedIdxOnOpen":0,"lastCommandIdxOnOpen":0}
{"time":"2024-06-27T11:54:59.782291961Z","level":"WARN","msg":"found temporary snapshot","component":"snapshot","name":"13-9219-1719460332032.tmp"}
2024-06-27T11:55:00.400Z [INFO] raft: initial configuration: index=1 servers="[{Suffrage:Voter ID:1 Address:localhost:4002}]"
{"time":"2024-06-27T11:55:00.40124848Z","level":"INFO","msg":"starting server","addr":":8080"}
Another crash:
{"time":"2024-06-27T15:32:27.471176779Z","level":"INFO","msg":"creating new snapshot","component":"snapshot","path":"/app/data/snapshots/snapshots/14-9651-1719502347471.tmp"}
2024-06-27T15:32:27.674Z [ERROR] raft: failed to take snapshot: error="failed to persist snapshot: arrow/ipc: tried to write record batch with different schema"
2024-06-27T15:35:00.319Z [INFO] raft: starting snapshot up to: index=9657
{"time":"2024-06-27T15:35:00.320042783Z","level":"INFO","msg":"creating new snapshot","component":"snapshot","path":"/app/data/snapshots/snapshots/14-9657-1719502500320.tmp"}
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x50 pc=0xec2544]
goroutine 111 [running]:
github.com/vinceanalytics/vince/internal/lsm.(*PartStore).Compact.func1()
/home/runner/work/vince/vince/internal/lsm/part.go:114 +0xa4
github.com/vinceanalytics/vince/internal/lsm.(*PartStore).Compact(0xc007544200, 0xc0071dba10)
/home/runner/work/vince/vince/internal/lsm/part.go:146 +0x2e5
github.com/vinceanalytics/vince/internal/lsm.(*Tree).Compact(0xc007176300, 0xc00d745008)
/home/runner/work/vince/vince/internal/lsm/lsm.go:268 +0xd6
github.com/vinceanalytics/vince/internal/session.(*Session).Persist(0xc0072fa0f0, 0xc00d745008)
/home/runner/work/vince/vince/internal/session/session.go:101 +0x30
github.com/vinceanalytics/vince/internal/cluster/snapshots.(*Arrow).Backup(0xc00d64f0b0, {0x76bb44ea0820, 0xc00009ba40})
/home/runner/work/vince/vince/internal/cluster/snapshots/lsm.go:38 +0x13e
github.com/vinceanalytics/vince/internal/cluster/snapshots.(*Arrow).Persist(0xc00d640570?, {0x1d4d390?, 0xc00009ba40?})
/home/runner/work/vince/vince/internal/cluster/snapshots/lsm.go:27 +0x3e
github.com/hashicorp/raft.(*Raft).takeSnapshot(0xc00714c2c8)
/home/runner/go/pkg/mod/github.com/hashicorp/[email protected]/snapshot.go:190 +0x8c8
github.com/hashicorp/raft.(*Raft).runSnapshots(0xc00714c2c8)
/home/runner/go/pkg/mod/github.com/hashicorp/[email protected]/snapshot.go:82 +0x3ee
github.com/hashicorp/raft.(*raftState).goFunc.func1()
/home/runner/go/pkg/mod/github.com/hashicorp/[email protected]/state.go:149 +0x56
created by github.com/hashicorp/raft.(*raftState).goFunc in goroutine 1
/home/runner/go/pkg/mod/github.com/hashicorp/[email protected]/state.go:147 +0x79
I don't know if it's the same error as previously or not.
Hi, it seems there was a bug during compaction that caused things to spiral:
{"time":"2024-06-27T03:52:12.085446794Z","level":"INFO","msg":"exiting compaction loop","component":"lsm-tree"} panic: arrow/array: field 15 has 18410 rows. want=18412
This message signals that. It probably means a corrupt snapshot was saved. One remedy is to manually delete the offending snapshot: deleting /app/data/snapshots/snapshots/13-9212-1719460105736.tmp might help.
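To illustrate what that panic means: Arrow's RecordBuilder requires every column builder to hold the same number of rows before NewRecord is called, and it panics as soon as one column falls behind. The following is a minimal standalone Go sketch against the same apache/arrow/go/v15 API (not vince's code; the column names are made up) that trips the same class of panic:

package main

import (
	"fmt"

	"github.com/apache/arrow/go/v15/arrow"
	"github.com/apache/arrow/go/v15/arrow/array"
	"github.com/apache/arrow/go/v15/arrow/memory"
)

func main() {
	// Two-column schema with invented names, purely for illustration.
	schema := arrow.NewSchema([]arrow.Field{
		{Name: "ts", Type: arrow.PrimitiveTypes.Int64},
		{Name: "page", Type: arrow.BinaryTypes.String},
	}, nil)

	b := array.NewRecordBuilder(memory.NewGoAllocator(), schema)
	defer b.Release()

	// Append two rows to the first column ...
	b.Field(0).(*array.Int64Builder).AppendValues([]int64{1, 2}, nil)
	// ... but only one row to the second column.
	b.Field(1).(*array.StringBuilder).Append("/home")

	// NewRecord checks that every field builder holds the same number of
	// rows and panics otherwise, e.g. "arrow/array: field 1 has 1 rows.
	// want=2" -- the same class of failure as the
	// "field 15 has 18410 rows. want=18412" panic in the log above.
	rec := b.NewRecord()
	defer rec.Release()
	fmt.Println(rec)
}

Judging from the stack trace, the row-count mismatch happens while merging parts during compaction, which is presumably how a half-written snapshot ended up on disk and why the later snapshot attempts keep failing with the schema error.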
I am still working on a better storage engine for vince using roaring bitmaps. All of these issues will go away with it.
I deleted all the snapshots, but the problem seems to come back quite quickly.
{"time":"2024-06-27T16:57:46.228615155Z","level":"INFO","msg":"creating new snapshot","component":"snapshot","path":"/app/data/snapshots/snapshots/15-9731-1719507466228.tmp"}
2024-06-27T16:57:46.434Z [ERROR] raft: failed to take snapshot: error="failed to persist snapshot: arrow/ipc: tried to write record batch with different schema"
{"time":"2024-06-27T16:59:02.739344349Z","level":"INFO","msg":"Exiting flushing loop","component":"session"}
{"time":"2024-06-27T16:59:02.739401442Z","level":"INFO","msg":"exiting compaction loop","component":"lsm-tree"}
{"time":"2024-06-27T16:59:02.800452698Z","level":"INFO","msg":"store closed ","component":"store","nodeId":"1","listen_address":"localhost:4002"}
{"time":"2024-06-27T16:59:02.800512641Z","level":"ERROR","msg":"Exited process","err":"http: Server closed"}
{"time":"2024-06-27T16:59:03.599825893Z","level":"INFO","msg":"Opening store","component":"store","nodeId":"1","listening":"localhost:4002"}
{"time":"2024-06-27T16:59:03.608718775Z","level":"INFO","msg":"Setup log info","component":"store","firstIdxOnOpen":1,"lastIdxOnOpen":9731,"lastAppliedIdxOnOpen":0,"lastCommandIdxOnOpen":0}
2024-06-27T16:59:03.724Z [INFO] raft: initial configuration: index=1 servers="[{Suffrage:Voter ID:1 Address:localhost:4002}]"
2024-06-27T16:59:03.724Z [INFO] raft: entering follower state: follower="Node at localhost:4002 [Follower]" leader-address= leader-id=
{"time":"2024-06-27T16:59:03.725361508Z","level":"INFO","msg":"starting server","addr":":8080"}
{"time":"2024-06-27T16:59:03.725644953Z","level":"INFO","msg":"Start compaction loop","component":"lsm-tree","interval":"10m0s","compactSize":"16MiB"}
{"time":"2024-06-27T16:59:03.725798869Z","level":"INFO","msg":"Starting session flushing loop","component":"session","interval":"1m0s"}
{"time":"2024-06-27T16:59:03.725952546Z","level":"INFO","msg":"Starting grpc server","addr":"127.0.0.1:4002"}
2024-06-27T16:59:05.217Z [WARN] raft: heartbeat timeout reached, starting election: last-leader-addr= last-leader-id=
2024-06-27T16:59:05.217Z [INFO] raft: entering candidate state: node="Node at localhost:4002 [Candidate]" term=16
2024-06-27T16:59:05.226Z [DEBUG] raft: voting for self: term=16 id=1
2024-06-27T16:59:05.244Z [DEBUG] raft: calculated votes needed: needed=1 term=16
2024-06-27T16:59:05.244Z [DEBUG] raft: vote granted: from=1 term=16 tally=1
2024-06-27T16:59:05.244Z [INFO] raft: election won: term=16 tally=1
2024-06-27T16:59:05.244Z [INFO] raft: entering leader state: leader="Node at localhost:4002 [Leader]"
{"time":"2024-06-27T16:59:05.256401611Z","level":"INFO","msg":"first log applied since node start, log","component":"store","index":3}
2024-06-27T17:01:20.799Z [INFO] raft: starting snapshot up to: index=9734
{"time":"2024-06-27T17:01:20.79966202Z","level":"INFO","msg":"creating new snapshot","component":"snapshot","path":"/app/data/snapshots/snapshots/16-9734-1719507680799.tmp"}
2024-06-27T17:01:21.014Z [ERROR] raft: failed to take snapshot: error="failed to persist snapshot: arrow/ipc: tried to write record batch with different schema"
2024-06-27T17:03:36.078Z [INFO] raft: starting snapshot up to: index=9736
{"time":"2024-06-27T17:03:36.078771019Z","level":"INFO","msg":"creating new snapshot","component":"snapshot","path":"/app/data/snapshots/snapshots/16-9736-1719507816078.tmp"}
2024-06-27T17:03:36.285Z [ERROR] raft: failed to take snapshot: error="failed to persist snapshot: arrow/ipc: tried to write record batch with different schema"
2024-06-27T17:06:43.516Z [INFO] raft: starting snapshot up to: index=9737
{"time":"2024-06-27T17:06:43.516679031Z","level":"INFO","msg":"creating new snapshot","component":"snapshot","path":"/app/data/snapshots/snapshots/16-9737-1719508003516.tmp"}
2024-06-27T17:06:43.729Z [ERROR] raft: failed to take snapshot: error="failed to persist snapshot: arrow/ipc: tried to write record batch with different schema"
2024-06-27T17:10:04.887Z [INFO] raft: starting snapshot up to: index=9739
{"time":"2024-06-27T17:10:04.887940415Z","level":"INFO","msg":"creating new snapshot","component":"snapshot","path":"/app/data/snapshots/snapshots/16-9739-1719508204887.tmp"}
2024-06-27T17:10:05.098Z [ERROR] raft: failed to take snapshot: error="failed to persist snapshot: arrow/ipc: tried to write record batch with different schema"
2024-06-27T17:12:07.690Z [INFO] raft: starting snapshot up to: index=9742
{"time":"2024-06-27T17:12:07.690333977Z","level":"INFO","msg":"creating new snapshot","component":"snapshot","path":"/app/data/snapshots/snapshots/16-9742-1719508327690.tmp"}
2024-06-27T17:12:07.892Z [ERROR] raft: failed to take snapshot: error="failed to persist snapshot: arrow/ipc: tried to write record batch with different schema"
2024-06-27T17:14:21.748Z [INFO] raft: starting snapshot up to: index=9743
{"time":"2024-06-27T17:14:21.749010333Z","level":"INFO","msg":"creating new snapshot","component":"snapshot","path":"/app/data/snapshots/snapshots/16-9743-1719508461748.tmp"}
2024-06-27T17:14:21.955Z [ERROR] raft: failed to take snapshot: error="failed to persist snapshot: arrow/ipc: tried to write record batch with different schema"
The crash still occurs, even after all snapshots have been deleted.
I am investigating this and working on a possible fix. I am also looking into the timeseries numbers issue.
The new storage I mentioned I was working on (based on roaring bitmaps) is ready. So I will clean up and implement a migration path to the new backend. You will not lose any data.
I want the migration to be automatic so you won't need to do anything. I hope to have a release candidate ready sometime next week.
The new storage is a very big upgrade, so you will save a lot on storage costs and resources. It will also ensure we get correct and consistent results.
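For a rough idea of what the roaring-bitmap approach buys: a roaring bitmap stores a set of integer IDs in compressed containers and supports very fast set operations, so queries can be answered by intersecting bitmaps instead of scanning raw events. Below is an illustrative Go sketch using the general-purpose github.com/RoaringBitmap/roaring library; it is not vince's rbf backend, and the predicate names are invented for the example.

package main

import (
	"fmt"

	"github.com/RoaringBitmap/roaring"
)

func main() {
	// Hypothetical bitmaps: each one holds the set of event IDs matching
	// some predicate. This is NOT vince's rbf backend, just an
	// illustration of why bitmap indexes are compact and fast to query.
	visitsFromFR := roaring.BitmapOf(1, 2, 3, 4000, 4001, 100000)
	pageHome := roaring.BitmapOf(2, 3, 9, 4000, 250000)

	// Set intersection answers "visits from FR that landed on /home"
	// without touching the raw events.
	both := roaring.And(visitsFromFR, pageHome)
	fmt.Println("matching events:", both.ToArray())
	fmt.Println("count:", both.GetCardinality())

	// The serialized form is heavily compressed, which is where the
	// storage savings come from.
	buf, err := both.MarshalBinary()
	if err != nil {
		panic(err)
	}
	fmt.Println("serialized size in bytes:", len(buf))
}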
@Ziedelth what is the size of your data/raftdb file?
$ ls -lash data/
total 6,5M
4,0K drwxr-xr-x 3 root root 4,0K juin 30 10:32 .
4,0K drwxrwxr-x 3 ziedelth ziedelth 4,0K mars 13 13:45 ..
6,5M -rw------- 1 root root 8,0M juin 30 10:31 raftdb
4,0K drwxr-xr-x 3 root root 4,0K juin 27 18:59 snapshots
0 -rw------- 1 root root 0 juin 29 02:49 vince-snapshot-228007530
0 -rw------- 1 root root 0 juin 29 10:39 vince-snapshot-2757039809
0 -rw------- 1 root root 0 juin 27 05:52 vince-snapshot-532277653
So I've just had another problem. Following an update with apt (apt update and apt upgrade, because of the SSH security flaw), all my Docker containers restarted, including the vince container. Except that I see the latter has "lost" its data (I don't know if it's permanent or not).
As usual, I look at the site's timeseries and see that it only returns 2 days.
curl --location 'https://vince.shikkanime.fr/api/v1/stats/timeseries?site_id=shikkanime.fr&period=6mo&metrics=visitors%2Cvisits%2Cpageviews%2Cviews_per_visit%2Cbounce_rate%2Cvisit_duration' \
--header 'Authorization: Bearer xxx'
{
"results": [
{
"timestamp": "2024-07-02T00:00:00Z",
"values": {
"bounce_rate": 0,
"pageviews": 1,
"views_per_visit": 0,
"visit_duration": 0,
"visitors": 1,
"visits": 0
}
},
{
"timestamp": "2024-07-01T00:00:00Z",
"values": {
"bounce_rate": 1,
"pageviews": 16,
"views_per_visit": 1.0666666666666667,
"visit_duration": 0,
"visitors": 1,
"visits": 15
}
}
]
}
When I look at the files, I get this:
$ ls -lash data/ data/snapshots/snapshots/
data/:
total 7,7M
4,0K drwxr-xr-x 3 root root 4,0K juil. 2 20:56 .
4,0K drwxrwxr-x 3 ziedelth ziedelth 4,0K mars 13 13:45 ..
7,7M -rw------- 1 root root 8,0M juil. 2 20:53 raftdb
4,0K drwxr-xr-x 3 root root 4,0K juin 27 18:59 snapshots
0 -rw------- 1 root root 0 juin 29 02:49 vince-snapshot-228007530
0 -rw------- 1 root root 0 juin 29 10:39 vince-snapshot-2757039809
0 -rw------- 1 root root 0 juin 27 05:52 vince-snapshot-532277653
data/snapshots/snapshots/:
total 24K
4,0K drwxr-xr-x 6 root root 4,0K juil. 2 20:56 .
4,0K drwxr-xr-x 3 root root 4,0K juin 27 18:59 ..
4,0K drwxr-xr-x 2 root root 4,0K juin 28 14:09 16-11112-1719576543580.tmp
4,0K drwxr-xr-x 2 root root 4,0K juin 29 02:49 17-12131-1719622145756.tmp
4,0K drwxr-xr-x 2 root root 4,0K juin 29 10:39 18-12718-1719650346868.tmp
4,0K drwxr-xr-x 2 root root 4,0K juil. 1 16:09 19-15202-1719842948817.tmp
What's weird is that I've got some data from yesterday, even though I did the reboot today. Is this due to a mistake on my part?
After a few days, I only get data for 1 July, but the database seems to keep growing...
{
"results": [
{
"timestamp": "2024-07-01T00:00:00Z",
"values": {
"bounce_rate": 1,
"pageviews": 16,
"views_per_visit": 4,
"visit_duration": 55.896,
"visitors": 1,
"visits": 4
}
}
]
}
$ ls -lash data/ data/snapshots/snapshots/
data/:
total 8,2M
4,0K drwxr-xr-x 3 root root 4,0K juil. 5 11:24 .
4,0K drwxrwxr-x 3 ziedelth ziedelth 4,0K mars 13 13:45 ..
8,2M -rw------- 1 root root 16M juil. 5 11:08 raftdb
4,0K drwxr-xr-x 3 root root 4,0K juin 27 18:59 snapshots
0 -rw------- 1 root root 0 juin 29 02:49 vince-snapshot-228007530
0 -rw------- 1 root root 0 juin 29 10:39 vince-snapshot-2757039809
0 -rw------- 1 root root 0 juin 27 05:52 vince-snapshot-532277653
data/snapshots/snapshots/:
total 24K
4,0K drwxr-xr-x 6 root root 4,0K juil. 5 11:24 .
4,0K drwxr-xr-x 3 root root 4,0K juin 27 18:59 ..
4,0K drwxr-xr-x 2 root root 4,0K juin 28 14:09 16-11112-1719576543580.tmp
4,0K drwxr-xr-x 2 root root 4,0K juin 29 02:49 17-12131-1719622145756.tmp
4,0K drwxr-xr-x 2 root root 4,0K juin 29 10:39 18-12718-1719650346868.tmp
4,0K drwxr-xr-x 2 root root 4,0K juil. 1 16:09 19-15202-1719842948817.tmp
@Ziedelth please be patient a bit; the good thing is that the database is storing the events. The new storage is almost ready and I have tested that the migration works.
The new storage is here: https://github.com/gernest/rbf
Keep the vince instance running; we will migrate the whole database and you will be able to query the full history.
I'm sorry for being quiet, I have been busy ramping up the work.
Ah! I apologize if my messages seemed urgent, but that's not the case. I thought you hadn't seen them or something.
It's not urgent; I just want to report the bugs I run into while trying to use it.
Take your time
@Ziedelth no worries!
I'm happy that you are reporting these bugs.
I hope we will make vince better with the anticipated release.