Docker container crash
Hi there! I've just seen that the Docker container has crashed. Here is the error I get when I run docker logs:
{"time":"2024-06-27T03:43:28.210460272Z","level":"INFO","msg":"creating new snapshot","component":"snapshot","path":"/app/data/snapshots/snapshots/13-9206-1719459808210.tmp"}
2024-06-27T03:43:28.404Z [ERROR] raft: failed to take snapshot: error="failed to persist snapshot: arrow/ipc: tried to write record batch with different schema"
2024-06-27T03:46:14.454Z [INFO] raft: starting snapshot up to: index=9210
{"time":"2024-06-27T03:46:14.454634112Z","level":"INFO","msg":"creating new snapshot","component":"snapshot","path":"/app/data/snapshots/snapshots/13-9210-1719459974454.tmp"}
2024-06-27T03:46:14.646Z [ERROR] raft: failed to take snapshot: error="failed to persist snapshot: arrow/ipc: tried to write record batch with different schema"
2024-06-27T03:48:25.736Z [INFO] raft: starting snapshot up to: index=9212
{"time":"2024-06-27T03:48:25.736896922Z","level":"INFO","msg":"creating new snapshot","component":"snapshot","path":"/app/data/snapshots/snapshots/13-9212-1719460105736.tmp"}
2024-06-27T03:48:25.906Z [ERROR] raft: failed to take snapshot: error="failed to persist snapshot: arrow/ipc: tried to write record batch with different schema"
2024-06-27T03:52:12.032Z [INFO] raft: starting snapshot up to: index=9219
{"time":"2024-06-27T03:52:12.032933328Z","level":"INFO","msg":"creating new snapshot","component":"snapshot","path":"/app/data/snapshots/snapshots/13-9219-1719460332032.tmp"}
{"time":"2024-06-27T03:52:12.085446794Z","level":"INFO","msg":"exiting compaction loop","component":"lsm-tree"}
panic: arrow/array: field 15 has 18410 rows. want=18412
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x50 pc=0xec2544]
goroutine 105 [running]:
github.com/vinceanalytics/vince/internal/lsm.(*PartStore).Compact.func1()
/home/runner/work/vince/vince/internal/lsm/part.go:114 +0xa4
panic({0x18a4600?, 0xc01402a550?})
/opt/hostedtoolcache/go/1.22.0/x64/src/runtime/panic.go:770 +0x132
github.com/apache/arrow/go/v15/arrow/array.(*RecordBuilder).NewRecord(0xc007530180)
/home/runner/go/pkg/mod/github.com/apache/arrow/go/[email protected]/arrow/array/record.go:349 +0x28e
github.com/vinceanalytics/vince/internal/staples.(*Merger).NewRecord(0xc0075565e8?, {{0x0, 0x0, 0x0}, {0x0, 0x0, 0x0}})
/home/runner/work/vince/vince/internal/staples/arrow.go:176 +0x67
github.com/vinceanalytics/vince/internal/lsm.(*PartStore).Compact(0xc0075301c0, 0xc00d579e58)
/home/runner/work/vince/vince/internal/lsm/part.go:137 +0x145
github.com/vinceanalytics/vince/internal/lsm.(*Tree).Compact(0xc0000ea100, 0x0)
/home/runner/work/vince/vince/internal/lsm/lsm.go:268 +0xd6
github.com/vinceanalytics/vince/internal/lsm.(*Tree).Start(0xc0000ea100, {0x1d4d048, 0xc007530c80})
/home/runner/work/vince/vince/internal/lsm/lsm.go:240 +0x288
created by github.com/vinceanalytics/vince/internal/session.(*Session).Start in goroutine 1
/home/runner/work/vince/vince/internal/session/session.go:139 +0x85
{"time":"2024-06-27T11:54:59.758964586Z","level":"INFO","msg":"Opening store","component":"store","nodeId":"1","listening":"localhost:4002"}
{"time":"2024-06-27T11:54:59.773558502Z","level":"INFO","msg":"Setup log info","component":"store","firstIdxOnOpen":1,"lastIdxOnOpen":9219,"lastAppliedIdxOnOpen":0,"lastCommandIdxOnOpen":0}
{"time":"2024-06-27T11:54:59.782291961Z","level":"WARN","msg":"found temporary snapshot","component":"snapshot","name":"13-9219-1719460332032.tmp"}
2024-06-27T11:55:00.400Z [INFO] raft: initial configuration: index=1 servers="[{Suffrage:Voter ID:1 Address:localhost:4002}]"
{"time":"2024-06-27T11:55:00.40124848Z","level":"INFO","msg":"starting server","addr":":8080"}
Another crash:
{"time":"2024-06-27T15:32:27.471176779Z","level":"INFO","msg":"creating new snapshot","component":"snapshot","path":"/app/data/snapshots/snapshots/14-9651-1719502347471.tmp"}
2024-06-27T15:32:27.674Z [ERROR] raft: failed to take snapshot: error="failed to persist snapshot: arrow/ipc: tried to write record batch with different schema"
2024-06-27T15:35:00.319Z [INFO] raft: starting snapshot up to: index=9657
{"time":"2024-06-27T15:35:00.320042783Z","level":"INFO","msg":"creating new snapshot","component":"snapshot","path":"/app/data/snapshots/snapshots/14-9657-1719502500320.tmp"}
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x50 pc=0xec2544]
goroutine 111 [running]:
github.com/vinceanalytics/vince/internal/lsm.(*PartStore).Compact.func1()
/home/runner/work/vince/vince/internal/lsm/part.go:114 +0xa4
github.com/vinceanalytics/vince/internal/lsm.(*PartStore).Compact(0xc007544200, 0xc0071dba10)
/home/runner/work/vince/vince/internal/lsm/part.go:146 +0x2e5
github.com/vinceanalytics/vince/internal/lsm.(*Tree).Compact(0xc007176300, 0xc00d745008)
/home/runner/work/vince/vince/internal/lsm/lsm.go:268 +0xd6
github.com/vinceanalytics/vince/internal/session.(*Session).Persist(0xc0072fa0f0, 0xc00d745008)
/home/runner/work/vince/vince/internal/session/session.go:101 +0x30
github.com/vinceanalytics/vince/internal/cluster/snapshots.(*Arrow).Backup(0xc00d64f0b0, {0x76bb44ea0820, 0xc00009ba40})
/home/runner/work/vince/vince/internal/cluster/snapshots/lsm.go:38 +0x13e
github.com/vinceanalytics/vince/internal/cluster/snapshots.(*Arrow).Persist(0xc00d640570?, {0x1d4d390?, 0xc00009ba40?})
/home/runner/work/vince/vince/internal/cluster/snapshots/lsm.go:27 +0x3e
github.com/hashicorp/raft.(*Raft).takeSnapshot(0xc00714c2c8)
/home/runner/go/pkg/mod/github.com/hashicorp/[email protected]/snapshot.go:190 +0x8c8
github.com/hashicorp/raft.(*Raft).runSnapshots(0xc00714c2c8)
/home/runner/go/pkg/mod/github.com/hashicorp/[email protected]/snapshot.go:82 +0x3ee
github.com/hashicorp/raft.(*raftState).goFunc.func1()
/home/runner/go/pkg/mod/github.com/hashicorp/[email protected]/state.go:149 +0x56
created by github.com/hashicorp/raft.(*raftState).goFunc in goroutine 1
/home/runner/go/pkg/mod/github.com/hashicorp/[email protected]/state.go:147 +0x79
I don't know if it's the same error as previously or not.
Hi, it seems there was a bug during compaction that caused things to spiral:
{"time":"2024-06-27T03:52:12.085446794Z","level":"INFO","msg":"exiting compaction loop","component":"lsm-tree"} panic: arrow/array: field 15 has 18410 rows. want=18412
This message signals that. It probably means a corrupt snapshot was saved. One remedy is to manually delete the offending snapshot: deleting /app/data/snapshots/snapshots/13-9212-1719460105736.tmp might help.
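To illustrate what that panic means: Arrow's RecordBuilder requires every column builder to hold the same number of rows before NewRecord is called, and it panics as soon as one column falls behind. The following is a minimal standalone Go sketch against the same apache/arrow/go/v15 API (not vince's code; the column names are made up) that trips the same class of panic:

package main

import (
	"fmt"

	"github.com/apache/arrow/go/v15/arrow"
	"github.com/apache/arrow/go/v15/arrow/array"
	"github.com/apache/arrow/go/v15/arrow/memory"
)

func main() {
	// Two-column schema with invented names, purely for illustration.
	schema := arrow.NewSchema([]arrow.Field{
		{Name: "ts", Type: arrow.PrimitiveTypes.Int64},
		{Name: "page", Type: arrow.BinaryTypes.String},
	}, nil)

	b := array.NewRecordBuilder(memory.NewGoAllocator(), schema)
	defer b.Release()

	// Append two rows to the first column ...
	b.Field(0).(*array.Int64Builder).AppendValues([]int64{1, 2}, nil)
	// ... but only one row to the second column.
	b.Field(1).(*array.StringBuilder).Append("/home")

	// NewRecord checks that every field builder holds the same number of
	// rows and panics otherwise, e.g. "arrow/array: field 1 has 1 rows.
	// want=2" -- the same class of failure as the
	// "field 15 has 18410 rows. want=18412" panic in the log above.
	rec := b.NewRecord()
	defer rec.Release()
	fmt.Println(rec)
}

Judging from the stack trace, the row-count mismatch happens while merging parts during compaction, which is presumably how a half-written snapshot ended up on disk and why the later snapshot attempts keep failing with the schema error.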
I am still working on a better storage engine for vince using roaring bitmaps. All of these issues will go away with it.
I deleted all the snapshots, but the problem seems to come back quite quickly.
{"time":"2024-06-27T16:57:46.228615155Z","level":"INFO","msg":"creating new snapshot","component":"snapshot","path":"/app/data/snapshots/snapshots/15-9731-1719507466228.tmp"}
2024-06-27T16:57:46.434Z [ERROR] raft: failed to take snapshot: error="failed to persist snapshot: arrow/ipc: tried to write record batch with different schema"
{"time":"2024-06-27T16:59:02.739344349Z","level":"INFO","msg":"Exiting flushing loop","component":"session"}
{"time":"2024-06-27T16:59:02.739401442Z","level":"INFO","msg":"exiting compaction loop","component":"lsm-tree"}
{"time":"2024-06-27T16:59:02.800452698Z","level":"INFO","msg":"store closed ","component":"store","nodeId":"1","listen_address":"localhost:4002"}
{"time":"2024-06-27T16:59:02.800512641Z","level":"ERROR","msg":"Exited process","err":"http: Server closed"}
{"time":"2024-06-27T16:59:03.599825893Z","level":"INFO","msg":"Opening store","component":"store","nodeId":"1","listening":"localhost:4002"}
{"time":"2024-06-27T16:59:03.608718775Z","level":"INFO","msg":"Setup log info","component":"store","firstIdxOnOpen":1,"lastIdxOnOpen":9731,"lastAppliedIdxOnOpen":0,"lastCommandIdxOnOpen":0}
2024-06-27T16:59:03.724Z [INFO] raft: initial configuration: index=1 servers="[{Suffrage:Voter ID:1 Address:localhost:4002}]"
2024-06-27T16:59:03.724Z [INFO] raft: entering follower state: follower="Node at localhost:4002 [Follower]" leader-address= leader-id=
{"time":"2024-06-27T16:59:03.725361508Z","level":"INFO","msg":"starting server","addr":":8080"}
{"time":"2024-06-27T16:59:03.725644953Z","level":"INFO","msg":"Start compaction loop","component":"lsm-tree","interval":"10m0s","compactSize":"16MiB"}
{"time":"2024-06-27T16:59:03.725798869Z","level":"INFO","msg":"Starting session flushing loop","component":"session","interval":"1m0s"}
{"time":"2024-06-27T16:59:03.725952546Z","level":"INFO","msg":"Starting grpc server","addr":"127.0.0.1:4002"}
2024-06-27T16:59:05.217Z [WARN] raft: heartbeat timeout reached, starting election: last-leader-addr= last-leader-id=
2024-06-27T16:59:05.217Z [INFO] raft: entering candidate state: node="Node at localhost:4002 [Candidate]" term=16
2024-06-27T16:59:05.226Z [DEBUG] raft: voting for self: term=16 id=1
2024-06-27T16:59:05.244Z [DEBUG] raft: calculated votes needed: needed=1 term=16
2024-06-27T16:59:05.244Z [DEBUG] raft: vote granted: from=1 term=16 tally=1
2024-06-27T16:59:05.244Z [INFO] raft: election won: term=16 tally=1
2024-06-27T16:59:05.244Z [INFO] raft: entering leader state: leader="Node at localhost:4002 [Leader]"
{"time":"2024-06-27T16:59:05.256401611Z","level":"INFO","msg":"first log applied since node start, log","component":"store","index":3}
2024-06-27T17:01:20.799Z [INFO] raft: starting snapshot up to: index=9734
{"time":"2024-06-27T17:01:20.79966202Z","level":"INFO","msg":"creating new snapshot","component":"snapshot","path":"/app/data/snapshots/snapshots/16-9734-1719507680799.tmp"}
2024-06-27T17:01:21.014Z [ERROR] raft: failed to take snapshot: error="failed to persist snapshot: arrow/ipc: tried to write record batch with different schema"
2024-06-27T17:03:36.078Z [INFO] raft: starting snapshot up to: index=9736
{"time":"2024-06-27T17:03:36.078771019Z","level":"INFO","msg":"creating new snapshot","component":"snapshot","path":"/app/data/snapshots/snapshots/16-9736-1719507816078.tmp"}
2024-06-27T17:03:36.285Z [ERROR] raft: failed to take snapshot: error="failed to persist snapshot: arrow/ipc: tried to write record batch with different schema"
2024-06-27T17:06:43.516Z [INFO] raft: starting snapshot up to: index=9737
{"time":"2024-06-27T17:06:43.516679031Z","level":"INFO","msg":"creating new snapshot","component":"snapshot","path":"/app/data/snapshots/snapshots/16-9737-1719508003516.tmp"}
2024-06-27T17:06:43.729Z [ERROR] raft: failed to take snapshot: error="failed to persist snapshot: arrow/ipc: tried to write record batch with different schema"
2024-06-27T17:10:04.887Z [INFO] raft: starting snapshot up to: index=9739
{"time":"2024-06-27T17:10:04.887940415Z","level":"INFO","msg":"creating new snapshot","component":"snapshot","path":"/app/data/snapshots/snapshots/16-9739-1719508204887.tmp"}
2024-06-27T17:10:05.098Z [ERROR] raft: failed to take snapshot: error="failed to persist snapshot: arrow/ipc: tried to write record batch with different schema"
2024-06-27T17:12:07.690Z [INFO] raft: starting snapshot up to: index=9742
{"time":"2024-06-27T17:12:07.690333977Z","level":"INFO","msg":"creating new snapshot","component":"snapshot","path":"/app/data/snapshots/snapshots/16-9742-1719508327690.tmp"}
2024-06-27T17:12:07.892Z [ERROR] raft: failed to take snapshot: error="failed to persist snapshot: arrow/ipc: tried to write record batch with different schema"
2024-06-27T17:14:21.748Z [INFO] raft: starting snapshot up to: index=9743
{"time":"2024-06-27T17:14:21.749010333Z","level":"INFO","msg":"creating new snapshot","component":"snapshot","path":"/app/data/snapshots/snapshots/16-9743-1719508461748.tmp"}
2024-06-27T17:14:21.955Z [ERROR] raft: failed to take snapshot: error="failed to persist snapshot: arrow/ipc: tried to write record batch with different schema"
The crash still occurs, even after all snapshots have been deleted.
I am investigating this and working on a possible fix. I am also looking into the timeseries numbers issue.
The new storage I mentioned I was working on (based on roaring bitmaps) is ready. So I will clean up and implement a migration path to the new backend. You will not lose any data.
I want the migration to be automatic so you won't need to do anything. I hope to have a release candidate ready sometime next week.
The new storage is a very big upgrade, so you will save a lot on storage costs and resources. It will also ensure we get correct and consistent results.
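For a rough idea of what the roaring-bitmap approach buys: a roaring bitmap stores a set of integer IDs in compressed containers and supports very fast set operations, so queries can be answered by intersecting bitmaps instead of scanning raw events. Below is an illustrative Go sketch using the general-purpose github.com/RoaringBitmap/roaring library; it is not vince's rbf backend, and the predicate names are invented for the example.

package main

import (
	"fmt"

	"github.com/RoaringBitmap/roaring"
)

func main() {
	// Hypothetical bitmaps: each one holds the set of event IDs matching
	// some predicate. This is NOT vince's rbf backend, just an
	// illustration of why bitmap indexes are compact and fast to query.
	visitsFromFR := roaring.BitmapOf(1, 2, 3, 4000, 4001, 100000)
	pageHome := roaring.BitmapOf(2, 3, 9, 4000, 250000)

	// Set intersection answers "visits from FR that landed on /home"
	// without touching the raw events.
	both := roaring.And(visitsFromFR, pageHome)
	fmt.Println("matching events:", both.ToArray())
	fmt.Println("count:", both.GetCardinality())

	// The serialized form is heavily compressed, which is where the
	// storage savings come from.
	buf, err := both.MarshalBinary()
	if err != nil {
		panic(err)
	}
	fmt.Println("serialized size in bytes:", len(buf))
}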
@Ziedelth what is the size of your data/raftdb file?
$ ls -lash data/
total 6,5M
4,0K drwxr-xr-x 3 root root 4,0K juin 30 10:32 .
4,0K drwxrwxr-x 3 ziedelth ziedelth 4,0K mars 13 13:45 ..
6,5M -rw------- 1 root root 8,0M juin 30 10:31 raftdb
4,0K drwxr-xr-x 3 root root 4,0K juin 27 18:59 snapshots
0 -rw------- 1 root root 0 juin 29 02:49 vince-snapshot-228007530
0 -rw------- 1 root root 0 juin 29 10:39 vince-snapshot-2757039809
0 -rw------- 1 root root 0 juin 27 05:52 vince-snapshot-532277653
So I've just had another problem. Following an update with apt (apt update and apt upgrade, because of the SSH security flaw), all my Docker containers restarted, including the vince container. Except that I see the latter has "lost" its data (I don't know if it's permanent or not).
As usual, I look at the site's timeseries and see that it only returns 2 days.
curl --location 'https://vince.shikkanime.fr/api/v1/stats/timeseries?site_id=shikkanime.fr&period=6mo&metrics=visitors%2Cvisits%2Cpageviews%2Cviews_per_visit%2Cbounce_rate%2Cvisit_duration' \
--header 'Authorization: Bearer xxx'
{
"results": [
{
"timestamp": "2024-07-02T00:00:00Z",
"values": {
"bounce_rate": 0,
"pageviews": 1,
"views_per_visit": 0,
"visit_duration": 0,
"visitors": 1,
"visits": 0
}
},
{
"timestamp": "2024-07-01T00:00:00Z",
"values": {
"bounce_rate": 1,
"pageviews": 16,
"views_per_visit": 1.0666666666666667,
"visit_duration": 0,
"visitors": 1,
"visits": 15
}
}
]
}
When I look at the files, I get this:
$ ls -lash data/ data/snapshots/snapshots/
data/:
total 7,7M
4,0K drwxr-xr-x 3 root root 4,0K juil. 2 20:56 .
4,0K drwxrwxr-x 3 ziedelth ziedelth 4,0K mars 13 13:45 ..
7,7M -rw------- 1 root root 8,0M juil. 2 20:53 raftdb
4,0K drwxr-xr-x 3 root root 4,0K juin 27 18:59 snapshots
0 -rw------- 1 root root 0 juin 29 02:49 vince-snapshot-228007530
0 -rw------- 1 root root 0 juin 29 10:39 vince-snapshot-2757039809
0 -rw------- 1 root root 0 juin 27 05:52 vince-snapshot-532277653
data/snapshots/snapshots/:
total 24K
4,0K drwxr-xr-x 6 root root 4,0K juil. 2 20:56 .
4,0K drwxr-xr-x 3 root root 4,0K juin 27 18:59 ..
4,0K drwxr-xr-x 2 root root 4,0K juin 28 14:09 16-11112-1719576543580.tmp
4,0K drwxr-xr-x 2 root root 4,0K juin 29 02:49 17-12131-1719622145756.tmp
4,0K drwxr-xr-x 2 root root 4,0K juin 29 10:39 18-12718-1719650346868.tmp
4,0K drwxr-xr-x 2 root root 4,0K juil. 1 16:09 19-15202-1719842948817.tmp
What's weird is that I've got some data from yesterday, even though I did the reboot today. Is this due to a mistake on my part?
After a few days, I only get data for 1 July, but the database seems to keep growing...
{
"results": [
{
"timestamp": "2024-07-01T00:00:00Z",
"values": {
"bounce_rate": 1,
"pageviews": 16,
"views_per_visit": 4,
"visit_duration": 55.896,
"visitors": 1,
"visits": 4
}
}
]
}
$ ls -lash data/ data/snapshots/snapshots/
data/:
total 8,2M
4,0K drwxr-xr-x 3 root root 4,0K juil. 5 11:24 .
4,0K drwxrwxr-x 3 ziedelth ziedelth 4,0K mars 13 13:45 ..
8,2M -rw------- 1 root root 16M juil. 5 11:08 raftdb
4,0K drwxr-xr-x 3 root root 4,0K juin 27 18:59 snapshots
0 -rw------- 1 root root 0 juin 29 02:49 vince-snapshot-228007530
0 -rw------- 1 root root 0 juin 29 10:39 vince-snapshot-2757039809
0 -rw------- 1 root root 0 juin 27 05:52 vince-snapshot-532277653
data/snapshots/snapshots/:
total 24K
4,0K drwxr-xr-x 6 root root 4,0K juil. 5 11:24 .
4,0K drwxr-xr-x 3 root root 4,0K juin 27 18:59 ..
4,0K drwxr-xr-x 2 root root 4,0K juin 28 14:09 16-11112-1719576543580.tmp
4,0K drwxr-xr-x 2 root root 4,0K juin 29 02:49 17-12131-1719622145756.tmp
4,0K drwxr-xr-x 2 root root 4,0K juin 29 10:39 18-12718-1719650346868.tmp
4,0K drwxr-xr-x 2 root root 4,0K juil. 1 16:09 19-15202-1719842948817.tmp
@Ziedelth please be patient a bit; the good thing is that the database is storing the events. The new storage is almost ready and I have tested that the migration works.
The new storage is here: https://github.com/gernest/rbf
Keep the vince instance running; we will migrate the whole database and you will be able to query the full history.
I'm sorry for being quiet, I have been busy ramping up the work.
Ah! I apologize if my messages seemed urgent, but that's not the case. I thought you hadn't seen them or something.
It's not urgent; I just want to report the bugs I run into while trying to use it.
Take your time
@Ziedelth no worries!
I'm happy that you are reporting these bugs.
I hope we will make vince better with the anticipated release.