Panic on restarting oklog docker container multiple times.
I have a docker-compose file set up like this:
...
  oklog:
    build: ./oklog
    container_name: ${OKLOG_CONTAINER_NAME}
    command: ingeststore -store.segment-replication-factor 1
    ports:
      - 7651:7651
      - 7650:7650
      - 7653:7653
      - 7659:7659
    volumes:
      - ${OSX_LOG_DIR}:/data
  logspout:
    image: gliderlabs/logspout
    container_name: ${LOGSPOUT_CONTAINER_NAME}
    command: raw+tcp://${OKLOG_CONTAINER_NAME}:7651
    depends_on:
      - oklog
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    links:
      - oklog
After several docker-compose up and docker-compose down cycles (the number of ups and downs needed doesn't seem to be consistent), the oklog container panics with:
$ docker logs -f oklog
ts=2018-02-02T19:16:00.101587594Z level=info cluster_bind=0.0.0.0:7659
ts=2018-02-02T19:16:00.110157207Z level=info user_bind_host=0.0.0.0 user_advertise_host= calculated_advertise_ip=172.18.0.2
ts=2018-02-02T19:16:00.112425772Z level=info fast=tcp://0.0.0.0:7651
ts=2018-02-02T19:16:00.112505032Z level=info durable=tcp://0.0.0.0:7652
ts=2018-02-02T19:16:00.112891239Z level=info bulk=tcp://0.0.0.0:7653
ts=2018-02-02T19:16:00.113107131Z level=info API=tcp://0.0.0.0:7650
ts=2018-02-02T19:16:00.161997474Z level=info ingest_path=data/ingest
ts=2018-02-02T19:16:07.109384344Z level=info store_path=data/store
panic: close data/store/bcd73f75-18ce-415e-ad9b-bf4025fd7755.active: file already closed
goroutine 47 [running]:
github.com/oklog/oklog/pkg/store.mergeRecordsToLog.func1(0xc420455b38)
/Users/adnaan/code/gopath/src/github.com/oklog/oklog/pkg/store/read.go:150 +0x65
github.com/oklog/oklog/pkg/store.mergeRecordsToLog(0xd6a200, 0xc42007a410, 0x8000000, 0xc42011d180, 0x34, 0x34, 0x100001a0, 0xd5e780, 0xc420166040)
/Users/adnaan/code/gopath/src/github.com/oklog/oklog/pkg/store/read.go:239 +0xa9e
github.com/oklog/oklog/pkg/store.(*Compacter).compact(0xc420068b40, 0xa4e1e9, 0xb, 0xc42003edf0, 0x0, 0x0, 0x0)
/Users/adnaan/code/gopath/src/github.com/oklog/oklog/pkg/store/compact.go:126 +0x2f7
github.com/oklog/oklog/pkg/store.(*Compacter).Run.func1()
/Users/adnaan/code/gopath/src/github.com/oklog/oklog/pkg/store/compact.go:52 +0x71
github.com/oklog/oklog/pkg/store.(*Compacter).Run(0xc420068b40)
/Users/adnaan/code/gopath/src/github.com/oklog/oklog/pkg/store/compact.go:62 +0x2f5
main.runIngestStore.func14(0x0, 0x0)
/Users/adnaan/code/gopath/src/github.com/oklog/oklog/cmd/oklog/ingeststore.go:441 +0x2a
github.com/oklog/oklog/vendor/github.com/oklog/run.(*Group).Run.func1(0xc420068ba0, 0xc420040ec0, 0xc420040ef0)
/Users/adnaan/code/gopath/src/github.com/oklog/oklog/vendor/github.com/oklog/run/group.go:38 +0x27
created by github.com/oklog/oklog/vendor/github.com/oklog/run.(*Group).Run
/Users/adnaan/code/gopath/src/github.com/oklog/oklog/vendor/github.com/oklog/run/group.go:37 +0xa8
I ran into this issue with the 0.3.0 release and have rebuilt oklog from master to check whether it's already fixed there.
Thank you.
I'd like to help with this, but I haven't been able to reproduce it again. Any intuitions about what the problem could be?
It looks like the operative panic is here:
// Per-segment state.
writeSegment, err := dst.Create()
if err != nil {
	return n, err
}
defer func() {
	// Don't leak active segments.
	if writeSegment != nil {
		if deleteErr := writeSegment.Delete(); deleteErr != nil {
			panic(deleteErr) // <-------------
		}
	}
}()
In the production code path, the abstract WriteSegment interface is implemented by concrete fileWriteSegment values; here's the code that creates them:
func (fl *fileLog) Create() (WriteSegment, error) {
	filename := filepath.Join(fl.root, fmt.Sprintf("%s%s", uuid.New(), extActive))
	f, err := fl.filesys.Create(filename)
	if err != nil {
		return nil, err
	}
	return &fileWriteSegment{fl.filesys, f}, nil
}
And for completeness, here's the full definition and method set of that type:
type fileWriteSegment struct {
	fs fs.Filesystem
	f  fs.File
}

func (w fileWriteSegment) Write(p []byte) (int, error) {
	return w.f.Write(p)
}

// Close the segment and make it available for query.
func (w fileWriteSegment) Close(low, high ulid.ULID) error {
	if err := w.f.Close(); err != nil {
		return err
	}
	oldname := w.f.Name()
	oldpath := filepath.Dir(oldname)
	newname := filepath.Join(oldpath, fmt.Sprintf("%s-%s%s", low.String(), high.String(), extFlushed))
	if w.fs.Exists(newname) {
		return errors.Errorf("file %s already exists", newname)
	}
	return w.fs.Rename(oldname, newname)
}

// Delete the segment.
func (w fileWriteSegment) Delete() error {
	if err := w.f.Close(); err != nil {
		return err
	}
	return w.fs.Remove(w.f.Name())
}
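Note that both Close and Delete close the underlying file, so if the segment has already been closed by one path and the deferred cleanup then calls Delete, the second close is what surfaces as the "file already closed" panic. A minimal standalone illustration of that error with a plain os.File (just stdlib behaviour, not the oklog code path; the filename is made up):

package main

import (
	"errors"
	"fmt"
	"os"
)

func main() {
	// "example.active" is a made-up name, purely for illustration.
	f, err := os.Create("example.active")
	if err != nil {
		panic(err)
	}

	// First close succeeds.
	if err := f.Close(); err != nil {
		panic(err)
	}

	// Second close fails: this is the same "file already closed" error
	// that Delete's w.f.Close() would return, and that the deferred
	// cleanup then panics with.
	err = f.Close()
	fmt.Println(err)                          // close example.active: file already closed
	fmt.Println(errors.Is(err, os.ErrClosed)) // true

	os.Remove("example.active")
}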
The task is to figure out how the underlying fs.File gets closed in between the Create and the Delete. There may be a bug in the function itself, but I'd put the probability of that relatively low. I suspect instead that there is some interference on the filesystem between the compactor code and the store query code. If that's true, then the correct way to deal with this error is for the query code to fail more gracefully.
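As a rough sketch of what failing more gracefully could look like in that deferred cleanup (untested; it assumes the real filesystem surfaces the standard os.ErrClosed on a double close, and that the errors and os imports are added):

defer func() {
	// Don't leak active segments, but don't panic if the underlying
	// file was already closed elsewhere; in that case the segment has
	// presumably been flushed or removed by another code path.
	// Sketch only, not a tested fix.
	if writeSegment != nil {
		if deleteErr := writeSegment.Delete(); deleteErr != nil && !errors.Is(deleteErr, os.ErrClosed) {
			panic(deleteErr)
		}
	}
}()

That only downgrades the double-close case; whichever path closed the file first would still need to be responsible for flushing or removing the segment.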
Hopefully this is something to start with!
This seems to be a heisenbug. But now I am repeatedly seeing another panic with the same setup:
ts=2018-02-22T14:07:07.839162625Z level=info cluster_bind=0.0.0.0:7659
ts=2018-02-22T14:07:07.850407666Z level=info fast=tcp://0.0.0.0:7651
ts=2018-02-22T14:07:07.850480233Z level=info durable=tcp://0.0.0.0:7652
ts=2018-02-22T14:07:07.850518041Z level=info bulk=tcp://0.0.0.0:7653
ts=2018-02-22T14:07:07.850546563Z level=info API=tcp://0.0.0.0:7650
ts=2018-02-22T14:07:07.851144625Z level=info ingest_path=data/ingest
ts=2018-02-22T14:07:07.851653699Z level=info store_path=data/store
panic: ulid: bad data size when unmarshaling
goroutine 44 [running]:
github.com/oklog/ulid.MustParse(0xc4202413cb, 0x8, 0x0, 0x0)
/Users/xla/dev/src/github.com/oklog/ulid/ulid.go:98 +0x88
github.com/oklog/oklog/pkg/store.(*fileLog).Overlapping(0xc420109d40, 0x40, 0x8bc93b, 0xc4203d6bc8, 0xc4203d6bc0, 0xc42012bb90)
/Users/xla/dev/src/github.com/oklog/oklog/pkg/store/file_log.go:156 +0x70b
github.com/oklog/oklog/pkg/store.(Log).Overlapping-fm(0xe00000040, 0xa93648, 0xc4203745a0, 0xa5af24, 0xb)
/Users/xla/dev/src/github.com/oklog/oklog/pkg/store/compact.go:48 +0x2f
github.com/oklog/oklog/pkg/store.(*Compacter).compact(0xc4203745a0, 0xa5af24, 0xb, 0xc4203d6df8, 0x0, 0x0, 0x0)
/Users/xla/dev/src/github.com/oklog/oklog/pkg/store/compact.go:86 +0xf0
github.com/oklog/oklog/pkg/store.(*Compacter).Run.func1()
/Users/xla/dev/src/github.com/oklog/oklog/pkg/store/compact.go:48 +0x7a
github.com/oklog/oklog/pkg/store.(*Compacter).Run(0xc4203745a0)
/Users/xla/dev/src/github.com/oklog/oklog/pkg/store/compact.go:58 +0x2b1
main.runIngestStore.func14(0x0, 0x0)
/Users/xla/dev/src/github.com/oklog/oklog/cmd/oklog/ingeststore.go:429 +0x2a
github.com/oklog/oklog/pkg/group.(*Group).Run.func1(0xc420374660, 0xc4202e91d0, 0xc4202e91e0)
/Users/xla/dev/src/github.com/oklog/oklog/pkg/group/group.go:38 +0x27
created by github.com/oklog/oklog/pkg/group.(*Group).Run
/Users/xla/dev/src/github.com/oklog/oklog/pkg/group/group.go:39 +0xa8
Will create a separate issue for this after investigation.
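For reference, that panic message is the ulid package's ErrDataSize, which MustParse raises whenever the string it's handed isn't a full 26-character ULID (the 0x8 length in the trace suggests an eight-character string reached it). A tiny standalone illustration of just the ulid API, not the oklog code path:

package main

import (
	"fmt"

	"github.com/oklog/ulid"
)

func main() {
	// A valid ULID string is exactly 26 characters of Crockford base32.
	id := ulid.MustParse("01ARZ3NDEKTSV4RRFFQ69G5FAV")
	fmt.Println(id)

	// Any other length makes Parse return ErrDataSize, and MustParse
	// turns that into the panic seen above:
	//   panic: ulid: bad data size when unmarshaling
	_, err := ulid.Parse("too-short")
	fmt.Println(err) // ulid: bad data size when unmarshaling
}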