drone-volume-cache

fatal error: all goroutines are asleep - deadlock!

Open · tonyffrench opened this issue on Jun 28, 2018 · 7 comments

DRONE_MAX_PROCS is set to 1 on each of the servers, so this should not be caused by multiple jobs running concurrently on the same host.

time="2018-06-28T04:00:25Z" level=info msg="Restoring cache from /cache/xxx/yyy/master.tar"
fatal error: all goroutines are asleep - deadlock!

goroutine 1 [chan receive]:
github.com/drone-plugins/drone-volume-cache/vendor/github.com/drone/drone-cache-lib/cache.restoreCache(0xc4200147c0, 0x1e, 0x5a1d20, 0x683d00, 0x5a1760, 0x683d00, 0x0, 0x0)
	/go/src/github.com/drone-plugins/drone-volume-cache/vendor/github.com/drone/drone-cache-lib/cache/cache.go:65 +0x26d
github.com/drone-plugins/drone-volume-cache/vendor/github.com/drone/drone-cache-lib/cache.Cache.Restore(0x5a1d20, 0x683d00, 0x5a1760, 0x683d00, 0xc4200147c0, 0x1e, 0xc4200147e0, 0x1e, 0xc42001c0b0, 0xa)
	/go/src/github.com/drone-plugins/drone-volume-cache/vendor/github.com/drone/drone-cache-lib/cache/cache.go:36 +0x81
main.(*plugin).Exec(0xc4200ad920, 0x2, 0x2)
	/go/src/github.com/drone-plugins/drone-volume-cache/plugin.go:46 +0x2b7
main.run(0xc42008c160, 0xc42008c160, 0xc420049997)
	/go/src/github.com/drone-plugins/drone-volume-cache/main.go:136 +0x406
github.com/drone-plugins/drone-volume-cache/vendor/github.com/urfave/cli.HandleAction(0x54e000, 0x58c860, 0xc42008c160, 0xc42005c0c0, 0x0)
	/go/src/github.com/drone-plugins/drone-volume-cache/vendor/github.com/urfave/cli/app.go:490 +0xc8
github.com/drone-plugins/drone-volume-cache/vendor/github.com/urfave/cli.(*App).Run(0xc4200881a0, 0xc42000e190, 0x1, 0x1, 0x0, 0x0)
	/go/src/github.com/drone-plugins/drone-volume-cache/vendor/github.com/urfave/cli/app.go:264 +0x5ee
main.main()
	/go/src/github.com/drone-plugins/drone-volume-cache/main.go:90 +0xadc

goroutine 5 [select]:
io.(*pipe).Write(0xc42007e500, 0xc4200b4000, 0x8000, 0x8000, 0x0, 0x0, 0x0)
	/usr/local/go/src/io/pipe.go:87 +0x1e3
io.(*PipeWriter).Write(0xc42000c040, 0xc4200b4000, 0x8000, 0x8000, 0x8000, 0x0, 0x0)
	/usr/local/go/src/io/pipe.go:153 +0x4c
io.copyBuffer(0x5a1140, 0xc42000c040, 0x5a1160, 0xc42000c048, 0xc4200b4000, 0x8000, 0x8000, 0xc420035f18, 0x18, 0xc420035f10)
	/usr/local/go/src/io/io.go:402 +0x240
io.Copy(0x5a1140, 0xc42000c040, 0x5a1160, 0xc42000c048, 0x0, 0xc420035f68, 0x18)

tonyffrench avatar Jun 28 '18 15:06 tonyffrench

It's possible this was caused by a disk space issue. Better error handling and logging would be preferable.
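
For what it's worth, if the restore path always closed both ends of the pipe with the underlying error, a failure would surface as a normal error message instead of a hang. A rough sketch of that idea only, not the plugin's actual code; get and extract are stand-ins for the storage read and the tar extraction:

package cache // sketch only

import "io"

// restore streams the cache file from get into extract through an io.Pipe,
// making sure a failure on either side always unblocks the other side.
func restore(get func(io.Writer) error, extract func(io.Reader) error) error {
    pr, pw := io.Pipe()
    done := make(chan error, 1)

    go func() {
        err := get(pw)
        pw.CloseWithError(err) // reader sees EOF on success, the error otherwise
        done <- err
    }()

    err := extract(pr)
    pr.CloseWithError(err) // any pending write by get returns instead of blocking forever
    if werr := <-done; err == nil {
        err = werr
    }
    return err
}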

tonyffrench avatar Jun 28 '18 16:06 tonyffrench

We ran into this, but our disk is far from full:

Filesystem      Size  Used Avail Use% Mounted on
udev            2.0G     0  2.0G   0% /dev
tmpfs           396M   41M  355M  11% /run
/dev/vda1        78G   38G   40G  49% /

The rest of our caches work fine, but this one repository keeps on crashing when restoring the cache.

mvdan avatar Dec 11 '18 09:12 mvdan

Deleted the tar file from the host machine and restarted the build twice, and the cache restore is working again. Perhaps the tar file had been corrupted somehow.
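
Next time it happens, a quick integrity check like the one below could confirm whether the tar really is corrupt before deleting it. This is just a rough Go sketch, and the path is made up, so point it at the actual cache file on the host:

package main

import (
    "archive/tar"
    "fmt"
    "io"
    "os"
)

func main() {
    // Hypothetical path; adjust to where the cache tar lives on your host.
    f, err := os.Open("/cache/example-repo/master.tar")
    if err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
    defer f.Close()

    tr := tar.NewReader(f)
    for {
        _, err := tr.Next()
        if err == io.EOF {
            fmt.Println("archive reads cleanly")
            return
        }
        if err != nil {
            fmt.Println("bad header:", err)
            return
        }
        if _, err := io.Copy(io.Discard, tr); err != nil {
            fmt.Println("bad entry body:", err)
            return
        }
    }
}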

mvdan avatar Dec 11 '18 09:12 mvdan

We've encountered this bug two more times in the last few weeks. It stalls development, which is a tad annoying. The fact that it only happens sometimes seems to point at a race condition somewhere.

I can't upload the tar files, though, as they contain private code. One possible explanation is that we're using this plugin on a VM, and we run up to two CI jobs at a time. So perhaps two jobs could end up writing to the same tar file on the host filesystem at the same time, producing a corrupt tar file.

Still, that shouldn't explain a deadlock. And the OP is running only one concurrent job at a time.

mvdan avatar Jan 16 '19 11:01 mvdan

But even then it shouldn't result in a deadlock. Are you sure it's not a space issue?

tboerger avatar Jan 16 '19 12:01 tboerger

As said before, not even close:

Filesystem      Size  Used Avail Use% Mounted on
/dev/vda1        78G   32G   46G  42% /

For completeness, here's the panic we got yesterday:

time="2019-01-16T05:25:19Z" level=info msg="Restoring cache from /cache/brankas-scraper.tar"
fatal error: all goroutines are asleep - deadlock!

goroutine 1 [chan receive]:
github.com/drone-plugins/drone-volume-cache/vendor/github.com/drone/drone-cache-lib/cache.restoreCache(0xc420076040, 0x1a, 0x5a1d20, 0x683d00, 0x5a1760, 0x683d00, 0x0, 0x0)
    /go/src/github.com/drone-plugins/drone-volume-cache/vendor/github.com/drone/drone-cache-lib/cache/cache.go:65 +0x26d
github.com/drone-plugins/drone-volume-cache/vendor/github.com/drone/drone-cache-lib/cache.Cache.Restore(0x5a1d20, 0x683d00, 0x5a1760, 0x683d00, 0xc420076040, 0x1a, 0xc420070060, 0x2f, 0xc420074060, 0x10)
    /go/src/github.com/drone-plugins/drone-volume-cache/vendor/github.com/drone/drone-cache-lib/cache/cache.go:36 +0x81
main.(*plugin).Exec(0xc420459920, 0x2, 0x2)
    /go/src/github.com/drone-plugins/drone-volume-cache/plugin.go:46 +0x2b7
main.run(0xc420090160, 0xc420090160, 0xc420049997)
    /go/src/github.com/drone-plugins/drone-volume-cache/main.go:136 +0x406
github.com/drone-plugins/drone-volume-cache/vendor/github.com/urfave/cli.HandleAction(0x54e000, 0x58c860, 0xc420090160, 0xc420088060, 0x0)
    /go/src/github.com/drone-plugins/drone-volume-cache/vendor/github.com/urfave/cli/app.go:490 +0xc8
github.com/drone-plugins/drone-volume-cache/vendor/github.com/urfave/cli.(*App).Run(0xc42008a1a0, 0xc420068170, 0x1, 0x1, 0x0, 0x0)
    /go/src/github.com/drone-plugins/drone-volume-cache/vendor/github.com/urfave/cli/app.go:264 +0x5ee
main.main()
    /go/src/github.com/drone-plugins/drone-volume-cache/main.go:90 +0xadc

goroutine 18 [select]:
io.(*pipe).Write(0xc42007a500, 0xc4200bda00, 0x6600, 0x6600, 0x1a00, 0x0, 0x0)
    /usr/local/go/src/io/pipe.go:87 +0x1e3
io.(*PipeWriter).Write(0xc420078030, 0xc4200bc000, 0x8000, 0x8000, 0x8000, 0x0, 0x0)
    /usr/local/go/src/io/pipe.go:153 +0x4c
io.copyBuffer(0x5a1140, 0xc420078030, 0x5a1160, 0xc420078038, 0xc4200bc000, 0x8000, 0x8000, 0xc420030f18, 0x18, 0xc420030f10)
    /usr/local/go/src/io/io.go:402 +0x240
io.Copy(0x5a1140, 0xc420078030, 0x5a1160, 0xc420078038, 0x0, 0xc420030f68, 0x18)
    /usr/local/go/src/io/io.go:362 +0x5a
main.(*localCache).Get(0x683d00, 0xc420076040, 0x1a, 0x5a1140, 0xc420078030, 0x0, 0x0)
    /go/src/github.com/drone-plugins/drone-volume-cache/local.go:20 +0xa8
github.com/drone-plugins/drone-volume-cache/vendor/github.com/drone/drone-cache-lib/cache.restoreCache.func1(0xc420078030, 0xc4200880c0, 0x5a1d20, 0x683d00, 0xc420076040, 0x1a)
    /go/src/github.com/drone-plugins/drone-volume-cache/vendor/github.com/drone/drone-cache-lib/cache/cache.go:61 +0x88
created by github.com/drone-plugins/drone-volume-cache/vendor/github.com/drone/drone-cache-lib/cache.restoreCache
    /go/src/github.com/drone-plugins/drone-volume-cache/vendor/github.com/drone/drone-cache-lib/cache/cache.go:58 +0x1fb
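
Both traces have the same shape: the main goroutine is blocked on a channel receive in restoreCache, while the goroutine feeding the io.Pipe (localCache.Get doing an io.Copy) is stuck in pipe.Write. That is the picture you would get if the extraction side stopped reading partway, say on a bad tar header, without closing its end of the pipe. A minimal reproduction of that pattern, not the plugin's actual code:

package main

import (
    "io"
    "strings"
)

func main() {
    pr, pw := io.Pipe()
    done := make(chan error)

    // Writer goroutine: streams the "cache file" into the pipe, the way
    // localCache.Get does via io.Copy in the traces above.
    go func() {
        big := strings.NewReader(strings.Repeat("x", 1<<20))
        _, err := io.Copy(pw, big) // blocks as soon as nobody reads
        done <- err                // never reached
    }()

    // Extraction side: reads a little, then gives up without closing pr,
    // so the pipe is never drained again.
    buf := make([]byte, 512)
    _, _ = pr.Read(buf)

    // Main goroutine now waits for the writer, which is itself stuck in
    // pipe.Write: fatal error: all goroutines are asleep - deadlock!
    <-done
}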

mvdan avatar Jan 16 '19 12:01 mvdan

In case it's useful to others, we ended up implementing the caching ourselves from scratch. It took just a tiny Docker image and a few lines of shell:

$ cat images/alpine-tar/Dockerfile 
FROM alpine:3.11.0

# Install GNU tar and zstd; by default only busybox's more limited tar is available.
RUN apk add --no-cache tar zstd

$ cat .drone.yml # a shortened example
kind: pipeline
name: default

volumes:
- name: cache
  host:
    path: /root/drone/build-cache

steps:
- name: restore-cache
  image: YOUR-REGISTRY/alpine-tar:3.11.0
  environment:
    CACHE_FILE: /cache/${DRONE_REPO_NAME}.tar.zst
  commands:
    - mkdir -p .cache/
    - if test -e $CACHE_FILE; then tar -C .cache/ -xf $CACHE_FILE; fi
  volumes:
    - name: cache
      path: /cache

[steps that use the cache at .cache/]

- name: rebuild-cache
  depends_on: [YOUR-STEPS]
  image: YOUR-REGISTRY/alpine-tar:3.11.0
  environment:
    CACHE_FILE: /cache/${DRONE_REPO_NAME}.tar.zst
    TEMP_CACHE_FILE: /cache/${DRONE_REPO_NAME}-tmp-${DRONE_BUILD_NUMBER}.tar.zst
  commands:
    - tar -C .cache/ -caf $TEMP_CACHE_FILE .
    - mv $TEMP_CACHE_FILE $CACHE_FILE # use an atomic write
  volumes:
    - name: cache
      path: /cache

This has several advantages over this plugin, or really any other "volume cache" plugin I could find:

  • It uses a .tar.zst file, so it compresses well while not using much CPU, which seems like a good balance between CPU time and disk speed.
  • Its writes are atomic, so even if multiple rebuilds happen at the same time, one wins without corruption. We've also seen no deadlocks like this issue in almost a year.
  • It's highly configurable by design, since you can change the few lines of shell as you wish.

We've also seen a moderate speed-up when moving from drone-volume-cache to this method, presumably due to the compression. For example, restores went down from ~25s to ~15s on average, while the decompressed size stayed the same.

mvdan avatar Jan 09 '20 01:01 mvdan