
Spill-to-disk directory not cleaned up after "no space left on device" failure

philrz opened this issue 3 years ago

Repro is with zq commit af6d47e.

To make it easy to run out of disk space, I create a small RAM disk on my Linux VM and use it as my $ZAR_ROOT.

$ sudo mount -t tmpfs -o size=50m tmpfs /tmp/logs
$ export ZAR_ROOT="/tmp/logs"

Having cloned the zq-sample-data, I attempt a zar import that exceeds the amount of disk space available.

$ zq zng/*.gz tzng/*.gz zeek-default/* | zar import -s 25MB -
archive.Import: run failed: write /tmp/logs/zd/20180324/d-1iVjcvpBQKNNXc7yHBCQqhqs4qz.zng: no space left on device

After the failure, the spill-to-disk directory is still left behind.

$ du -sh /tmp/zq-spill-596995875/
503M	/tmp/zq-spill-596995875/

philrz avatar Oct 06 '20 15:10 philrz

@philrz: With #4134 and #4135, this should be fixed for both aggregations and sorts, but I'll leave closing it to you.

nwt avatar Oct 13 '22 19:10 nwt

This one has been a challenge to verify. For starters, the original repro cannot be repeated because spill-to-disk no longer occurs during a zed load (the modern equivalent of what was once zar import).

Turning to situations such as sorts and aggregations where spill-to-disk can still occur, the tests in which cleanup succeeds with commits that include the fixes from #4134 and #4135 also showed successful cleanup at commit 1e563dd, which came right before those fixes. Likewise, tests in which spill directories were left behind before the fixes still show leftover spill directories after the fixes. For an example of the latter, see #4171, which I just opened.

That said, I'm sure there were some genuine improvements from #4134 and #4135, and I did find at least one test that revealed a difference. I started with a Zed lake loaded with the test data from s3://brim-sampledata/fdns/fdns-a.zng.gz.

zed create -orderby timestamp:asc fdns
zed use fdns@main
zed load fdns-a.zng.gz

With a zed serve running in one window, I then start this command in another window to watch for the appearance of a spill directory:

while true; do date; ls -l ${TMPDIR}zed-spill-*; sleep 1; done

Then in yet another window, I start an aggregation that will spill.

zed query 'count() by value'

As shown in this first video, at Zed commit 1e563dd, which comes right before the fixes, hitting Ctrl-C on the zed query shortly after the spill directory appears causes zed serve to show the following panic and exit nearly every time.

panic: context canceled

goroutine 129 [running]:
github.com/brimdata/zed/runtime/op/groupby.(*Aggregator).Consume(0xc000133c20, {0x1d70520?, 0xc0006d57e8?}, 0xc00053b400?)
	/Users/phil/work/zed/runtime/op/groupby/groupby.go:333 +0x990
github.com/brimdata/zed/runtime/op/groupby.(*Proc).run(0xc0000d6b40)
	/Users/phil/work/zed/runtime/op/groupby/groupby.go:210 +0x24c
created by github.com/brimdata/zed/runtime/op/groupby.(*Proc).Pull.func1
	/Users/phil/work/zed/runtime/op/groupby/groupby.go:163 +0x5a

https://user-images.githubusercontent.com/5934157/199379753-e8383d85-28ec-468a-9d9c-e7f1591cf096.mp4

Meanwhile, with Zed commit c756440, the first commit that includes the fixes from both #4134 and #4135, the panic is no longer triggered.

https://user-images.githubusercontent.com/5934157/199379997-5c6bc8ef-1511-4cb8-b6a2-e4e1ac0353fb.mp4

At this point I consider that progress and am closing this issue. In addition to #4171 and the existing issue #3782, I'll continue opening others as I find specific repro examples.

philrz avatar Nov 02 '22 02:11 philrz