db-benchmark

Clean up `/tmp` directory

Open sl-solution opened this issue 2 years ago • 13 comments

I notice that on-disk solutions may create large temporary files during their runs but may not clean them up afterward (e.g. polars creates .ipc files). This may cause the undefined exception error for other solutions when they run within the same session.

sl-solution avatar May 25 '23 09:05 sl-solution

Today I ran a benchmark for DuckDB: https://youtu.be/zVR77B2bDR0. The tmp files should be cleaned up after the process completes.

hkpeaks avatar May 30 '23 14:05 hkpeaks

Can you provide reproducible steps for when an undefined exception is caused by a temporary file from a different solution (in the same session)?

Tmonster avatar May 31 '23 11:05 Tmonster

> Can you provide reproducible steps for when an undefined exception is caused by a temporary file from a different solution (in the same session)?

I found this when I was using the _utils/repro.sh script to reproduce results for smaller data sets on a computer with limited hard disk space. I noted that after some point all solutions failed to produce any result, and with a little investigation I figured out that the hard drive was full (due to temporary files created during the benchmark run). I would imagine that for large data sets the /tmp directory would be bloated by huge files.
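
For illustration only (this is not code from the benchmark repository): a launcher could check free disk space before starting each solution, so that a full disk shows up as a clear message rather than as unrelated failures. The function name `has_free_space` and the 10 GiB threshold are assumptions made for this sketch.

```python
import shutil
import tempfile


def has_free_space(path, min_free_bytes):
    """Return True if the filesystem holding `path` has at least min_free_bytes free."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (in bytes)
    return usage.free >= min_free_bytes


if __name__ == "__main__":
    tmp = tempfile.gettempdir()
    # Hypothetical threshold: require 10 GiB free before launching a solution.
    if not has_free_space(tmp, 10 * 1024**3):
        print(f"Less than 10 GiB free under {tmp}; a run may fail for lack of disk space.")
```

A check like this only reports the problem; it does not remove any files.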

sl-solution avatar May 31 '23 11:05 sl-solution

I can confirm that disk space was never a concern and scripts generally won't be handling this kind of exception.

jangorecki avatar May 31 '23 15:05 jangorecki

I actually noticed this issue too when getting the benchmark back up and running, but I never had the issue where another solution encountered an undefined exception.

Tmonster avatar Jun 07 '23 08:06 Tmonster

@sl-solution If you still believe this would be a problem, feel free to open a PR to automatically clean the /tmp directory after every run.

Tmonster avatar Jun 07 '23 08:06 Tmonster

> @sl-solution If you still believe this would be a problem, feel free to open a PR to automatically clean the /tmp directory after every run.

In Juliads I made sure it is done automatically. However, I am not sure deleting everything from /tmp is a good idea, since some of the files may be essential for other system processes.

sl-solution avatar Jun 07 '23 08:06 sl-solution

I wouldn't delete everything from /tmp of course, but for R solutions it would be everything in tempdir(). Potentially all R solutions could use the same location for tempdir(), and then it could be cleaned up when the benchmarking ends.

Tmonster avatar Jun 07 '23 08:06 Tmonster

I guess for polars it should be straightforward, since it uses an absolute path and a constant name for temporary files.
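
As a rough sketch of that kind of targeted cleanup (not code from the benchmark repository): when a solution's temporary files follow a known naming pattern, only those files need to be removed and the rest of the directory can be left alone. The helper name `remove_matching` and the `*.ipc` pattern are assumptions here; the actual path and file name used by the polars script are not given in this thread.

```python
from pathlib import Path


def remove_matching(directory, pattern="*.ipc"):
    """Delete regular files in `directory` whose names match `pattern`.

    Returns the sorted names of the removed files. Subdirectories and
    non-matching files are left untouched.
    """
    removed = []
    for path in Path(directory).glob(pattern):
        if path.is_file():
            path.unlink()
            removed.append(path.name)
    return sorted(removed)
```

Because it deletes only by pattern, this does not touch files that other system processes keep in the same directory, as long as the pattern is specific enough.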

sl-solution avatar Jun 07 '23 08:06 sl-solution

I think sorting a billion rows requires the use of temporary files. I have coded billion-row jointable/filter/groupby using only 32GB RAM; in fact, it is confirmed that no temp file is needed.

hkpeaks avatar Jun 07 '23 09:06 hkpeaks

I think a systematic way to solve the issue is to assign a directory for temporary files, and ask every solution to use solely the assigned directory for on-disk calculations. The launcher can clean the directory after each run.
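
A minimal sketch of this idea, assuming the launcher is written in Python and that each solution honours the `TMPDIR` environment variable (Python's `tempfile` and R's `tempdir()` do; a solution that hard-codes `/tmp` would need changing). The function name `run_with_scratch_dir` and the `bench-tmp-` prefix are hypothetical and not part of the benchmark scripts.

```python
import os
import shutil
import subprocess
import sys
import tempfile


def run_with_scratch_dir(cmd, parent=None):
    """Run `cmd` with a private temporary directory and delete it afterwards.

    `parent` is where the scratch directory is created (default: the system
    temp location). Returns the exit code of the command.
    """
    scratch = tempfile.mkdtemp(prefix="bench-tmp-", dir=parent)
    env = dict(os.environ)
    env["TMPDIR"] = scratch  # the child process is expected to write temp files here
    try:
        return subprocess.run(cmd, env=env, check=False).returncode
    finally:
        # Clean up even if the solution crashed or left files behind.
        shutil.rmtree(scratch, ignore_errors=True)


if __name__ == "__main__":
    # Stand-in for a solution script: writes a leftover file into TMPDIR.
    child = (
        "import os, pathlib; "
        "pathlib.Path(os.environ['TMPDIR'], 'leftover.ipc').write_text('data')"
    )
    print("exit code:", run_with_scratch_dir([sys.executable, "-c", child]))
```

Since only the assigned directory is removed, nothing else under /tmp is affected.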

sl-solution avatar Jun 10 '23 08:06 sl-solution

Since the new machine has more memory and instance storage, this has become less of an issue. Can this therefore be closed?

Tmonster avatar Nov 09 '23 09:11 Tmonster

> Since the new machine has more memory, and instance storage, this has become less of an issue. Can this therefore be closed?

I guess as long as solutions keep using temp files, this will be an issue.

sl-solution avatar Nov 09 '23 11:11 sl-solution