db-benchmark
Clean-up `/tmp` directory
I noticed that on-disk solutions can create large temporary files during their runs but may not clean them up afterward (e.g. polars creates .ipc files). This can cause an undefined exception error for other solutions that run within the same session.
Today I ran a benchmark for DuckDB: https://youtu.be/zVR77B2bDR0 The temporary files should be cleaned up after the process completes.
Can you provide reproducible steps for when an undefined exception is caused by a temporary file from a different solution (in the same session)?
I found this when I was using the _utils/repro.sh script to reproduce the results for smaller data sets on a computer with limited hard disk space. I noticed that after some point all solutions failed to produce any result, and with a little investigation I figured out that the hard drive was full (due to temporary files created during the benchmark run). I would imagine that for large data sets the /tmp directory would be bloated by huge files.
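One way to turn this into reproducible steps is to measure how much the temporary directory grows across a single solution's run. The following is only a minimal sketch: `dir_size_bytes` is a hypothetical helper, not part of the benchmark scripts, and it assumes the solution writes under the directory returned by `tempfile.gettempdir()`.

```python
import os
import tempfile


def dir_size_bytes(path):
    """Return the total size in bytes of all regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                # lstat avoids following symlinks out of the directory
                total += os.lstat(os.path.join(root, name)).st_size
            except OSError:
                # A file may disappear while we walk; skip it
                pass
    return total


def tmp_usage_bytes():
    """Size of the current temp directory (assumed to be where solutions write)."""
    return dir_size_bytes(tempfile.gettempdir())
```

Calling `tmp_usage_bytes()` before and after a solution runs and comparing the two numbers shows how much that solution left behind.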
I can confirm that disk space was never a concern and scripts generally won't be handling this kind of exception.
I noticed this issue too actually when getting the benchmark back up and running. I never had the issue where another solution encountered an undefined exception.
@sl-solution If you still believe this would be a problem, feel free to open a PR to automatically clean the /tmp directory after every run.
In Juliads I made sure it is done automatically. However, I am not sure deleting everything from /tmp is a good idea, since some of the files may be essential for other system processes.
I wouldn't delete everything from /tmp of course, but for R solutions it would be everything in tempdir(). Potentially all R solutions could use the same location for tempdir(), and then it could be cleaned up when the benchmarking ends.
I guess for polars it should be straightforward, since it uses an absolute path and a constant name for its temporary files.
I think sorting a billion rows requires the use of temporary files. I have coded billion-row jointable/filter/groupby using only 32 GB of RAM, and I have verified that it does not need any temporary files.
I think a systematic way to solve the issue is to assign a directory for temporary files, and ask every solution to use solely the assigned directory for on-disk calculations. The launcher can clean the directory after each run.
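A minimal sketch of that idea, assuming each solution is started as a child process and honors the `TMPDIR` environment variable (or `TMP`/`TEMP`); solutions that hard-code a path would not be covered. `run_with_scratch` is a hypothetical name, not something in the current launcher.

```python
import os
import shutil
import subprocess
import sys
import tempfile


def run_with_scratch(cmd, base_dir=None):
    """Run cmd with its own temporary directory and delete it afterwards.

    cmd is an argument list as accepted by subprocess.run.
    base_dir is where the scratch directory is created (system default if None).
    """
    scratch = tempfile.mkdtemp(prefix="bench-", dir=base_dir)
    # Point the usual temp-dir variables at the scratch directory.
    # Assumption: the solution respects at least one of them.
    env = dict(os.environ, TMPDIR=scratch, TMP=scratch, TEMP=scratch)
    try:
        return subprocess.run(cmd, env=env).returncode
    finally:
        # Clean up even if the solution crashed or was interrupted
        shutil.rmtree(scratch, ignore_errors=True)
```

For example, `run_with_scratch([sys.executable, "some_solution.py"])` (with a made-up script name) would run that script and remove whatever it wrote to its assigned directory, without touching anything else in /tmp.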
Since the new machine has more memory, and instance storage, this has become less of an issue. Can this therefore be closed?
I guess as long as solutions keep using temp files, this will be an issue.