Left-over temp() files (previously #1732)
This is the same as https://github.com/snakemake/snakemake/issues/1732, which is marked as draft. Since my colleague who wrote the original issue is on vacation and I have no permission to update it, I created this one to replace it. We will delete https://github.com/snakemake/snakemake/issues/1732 as soon as he is back.
Snakemake version
The bug appeared in the 7.x series. The bugfix for temp files introduced in version 7.3 (and further fixed in 7.3.5) has not resolved all our issues. The bug is still present as of 7.8.2.
Describe the bug
Not all files tagged as temp() are cleaned up. This is what we observed while rummaging through logs and scratch directories.
Logs
This is the log from the minimal example below:
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cluster nodes: 1
Job stats:
job count min threads max threads
------------ ------- ------------- -------------
all 1 1 1
consume_data 1 1 1
create_data 1 1 1
total 3 1 1
Select jobs to execute...
[Thu Jun 30 17:00:26 2022]
group job a (jobs in lexicogr. order):
[Thu Jun 30 17:00:26 2022]
rule consume_data:
input: files/used_output
output: files/done.txt
jobid: 1
reason: Missing output files: files/done.txt; Input files updated by another job: files/used_output
resources: mem_mb=<TBD>, disk_mb=<TBD>, tmpdir=/tmp
[Thu Jun 30 17:00:26 2022]
rule create_data:
output: files/used_output, files/unused_output
jobid: 2
reason: Missing output files: files/used_output
resources: mem_mb=1000, disk_mb=1000, tmpdir=/tmp
Submitted group job 14f0bb70-acb7-57cb-9029-97367090779e with external jobid 'Job <223861310> is submitted to queue <normal.4h>.'.
[Thu Jun 30 17:00:56 2022]
Finished job 2.
[Thu Jun 30 17:00:56 2022]
Finished job 1.
2 of 3 steps (67%) done
Removing temporary output files/used_output.
Select jobs to execute...
[Thu Jun 30 17:00:56 2022]
localrule all:
input: files/done.txt
jobid: 0
reason: Input files updated by another job: files/done.txt
resources: mem_mb=1000, disk_mb=1000, tmpdir=/tmp
[Thu Jun 30 17:00:56 2022]
Finished job 0.
3 of 3 steps (100%) done
Complete log: .snakemake/log/2022-06-30T170026.229927.snakemake.log
Minimal example
The following example creates a temporary output file files/unused_output which is not used anywhere else in the workflow.
Running the workflow locally removes this file after completion.
Running it using snakemake --cluster "bsub" -j 1 does not clean up this file (see the invocation sketch after the Snakefile below).
Note:
Setting group and having at least 2 jobs in this group are required to trigger this error.
rule all:
    input:
        "files/done.txt",


rule create_data:
    output:
        used_output = temporary("files/used_output"),
        unused_output = temporary("files/unused_output"),
    group: "a"
    shell:
        """
        echo rule a > {output.used_output}
        echo rule a > {output.unused_output}
        """


rule consume_data:
    input:
        froma = "files/used_output",
    group: "a"
    output:
        done = "files/done.txt",
    shell:
        """
        touch {output.done}
        """
Further observations when debugging the minimal example from above:
- In cluster mode the file is not cleaned up, since it does not appear in files in https://github.com/snakemake/snakemake/blob/main/snakemake/dag.py#L688; this is the part that cleans temp input files.
- In non-cluster mode it is cleaned up in the "clean temp output files" section starting at https://github.com/snakemake/snakemake/blob/main/snakemake/dag.py#L690 (see the toy sketch below).
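To illustrate why only one of those two code paths can ever catch the file, here is a self-contained toy (this is not the actual dag.py code, just a sketch of the two passes as described above): a temp file that is nobody's input, like files/unused_output, can only be picked up by the "clean temp output files" pass.

# Toy model of the two cleanup passes described above (illustrative only,
# not the real snakemake/dag.py code).
jobs = {
    "create_data": {"input": set(), "output": {"files/used_output", "files/unused_output"}},
    "consume_data": {"input": {"files/used_output"}, "output": {"files/done.txt"}},
}
temp_files = {"files/used_output", "files/unused_output"}

# "clean temp input files": only files that appear as an *input* of some job
# are considered here -- files/unused_output never shows up.
temp_inputs = {f for j in jobs.values() for f in j["input"] if f in temp_files}

# "clean temp output files": temp outputs that no job needs as input.
needed = {f for j in jobs.values() for f in j["input"]}
temp_unneeded_outputs = {
    f for j in jobs.values() for f in j["output"]
    if f in temp_files and f not in needed
}

print(temp_inputs)            # {'files/used_output'}
print(temp_unneeded_outputs)  # {'files/unused_output'} -- the file left behind in cluster mode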
@uweschmitt, just found this thread and realized it's related to a PR I just drafted involving group jobs and temp files #1765. I'm not sure what causes the bug here, but I'm wondering if you think they intersect at all. Another related issue is #1474, which seems to report a similar problem.
@pvandyken thanks! I checked out your changes and the minimal example above now cleans up files as expected.
Cool, I'm actually surprised, because my PR doesn't adjust the deletion of temp files at all; it just tells Snakemake to (sometimes) not worry about temp files that have already been deleted. But if it works, it works.
@uweschmitt: This also enables us to move some of the files that we currently store on the cluster-wide scratch, and that Snakemake accidentally leaves behind, back to the node-local temp.
@uweschmitt I could not reproduce your finding that #1765 (unintentionally) fixes the issue here. It also did not remove files/unused_output when running the minimal example on the cluster.
But I think I found the reason...
The finish method of the dag.DAG class receives a job argument that could either be a single job or a job group:
https://github.com/snakemake/snakemake/blob/c07732e2c7d6e95fabb3f2570bf0a1b0af2ac2cb/snakemake/dag.py#L1550-L1561
It's only a job group when run on the cluster, otherwise the individual jobs are passed to this method separately.
A few lines further down in finish, it defines a local variable named jobs that is either just the job group (if the job is a group) or a list containing just the single job:
https://github.com/snakemake/snakemake/blob/c07732e2c7d6e95fabb3f2570bf0a1b0af2ac2cb/snakemake/dag.py#L1568-L1571
It then iterates over that list/iterable and uses job as the loop variable:
https://github.com/snakemake/snakemake/blob/c07732e2c7d6e95fabb3f2570bf0a1b0af2ac2cb/snakemake/dag.py#L1595-L1599
So now the local job variable is always a single job. If the original job that was passed in was a group, the local variable now points to the last job from that group. And it only handles temporary files for that one job, forgetting about the others:
https://github.com/snakemake/snakemake/blob/c07732e2c7d6e95fabb3f2570bf0a1b0af2ac2cb/snakemake/dag.py#L1619
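To make the mechanism concrete, here is a self-contained toy (illustrative only, not the real finish method): because the parameter job is shadowed by the loop variable, whatever runs after the loop only ever sees the last member of the group.

# Illustrative toy, not the real dag.py code: the parameter `job` is shadowed
# by the loop variable, so code after the loop only sees the last group member.
def finish_sketch(job):
    jobs = job if isinstance(job, list) else [job]  # group -> its member jobs
    for job in jobs:                                # reuses the name `job`
        pass                                        # per-job bookkeeping
    return job                                      # now only the *last* member

print(finish_sketch("single_job"))                     # -> single_job
print(finish_sketch(["create_data", "consume_data"]))  # -> consume_data only;
# temp files of create_data would never be handled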
I think it should simply call handle_temp for all jobs from the original group, like so:
for job in jobs:
    self.handle_temp(job)
(Would also not hurt to avoid reusing job. In other places the code usually calls the loop variable j.)
That certainly fixes the minimal example, but it also seems like the right thing to do in general. (I'm very new to this code base, so maybe take that with a grain of salt.)
CC @DrYak
@john-hen nice find, that makes much more sense. Note that self.handle_temp is able to properly handle group jobs, so if job were never reused, the original job variable could be passed in cleanly. But there are at least two loops where job is overwritten, and one of them also uses j (see the dependencies comprehension), so your solution would be the easiest.
@pvandyken You're right, I forgot about that, thanks for pointing that out. I'll create a PR on Monday (still have to familiarize myself with the contribution guidelines), but will go with whatever makes more sense. Maybe passing the original job to handle_temp is cleaner, I'll decide then.