
Left-over temp() files (previously #1732)


This is the same as https://github.com/snakemake/snakemake/issues/1732, which is marked as draft. Since my colleague who wrote the original issue is on vacation and I don't have permission to update it, I created this one to replace it. We will delete https://github.com/snakemake/snakemake/issues/1732 as soon as he is back.

Snakemake version

The bug appeared in the 7.x series. The bugfix for temp files introduced in version 7.3 (and further fixed in 7.3.5) has not resolved all our issues; the bug is still present as of 7.8.2.

Describe the bug

Not all files tagged as temp() are cleaned up. Here is what we observed while rummaging through logs and scratch directories.

Logs

This is the log from the minimal example below:

Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cluster nodes: 1
Job stats:
job             count    min threads    max threads
------------  -------  -------------  -------------
all                 1              1              1
consume_data        1              1              1
create_data         1              1              1
total               3              1              1

Select jobs to execute...
[Thu Jun 30 17:00:26 2022]

group job a (jobs in lexicogr. order):

    [Thu Jun 30 17:00:26 2022]
    rule consume_data:
        input: files/used_output
        output: files/done.txt
        jobid: 1
        reason: Missing output files: files/done.txt; Input files updated by another job: files/used_output
        resources: mem_mb=<TBD>, disk_mb=<TBD>, tmpdir=/tmp


    [Thu Jun 30 17:00:26 2022]
    rule create_data:
        output: files/used_output, files/unused_output
        jobid: 2
        reason: Missing output files: files/used_output
        resources: mem_mb=1000, disk_mb=1000, tmpdir=/tmp

Submitted group job 14f0bb70-acb7-57cb-9029-97367090779e with external jobid 'Job <223861310> is submitted to queue <normal.4h>.'.
[Thu Jun 30 17:00:56 2022]
Finished job 2.
[Thu Jun 30 17:00:56 2022]
Finished job 1.
2 of 3 steps (67%) done
Removing temporary output files/used_output.
Select jobs to execute...

[Thu Jun 30 17:00:56 2022]
localrule all:
    input: files/done.txt
    jobid: 0
    reason: Input files updated by another job: files/done.txt
    resources: mem_mb=1000, disk_mb=1000, tmpdir=/tmp

[Thu Jun 30 17:00:56 2022]
Finished job 0.
3 of 3 steps (100%) done
Complete log: .snakemake/log/2022-06-30T170026.229927.snakemake.log

Minimal example

The following example creates a temporary output file, files/unused_output, which is not used anywhere else in the workflow. Running the workflow locally removes this file after completion.

Running it using snakemake --cluster "bsub" -j 1 does not clean up this file.

Note: Setting a group and having at least 2 jobs in that group are required to trigger this error.

rule all:
    input:
        "files/done.txt",

rule create_data:
    output:
        used_output = temporary("files/used_output"),
        unused_output = temporary("files/unused_output"),  
    group: "a"
    shell:
        """
        echo rule a > {output.used_output}
        echo rule a > {output.unused_output}
        """

rule consume_data:
    input:
        froma = "files/used_output"
    group: "a"
    output:
        done="files/done.txt"
    shell:
        """
        touch {output.done}
        """

uweschmitt avatar Jun 30 '22 15:06 uweschmitt

Further observations when debugging the minimal example from above:

  • In cluster mode the file is not cleaned up, since it never appears in files at https://github.com/snakemake/snakemake/blob/main/snakemake/dag.py#L688, the part that cleans temp input files.
  • In non-cluster mode it is cleaned up in the "clean temp output files" section starting at https://github.com/snakemake/snakemake/blob/main/snakemake/dag.py#L690 (a toy model of both passes follows below).
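
For orientation, here is a toy model of those two passes (plain Python with made-up names; this is not Snakemake's actual code or API):

# Toy model of the two cleanup passes referenced above. All names are made
# up for illustration; this is not Snakemake's actual code or API.

temp_outputs = {"files/used_output", "files/unused_output"}  # create_data's temp() outputs
temp_inputs = {"files/used_output"}                          # consume_data's temp() input

# "Clean temp input files" (around dag.py#L688): candidates are temp files
# that appear as an *input* of a job. files/unused_output is not an input
# of any job, so this pass never considers it; in cluster mode this is the
# pass that effectively runs.
print("input pass deletes:", temp_inputs)

# "Clean temp output files" (around dag.py#L690): candidates are temp
# *outputs*; in non-cluster mode this pass removes files/unused_output.
print("output pass deletes:", temp_outputs - temp_inputs)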

uweschmitt avatar Jul 04 '22 13:07 uweschmitt

@uweschmitt, I just found this thread and realized it's related to #1765, a PR I just drafted involving group jobs and temp files. I'm not sure what causes the bug here, but I'm wondering if you think they intersect at all. Another related issue is #1474, which seems to report a similar problem.

pvandyken avatar Jul 11 '22 21:07 pvandyken

@pvandyken thanks! I checked out your changes and the minimal example above now cleans up files as expected.

uweschmitt avatar Jul 14 '22 08:07 uweschmitt

Cool, I'm actually surprised, because my PR doesn't adjust the deletion of temp files; it just tells Snakemake to (sometimes) not worry about temp files that were already deleted. But if it works, it works.

pvandyken avatar Jul 14 '22 15:07 pvandyken

@uweschmitt: This also enables us to move some of the files that we currently store on the cluster-wide scratch (and that Snakemake accidentally leaves behind) back to the node-local temp.

DrYak avatar Jul 17 '22 07:07 DrYak

@uweschmitt I could not reproduce your finding that #1765 (unintentionally) fixes the issue here. It also did not remove files/unused_output when running the minimal example on the cluster.

But I think I found the reason...

The finish method of the dag.DAG class receives a job argument that could either be a single job or a job group: https://github.com/snakemake/snakemake/blob/c07732e2c7d6e95fabb3f2570bf0a1b0af2ac2cb/snakemake/dag.py#L1550-L1561

It's only a job group when run on the cluster, otherwise the individual jobs are passed to this method separately.

A few lines further down in finish, it defines a local variable named jobs that is either just the job group (if the job is a group) or a list containing just the single job: https://github.com/snakemake/snakemake/blob/c07732e2c7d6e95fabb3f2570bf0a1b0af2ac2cb/snakemake/dag.py#L1568-L1571

It then iterates over that list/iterable and uses job as the loop variable: https://github.com/snakemake/snakemake/blob/c07732e2c7d6e95fabb3f2570bf0a1b0af2ac2cb/snakemake/dag.py#L1595-L1599

So after the loop, the local job variable is always a single job. If the original job that was passed in was a group, the local variable now points to the last job from that group, and temporary files are only handled for that one job, forgetting about the others: https://github.com/snakemake/snakemake/blob/c07732e2c7d6e95fabb3f2570bf0a1b0af2ac2cb/snakemake/dag.py#L1619
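
To make the shadowing concrete, here is a minimal standalone demonstration (plain Python; the job names are made-up stand-ins, and none of this is the actual Snakemake source):

# Standalone demonstration of the shadowing described above. The job names
# are stand-ins; nothing here is Snakemake's actual code.

def finish(job):
    # Mirror DAG.finish: `job` is either a group (here: a list) or a single job.
    jobs = job if isinstance(job, list) else [job]

    for job in jobs:   # the loop variable reuses the name `job` ...
        pass           # ... per-job bookkeeping happens here

    # ... so after the loop, `job` is just the last group member:
    print("handle_temp would run for:", job)

finish(["create_data", "consume_data"])
# -> handle_temp would run for: consume_data
# create_data's temp() outputs (e.g. files/unused_output) are never handled.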

I think it should simply call handle_temp for all jobs from the original group, like so:

for job in jobs:
    self.handle_temp(job)

(It also wouldn't hurt to avoid reusing job; in other places the code usually names the loop variable j.)

That certainly fixes the minimal example, but it also seems like the right thing to do in general. (I'm very new to this code base, so maybe take that with a grain of salt.)

CC @DrYak

john-hen avatar Jul 22 '22 15:07 john-hen

@john-hen nice find, that makes much more sense. Note that self.handle_temp is able to handle group jobs properly, so if job were never reused, the original job variable could be passed in cleanly (see the sketch below). But there are at least two loops where job is overwritten, and one of them also uses j (see the dependencies comprehension), so your solution would be the easiest.
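
For illustration, that alternative could look roughly like the following paraphrased sketch (not the actual source; is_group() and the surrounding structure are taken from the descriptions above):

# Paraphrased sketch of the alternative fix: preserve the original argument
# before the loop reuses the name `job`. Not the actual Snakemake source.

def finish(self, job, update_dynamic=True):
    original_job = job  # may be a single job or a group job
    jobs = job if job.is_group() else [job]

    for job in jobs:
        ...  # per-job bookkeeping, as in the existing code

    # handle_temp can deal with group jobs itself, so this covers the
    # temp() files of every group member:
    self.handle_temp(original_job)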

pvandyken avatar Jul 22 '22 16:07 pvandyken

@pvandyken You're right, I forgot about that, thanks for pointing that out. I'll create a PR on Monday (still have to familiarize myself with the contribution guidelines), but will go with whatever makes more sense. Maybe passing the original job to handle_temp is cleaner, I'll decide then.

john-hen avatar Jul 22 '22 16:07 john-hen