cylc-flow job polling broken for failed jobs after restart

TL;DR:

Failed tasks can be polled back to incorrect states on restart.

Bug:

After a restart Cylc updates task proxies with the owner@host pair of submitted/running jobs to allow polling:

https://github.com/cylc/cylc-flow/blob/5ef44194a3e28f21944ae894c52f0c523d590f94/lib/cylc/task_pool.py#L361-L369

This, however, excludes succeeded and failed tasks. Consequently, after a restart, remote tasks in those states do not have their owner@host loaded from the DB, which causes polling to run locally.

Polling will most likely fail but could also produce unexpected results (particularly for the case of background jobs).
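To illustrate the mechanism described above, here is a minimal Python sketch. It is a simplified model, not the actual cylc-flow code: the function names, the dict-based task representation, the `user@remote-host` value and the fallback to `localhost` are all assumptions made for illustration.

```python
# Hypothetical, simplified model of the restart logic (not cylc-flow code).
ACTIVE_STATUSES = {"submitted", "running"}


def restore_user_at_host(tasks, db_rows):
    """Copy owner@host from DB rows onto tasks, but only for active tasks."""
    for task in tasks:
        if task["status"] in ACTIVE_STATUSES:
            task["user_at_host"] = db_rows.get(task["id"])


def poll_target(task):
    """Where a poll would run: the stored owner@host, else the local host."""
    return task.get("user_at_host") or "localhost"


# Both jobs ran remotely before the restart (assumed example data).
tasks = [
    {"id": "a.1", "status": "failed"},
    {"id": "b.1", "status": "running"},
]
db_rows = {"a.1": "user@remote-host", "b.1": "user@remote-host"}

restore_user_at_host(tasks, db_rows)
targets = {task["id"]: poll_target(task) for task in tasks}
print(targets)
```

In this model the failed task is skipped by the status filter, so its poll is directed at the local host even though its job ran remotely.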

This may be related to #1792 which extended polling to succeeded / failed tasks but didn't extend the owner@host update logic:

https://github.com/cylc/cylc-flow/pull/2396/files#diff-1f1aa9b850f9d1655a22322beb0e2d0604fb816b3bc807210120547f1a35ae24

When this is combined with a task that fails by hitting its execution time limit on a remote batch system (one that cannot be polled locally), the task is polled back to running.

Reproducible Example:

[scheduling]
    [[dependencies]]
        graph = """
            a
            a:fail => restart
        """

[runtime]
    [[a]]
        script = """
            sleep 60
        """
        [[[remote]]]
            host = <host>
        [[[job]]]
            execution time limit = PT1S
            batch system = pbs
    [[restart]]
        script = """
            cylc stop "${CYLC_SUITE_NAME}" --now --now
            sleep 5
            cylc restart "${CYLC_SUITE_NAME}" --host=localhost
        """

Log Snippet (post-restart):

LOADING task proxies
+ a.1 failed
+ restart.1 running
LOADING task action timers
+ a.1 [[u'job-logs-retrieve', u'failed'], 1]
+ a.1 [u'try_timers', u'retrying']
+ a.1 [u'try_timers', u'submit-retrying']
+ restart.1 poll_timer
+ restart.1 [u'try_timers', u'retrying']
+ restart.1 [u'try_timers', u'submit-retrying']
[a.1] status=failed: (polled)succeeded at 2021-11-16T10:15:19Z for job(01)           <= ERROR
[restart.1] status=running: (polled)succeeded at 2021-11-16T10:17:12Z for job(01)

Pull requests welcome! This is an Open Source project - please consider contributing a bug fix yourself (please read CONTRIBUTING.md before starting any work though).

Nov 16 '21 10:11 oliver-sanders

I can't test this with Cylc 8 at the moment, however, I expect the bug will likely be present there too.

Nov 16 '21 10:11 oliver-sanders

The solution is presumably to update the owner@host for succeeded and failed tasks. We will need to check the logic to ensure this doesn't produce any unexpected side effects in other parts of the code, e.g. host selection.
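As a rough sketch of that idea, assuming a simplified dict-based task model rather than the real cylc-flow data structures, the status filter could be widened so that finished tasks also get their owner@host restored. All names below are hypothetical, and this is not the fix that was eventually merged.

```python
# Hypothetical sketch of the proposed change (not cylc-flow code).
POLLABLE_STATUSES = {"submitted", "running", "succeeded", "failed"}


def restore_user_at_host(tasks, db_rows, statuses=POLLABLE_STATUSES):
    """Restore owner@host for every task whose job may still be polled."""
    for task in tasks:
        if task["status"] in statuses and task["id"] in db_rows:
            task["user_at_host"] = db_rows[task["id"]]
    return tasks


# Assumed example data: one failed remote job, one task not yet submitted.
restored = restore_user_at_host(
    [
        {"id": "a.1", "status": "failed"},
        {"id": "c.1", "status": "waiting"},
    ],
    {"a.1": "user@remote-host"},
)
print(restored)
```

Tasks that have not been submitted are left untouched, which is where side effects on host selection would need checking.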

Nov 16 '21 14:11 oliver-sanders

Cylc 8 issue - https://github.com/cylc/cylc-flow/issues/4513

Nov 29 '21 14:11 oliver-sanders

Cylc 8 issue - #4513

#4513 is really a different issue (polling doing the wrong thing). This issue is about polling happening on the wrong platform.

I've confirmed that this remains an issue at Cylc 8. I've had 2 recent reports of this problem so I think we really need to get it fixed in both Cylc 7 & 8.

Jul 07 '22 16:07 dpmatthews

Closed by #5016

Sep 14 '22 10:09 oliver-sanders
