luigi icon indicating copy to clipboard operation
luigi copied to clipboard

Completed tasks should not be removed from the cache

Open BlackIsTheNewBlack opened this issue 4 years ago • 6 comments

Hello, This is in continuation of #2925
Completed tasks should most probably not get garbage collected, as that destroys their precious state, as well the assumption that a task with the same significant parameters is the same object. The complete() method can potentially be quite expensive to compute, anda completed task may also have valuable information which was computed and stored as state. TaskProcess.run checks the completion of deps right before running the task, so a new empty instance is always created right after termination, and forcing complete to be evaluated again from scratch. If the completion persistence check is done using the network, that is simply doubles requests. My recommendation is to keep completed tasks until the end of all tasks. That should improve performance significantly and simplify the programming model.

BlackIsTheNewBlack avatar Apr 24 '20 17:04 BlackIsTheNewBlack

I believe one of the principles of luigi is that you don't depend on temporary states (that could be garbage collected, for instance). That's why most Tasks have complete states that are written to either a file or a database.

guidopetri avatar Apr 27 '20 22:04 guidopetri

Hello. Some temporary state is unavoidable. May avoiding temporary state be a principle or just something to be desired, it is clear that the task register does not respect its own fundamental principle (task_register.py line 46):

1. Cache instances of objects so that eg. ``X(1, 2, 3)`` always returns the
   same object.

I suspect that a lot of code depends implicitily on this behaviour, as it is central to Luigi. Fixing it will even improve performance in many unexpected places, for example it will not check twice for completion of a AWS task.

BlackIsTheNewBlack avatar May 06 '20 07:05 BlackIsTheNewBlack

anda completed task may also have valuable information which was computed and stored as state

I imagine one way to think in the luigi model (which I understand doesn't work well everywhere) for this situation could be to (a) Let the end of your task write this "state" to whatever globally accessible state in a .COMPLETE/.MARKER file or something. Would that be possible?

As for TaskRegister I think we implicitly meant that it's for withon one Python process :)

Tarrasch avatar May 30 '20 16:05 Tarrasch

Let the end of your task write this "state" to whatever globally accessible state in a .COMPLETE/.MARKER file or something. Would that be possible?

Sure it is possible, that's how we are using it. But if there are many dependencies for any task (think 1000 tasks or files), those will need to be checked twice for any task, one in normal usage, and then again after the task is discarded from memory. It is twice slower and more complex for no good reason.

BlackIsTheNewBlack avatar Jun 02 '20 18:06 BlackIsTheNewBlack

Hello, Just to to revive this issue and summarize the problem: task register does not respect its own fundamental principle (task_register.py line 46):

  1. Cache instances of objects so that eg. X(1, 2, 3) always returns the same object.

This has: a. correctness implications (because a task may be resurrected and executed twice) b. performance implications (because its status will be computed at least twice, which may be expensive and the status has always to be persisted). c. complexity implications - status has to be persisted and perhaps broken in multiple parts (as Tarrasch suggested here https://github.com/spotify/luigi/issues/2935#issuecomment-636354983)

Fixing this bug will automatically improve performance for workflows with many tasks or with remote tasks. Thanks!

BlackIsTheNewBlack avatar Aug 28 '20 12:08 BlackIsTheNewBlack

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If closed, you may revisit when your time allows and reopen! Thank you for your contributions.

stale[bot] avatar Jan 09 '22 02:01 stale[bot]