
EZP-24364 : clusterpurge enhancement to avoid infinite loop

Open tharkun opened this issue 9 years ago • 10 comments

The modifications are meant to avoid an infinite loop.

tharkun avatar May 12 '15 20:05 tharkun

ping @bdunogier

andrerom avatar May 12 '15 21:05 andrerom

FWIW: I am not certain if this resolves the (unexpected) condition of not being able to purge files.

My 2 cents: If certain files cannot be removed from the cluster, in theory requesting other, different files can avoid an infinite loop caused by fetching the same entries over and over again.

However, why can't the files be removed? Do they exist? And what if none of the files can be removed? The issue will still occur, regardless of the order...

Possible alternative: if certain files cannot be purged, log a warning and "blacklist" them in the select query until a limit is reached (for performance/memory reasons), then throw an exception?
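That blacklist idea could look roughly like this. A minimal sketch in Python for brevity; `fetch_expired`, `purge_file`, and `MAX_BLACKLIST` are hypothetical stand-ins, not the actual eZ Publish cluster handler API:

```python
# Hypothetical sketch of the "blacklist failed purges" alternative.
# fetch_expired() and purge_file() stand in for the real cluster
# handler calls; they are not actual eZ Publish APIs.

MAX_BLACKLIST = 1000  # safety limit, for performance/memory reasons


def purge_all(fetch_expired, purge_file, log):
    blacklist = set()
    while True:
        # Exclude files that already failed to purge from the query,
        # so the same entries are not fetched over and over again.
        batch = fetch_expired(exclude=blacklist)
        if not batch:
            break  # nothing left to purge: the loop terminates
        for path in batch:
            if purge_file(path):
                continue
            log("warning: could not purge %s" % path)
            blacklist.add(path)
            if len(blacklist) >= MAX_BLACKLIST:
                raise RuntimeError("too many unpurgeable files")
```

Even if every remaining file fails to purge, the blacklist grows on each pass, so the fetch eventually returns nothing (or the limit trips) and the loop cannot spin forever.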

joaoinacio avatar May 13 '15 09:05 joaoinacio

Very good points, @joaoinacio.

This PR probably improves/works around the situation, but I'm also interested in why the files aren't removed. If a file cannot be removed, maybe it should be flagged as such in the cluster tables, or removed from them?

bdunogier avatar May 13 '15 09:05 bdunogier

My 2c: this code seems a bit too complex for the problem at hand. I do not think it warrants adding APIs inside the cluster file handler (think about Oracle support as well).

gggeek avatar May 13 '15 09:05 gggeek

The purpose of this PR is not to fix the reason why the files can't be deleted from NFS or from the DB. There are too many reasons why deletion can fail (encoding, server errors...).

In my case, neither the file on NFS nor the row in MySQL is deleted, probably because of encoding issues.

The idea is just to avoid an infinite loop. Basically, the clusterpurge script is launched every week or month, so you can end up with several scripts running at the same time.

@joaoinacio: you're right. We could do it this way, but it would mean changing much more code in the cluster handler, which is stable.

tharkun avatar May 13 '15 10:05 tharkun

@tharkun now it is getting interesting.

Without getting into details, I'd rather prevent the items from being returned than randomize the query. What could happen if we just deleted the record from the database?

bdunogier avatar May 13 '15 12:05 bdunogier

@bdunogier deleting the record from the database would not break anything. I will certainly do it manually, as there are only a few such items compared to the number of records in the DFS tables. Still, there is no guarantee that the problem won't happen again, and the script could end up in an infinite loop again because there is nothing to stop the while loop.

The only reason why I added a random function to the query was to prevent these error items from being returned.
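Instead of randomizing the order, the loop could be guarded deterministically: stop as soon as a full batch makes no progress. A minimal Python sketch, where `fetch_expired` and `purge_file` are hypothetical stand-ins rather than the actual clusterpurge script internals:

```python
# Sketch of a progress guard for the purge loop.
# fetch_expired()/purge_file() are stand-ins, not real eZ Publish calls.

def purge_with_guard(fetch_expired, purge_file, batch_size=100):
    while True:
        batch = fetch_expired(limit=batch_size)
        if not batch:
            return  # nothing left to purge
        purged = sum(1 for path in batch if purge_file(path))
        # If an entire batch failed, fetching again would return the
        # same rows forever; bail out instead of looping indefinitely.
        if purged == 0:
            raise RuntimeError(
                "no progress: %d files could not be purged" % len(batch))
```

This keeps the query order stable (no `ORDER BY RAND()`), and faulty entries surface as a hard error instead of silently starving concurrently running scripts.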

tharkun avatar May 13 '15 13:05 tharkun

We look after a few clustered instances and do not have this problem. If it did appear, I would suggest that we put in a hack to (1) remove the record from the db and (2) log the occurrence. That way, we might stand a chance of tracking down the fundamental problem.

And a total shot in the dark: if you are using a db cluster, make certain you specify that the script use the admin siteaccess.

dougplant avatar May 13 '15 15:05 dougplant

If faulty items are deleted from the table, they won't be returned next time, and the issue won't happen. Or am I missing something?

bdunogier avatar May 13 '15 17:05 bdunogier

Is it maybe caused by a combination of the sleep and entries continuously being expired in the table? Or under high concurrency it could probably happen easily, even while the loop is just busy purging items.

andrerom avatar May 18 '15 12:05 andrerom