cleaner-hsm: `info` command needs more info, and better logging
Dear dCache devs,
I'm currently troubleshooting cleaner-hsm congestion (version 9.2.18). A challenge I'm facing is that I can't see which runs the cleaner has submitted to which pools, and for which of these runs it is still expecting a reply.
We have seen that our HSM script did too many retries for removals, and this caused timeouts between cleaner-hsm and the pool. Since the cleaner will do retries anyway, we have now disabled retries for removals in our HSM script. Still, things are not running as smoothly as I had hoped. It is currently very difficult to see:
- which runs have been submitted
- when
- to which pools
- and what the status of each run is.
It would be very helpful if such information was shown by the info command.
As for logging: with logging set to debug, new jobs are shown as "New run...", without any useful information. Relevant information appears only when a job finishes:
24 May 2024 14:53:21 (cleaner-hsm) [marten10_lofopsstage PoolRemoveFilesFromHSM] Received a remove reply from pool marten10_lofopsstage, which is no longer waited for. The cleaner pool timeout might be too small.
24 May 2024 14:53:21 (cleaner-hsm) [marten10_lofopsstage PoolRemoveFilesFromHSM] Pool delete responses for HSM osm: 834 success, 1166 failures
It would be helpful if the "New run" log entry included more details, especially to which node the run is submitted (so that I may monitor it there).
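In the meantime, the existing log lines can at least be summarised per pool. The following is only a minimal sketch, assuming the log format is exactly that of the two lines quoted above. The function name `summarise` and the regular expressions are my own and not part of dCache.

```python
import re
from collections import defaultdict

# Assumed format, taken from the two log lines quoted above:
#   [<pool> PoolRemoveFilesFromHSM] Pool delete responses for HSM <hsm>: N success, M failures
RESULT_RE = re.compile(
    r"\[(?P<pool>\S+) PoolRemoveFilesFromHSM\] Pool delete responses for HSM "
    r"(?P<hsm>\S+): (?P<ok>\d+) success, (?P<failed>\d+) failures"
)
#   [<pool> PoolRemoveFilesFromHSM] Received a remove reply from pool <pool>, which is no longer waited for.
LATE_RE = re.compile(
    r"\[(?P<pool>\S+) PoolRemoveFilesFromHSM\] Received a remove reply from pool "
    r"\S+, which is no longer waited for"
)


def summarise(lines):
    """Aggregate removal results and late replies per pool (hypothetical helper)."""
    stats = defaultdict(lambda: {"success": 0, "failures": 0, "late_replies": 0})
    for line in lines:
        m = RESULT_RE.search(line)
        if m:
            stats[m["pool"]]["success"] += int(m["ok"])
            stats[m["pool"]]["failures"] += int(m["failed"])
            continue
        m = LATE_RE.search(line)
        if m:
            # A reply that arrived after the cleaner stopped waiting for it.
            stats[m["pool"]]["late_replies"] += 1
    return dict(stats)
```

It can be fed any iterable of log lines, for example an open file handle on the log of the domain that hosts cleaner-hsm. This only shows finished runs, so it does not replace the information requested above.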
Additionally, there is this setting in the cleaner-hsm:
https://github.com/dCache/dcache/blob/0597b1f47a47738a023b5fc81fe2fec684f80926/skel/share/defaults/cleaner-hsm.properties#L64
It's currently not clear (to me at least) what effect this setting has. Is it the maximum number of concurrent runs? It might help if the info command showed what these threads were doing.
If someone were to pick this up, it might be efficient to apply similar changes to the cleaner-disk.
Thanks!