dcache
Potentially seeing double restores in 7.2
A user reported being unable to access a file whose locality was ONLINE_AND_NEARLINE. Our admin looked into it and observed a PoolManager restore request for the file while the file was already available on a pool, which is strange. I checked: the pool had been up for a long time, so it does not look like the pool just happened to be down at the moment the user tried to access the file.
More interestingly, it appears that two restores of this file were running in parallel.
The file was restored on one pool:
start restore: Tue Nov 2 07:33:11 CDT 2021 end restore: Tue Nov 2 09:00:04 CDT 2021
and on another pool:
start restore: Wed Nov 3 22:37:21 CDT 2021 end restore: Wed Nov 3 22:40:44 CDT 2021
The start/end times are when the HSM started/finished. The actual restore requests were put into the pool queues on October 28 and 29 respectively (according to the billing DB):
datestamp | action | transaction | errorcode | errormessage | connectiontime
----------------------------+---------+----------------------------------------------------------------------------------------+-----------+--------------+----------------
2021-10-29 10:43:07.166-05 | restore | pool:rw-protodune-stkendca1809-1@rw-protodune-stkendca1809-1Domain:1635522187166-12915 | 0 | | 202263
2021-10-28 11:50:33.885-05 | restore | pool:rw-protodune-stkendca1807-1@rw-protodune-stkendca1807-1Domain:1635439833885-12753 | 0 | | 5212326
(2 rows)
The user tried to access the file multiple times on Nov 3rd:
 datestamp | connectiontime | errorcode | errormessage
----------------------------+----------------+-----------+----------------------
2021-11-03 19:51:40.033-05 | 25105 | 10011 | Transfer interrupted
2021-11-03 19:51:14.835-05 | 60060 | 10011 | Transfer interrupted
2021-11-03 19:50:14.673-05 | 60061 | 10011 | Transfer interrupted
2021-11-03 19:49:14.562-05 | 60060 | 10011 | Transfer interrupted
2021-11-03 19:48:14.433-05 | 60060 | 10011 | Transfer interrupted
2021-11-03 19:47:14.252-05 | 60039 | 10011 | Transfer interrupted
(and complained that the file was not accessible. The user claims that the file's locality at the time was reported as ONLINE_AND_NEARLINE)
It goes without saying that the second restore should not have been triggered. I will investigate further to see whether there is some misconfiguration, or whether the cost of the first pool was too high and PoolManager chose staging rather than queuing. Both pools are members of the same pool group. So this is just a heads up, to keep track of it for now.
Further investigation tells me that PoolManager was restarted on Oct 29th. This could explain what happened.
Could you check how prevalent this issue is? Is it just the one file (or a few files) that was "double staged", or is it affecting a significant fraction of staging requests?
The scope could be massive. Say you have 100K restore requests queued on the pools, you restart PoolManager, and users start pre-staging the same files. If I understand correctly, this will result in double restores.

Why do users keep pre-staging? Because of the way they interact with the system: they check whether a file is ONLINE and, if it is not, they prestage it (dccp -P) and repeat this until the file is ONLINE. There could be as many duplicate restores as PoolManager restarts.

But it gets worse. It looks like while the second restore is still in the PoolManager queue and the first restore has completed, the user sees the file as ONLINE_AND_NEARLINE. The user submits a job, but nothing seems to happen because of the pending restore request in PoolManager. This is what drew user ire here at Fermilab. I need to double-check this, but this is what seems to be happening.
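The retry pattern described above can be sketched as follows; `get_locality` and `prestage` are hypothetical stand-ins for whatever call the user agents actually make (e.g. shelling out to `dccp -P`):

```python
import time

def poll_until_online(path, get_locality, prestage, interval=60, max_tries=100):
    """Check-then-prestage loop as described above (a sketch, not a real client).

    get_locality(path) -> locality string, e.g. "NEARLINE" or "ONLINE_AND_NEARLINE"
    prestage(path)     -> submits a stage request (the user agents use `dccp -P`)
    """
    for _ in range(max_tries):
        if get_locality(path).startswith("ONLINE"):
            return True
        prestage(path)  # every poll that still sees NEARLINE fires another stage request
        time.sleep(interval)
    return False
```

Each iteration that still sees NEARLINE submits another prestage; if PoolManager has meanwhile been restarted and lost track of the queued request, the repeat can become a second, independent restore on a different pool.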
What you say is true: the problem certainly could be severe.
However, I was wondering how many files have actually been staged onto separate pools (either in absolute numbers or as a fraction of all files being staged)?
You mentioned one user reporting a problem with one file, which is why I'm asking.
Correct me if I am mistaken here. This is my estimate of the effect of the PoolManager restart:
billing=# WITH foo AS (
              SELECT count(*) AS counter, pnfsid
              FROM storageinfo
              WHERE action = 'restore'
                AND errorcode = 0
                AND datestamp > '2021-10-21 00:00:00'
              GROUP BY pnfsid
              HAVING count(*) > 1
          )
          SELECT count(*) FROM foo;
count
-------
61617
(1 row)
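Note that the query above counts pnfsids with more than one successful restore, but it does not check whether those restores landed on distinct pools, which is what was actually asked. The pool name can be parsed out of the billing `transaction` field, which (going by the two sample rows earlier in the thread) appears to have the form `pool:<poolname>@<domain>:<timestamp>-<seq>`. A minimal sketch of that check in Python, using a placeholder pnfsid since the real one was not shown here:

```python
from collections import defaultdict

def pool_of(transaction: str) -> str:
    # transaction format assumed from the billing rows above:
    #   pool:<poolname>@<domain>:<timestamp>-<seq>
    return transaction.split(":")[1].split("@")[0]

# (pnfsid, transaction) rows; transactions copied from the billing output above,
# "0000ABCD" is a placeholder since the actual pnfsid was not shown in the thread
rows = [
    ("0000ABCD", "pool:rw-protodune-stkendca1809-1@rw-protodune-stkendca1809-1Domain:1635522187166-12915"),
    ("0000ABCD", "pool:rw-protodune-stkendca1807-1@rw-protodune-stkendca1807-1Domain:1635439833885-12753"),
]

pools_by_pnfsid = defaultdict(set)
for pnfsid, transaction in rows:
    pools_by_pnfsid[pnfsid].add(pool_of(transaction))

# pnfsids restored on more than one distinct pool
multi_pool = [p for p, pools in pools_by_pnfsid.items() if len(pools) > 1]
print(multi_pool)  # -> ['0000ABCD']
```

Against the real billing DB the same extraction could presumably be done directly in SQL (e.g. with PostgreSQL's split_part), but the field format above is an assumption read off the two sample rows, so verify it first.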
OK, thanks.
Do these multi-pool stages come in bursts, and do they correlate with PoolManager restarts?
Yes, they do correlate with PoolManager restarts. They come in groups (user agents pre-staging data by dataset).