
Potentially seeing double restores in 7.2

Open DmitryLitvintsev opened this issue 4 years ago • 7 comments

A user reported being unable to access a file that was ONLINE_AND_NEARLINE. Our admin looked and reported that he observed a PoolManager restore request while the file was available on a pool, which is strange. I checked: the pool had been up for a long time, so it does not look like the pool happened to be down just when the user tried to access the file.

More interestingly, it appears that there were two restores of this file running in parallel.

The file was restored on one pool:

start restore: Tue Nov 2 07:33:11 CDT 2021
end restore:   Tue Nov 2 09:00:04 CDT 2021

and on another pool:

start restore: Wed Nov 3 22:37:21 CDT 2021
end restore:   Wed Nov 3 22:40:44 CDT 2021

The start/end times are when the HSM started/finished. The actual restore requests were put into the pool queues on 28 and 29 October respectively (according to the billing DB).

         datestamp          | action  |                                      transaction                                       | errorcode | errormessage | connectiontime 
----------------------------+---------+----------------------------------------------------------------------------------------+-----------+--------------+----------------
 2021-10-29 10:43:07.166-05 | restore | pool:rw-protodune-stkendca1809-1@rw-protodune-stkendca1809-1Domain:1635522187166-12915 |         0 |              |         202263
 2021-10-28 11:50:33.885-05 | restore | pool:rw-protodune-stkendca1807-1@rw-protodune-stkendca1807-1Domain:1635439833885-12753 |         0 |              |        5212326
(2 rows)
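
For reference, output like the table above can be pulled from the billing DB with a query along these lines (a sketch only, against the same storageinfo table queried later in this thread; the pnfsid value is a placeholder for the affected file's id, which is not shown here):

select datestamp, action, transaction, errorcode, errormessage, connectiontime
  from storageinfo
 where action = 'restore'
   and pnfsid = '<pnfsid of the affected file>'
 order by datestamp;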

The user tried to access the file multiple times on Nov 3rd:

          datestamp          | connectiontime | errorcode |     errormessage     
----------------------------+----------------+-----------+----------------------

 2021-11-03 19:51:40.033-05 |          25105 |     10011 | Transfer interrupted
 2021-11-03 19:51:14.835-05 |          60060 |     10011 | Transfer interrupted
 2021-11-03 19:50:14.673-05 |          60061 |     10011 | Transfer interrupted
 2021-11-03 19:49:14.562-05 |          60060 |     10011 | Transfer interrupted
 2021-11-03 19:48:14.433-05 |          60060 |     10011 | Transfer interrupted
 2021-11-03 19:47:14.252-05 |          60039 |     10011 | Transfer interrupted

(The user complained that the file was not accessible, and claims that the file's locality at the time was being reported as ONLINE_AND_NEARLINE.)

It goes without saying that the second restore should not have been triggered. I will investigate further to see whether there is some misconfiguration, or whether the cost of the first pool was too high and PoolManager chose staging over queuing. Both pools are members of the same pool group. So this is just a heads up, to keep track of it for now.

DmitryLitvintsev avatar Nov 06 '21 18:11 DmitryLitvintsev

Further investigation tells me that PoolManager was restarted on Oct 29th. This could explain what happened.

DmitryLitvintsev avatar Nov 07 '21 17:11 DmitryLitvintsev

Could you check how prevalent this issue is? Is it just the one file (or a few files) that was "double staged", or is it affecting a significant fraction of staging requests?

paulmillar avatar Nov 08 '21 13:11 paulmillar

The scope could be massive. Say you have 100K restore requests queued on the pools. You restart PoolManager and users start pre-staging the same files. If I understand correctly, this will result in double restores. Why do users keep pre-staging? Because of the way they interact with the system: they check if the file is ONLINE and, if it is not, they prestage it (dccp -P) and repeat this until the file is ONLINE. There could be as many restores as there are PoolManager restarts.

But it gets worse. It looks like while the second restore is still sitting in the PoolManager queue and the first restore has already completed, the user sees the file as ONLINE_AND_NEARLINE. The user submits a job, but nothing seems to happen because of the pending restore request in the PoolManager. This is what drew user ire here at Fermilab. I need to double check this, but this is what seems to be happening.

DmitryLitvintsev avatar Nov 08 '21 13:11 DmitryLitvintsev

What you say is true: the problem certainly could be severe.

However, I was wondering how many files have actually been staged onto separate pools (either in absolute numbers or as a fraction of all files being staged)?

You mentioned one user reporting a problem with one file, which is why I'm asking.

paulmillar avatar Nov 08 '21 14:11 paulmillar

Correct me if I am mistaken here. This is my estimate of the effect of the PoolManager restart:

billing=# with foo as (select count(*) as counter, pnfsid from storageinfo where action='restore' and errorcode = 0 and datestamp > '2021-10-21 00:00:00' group by pnfsid having count(*) > 1) select count(*) from foo;
 count 
-------
 61617
(1 row)
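
The count above includes any pnfsid with more than one successful restore, regardless of pool. As a possible refinement (just a sketch, assuming PostgreSQL and the pool name embedded in the transaction column as pool:<name>@<domain>:<id>, as seen in the billing output above), one could count only files restored onto more than one distinct pool:

with dups as (
    select pnfsid,
           count(distinct split_part(split_part(transaction, ':', 2), '@', 1)) as pools
      from storageinfo
     where action = 'restore'
       and errorcode = 0
       and datestamp > '2021-10-21 00:00:00'
     group by pnfsid
)
select count(*) from dups where pools > 1;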

DmitryLitvintsev avatar Nov 08 '21 15:11 DmitryLitvintsev

OK, thanks.

Do these multi-pool stages come in bursts, and do they correlate with PoolManager restarts?

paulmillar avatar Nov 08 '21 15:11 paulmillar

Yes, they do correlate with PoolManager restarts. They come in groups (user agents pre-staging data by dataset).
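
To make the correlation visible, one could bucket the duplicate restores by day and line the spikes up against known PoolManager restart dates — again just a sketch against the same storageinfo table as the query above:

with dups as (
    select pnfsid
      from storageinfo
     where action = 'restore'
       and errorcode = 0
       and datestamp > '2021-10-21 00:00:00'
     group by pnfsid
    having count(*) > 1
)
select date_trunc('day', s.datestamp) as day, count(*) as restores
  from storageinfo s
  join dups d on s.pnfsid = d.pnfsid
 where s.action = 'restore'
   and s.errorcode = 0
   and s.datestamp > '2021-10-21 00:00:00'
 group by 1
 order by 1;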

DmitryLitvintsev avatar Nov 08 '21 15:11 DmitryLitvintsev