dcache icon indicating copy to clipboard operation
dcache copied to clipboard

Pin request TTL exceeded [10006]

Open qiulan2021 opened this issue 4 years ago • 21 comments

We got Pin request TTL exceeded issue.

utilityDomain.log:12 Nov 2021 00:27:11 (PinManager) [toQ:24472:srm2:bringOnline:-2146816274:-2146816273 SrmManager PinManagerPin] Failed to pin 000099D215D4B1714314BB1C60BE5EADF899: Pin request TTL exceeded [10006] utilityDomain.log:12 Nov 2021 01:26:54 (PinManager) [toQ:24472:srm2:bringOnline:-2146816274:-2146816273 SrmManager PinManagerPin] Failed to pin 000099D215D4B1714314BB1C60BE5EADF899: Pin request TTL exceeded [10006] utilityDomain.log:12 Nov 2021 02:26:26 (PinManager) [toQ:24472:srm2:bringOnline:-2146816274:-2146816273 SrmManager PinManagerPin] Failed to pin 000099D215D4B1714314BB1C60BE5EADF899: Pin request TTL exceeded [10006] utilityDomain.log:12 Nov 2021 03:26:06 (PinManager) [toQ:24472:srm2:bringOnline:-2146816274:-2146816273 SrmManager PinManagerPin] Failed to pin 000099D215D4B1714314BB1C60BE5EADF899: Pin request TTL exceeded [10006] utilityDomain.log:12 Nov 2021 04:25:45 (PinManager) [toQ:24472:srm2:bringOnline:-2146816274:-2146816273 SrmManager PinManagerPin] Failed to pin 000099D215D4B1714314BB1C60BE5EADF899: Pin request TTL exceeded [10006] utilityDomain.log:12 Nov 2021 05:25:21 (PinManager) [toQ:24472:srm2:bringOnline:-2146816274:-2146816273 SrmManager PinManagerPin] Failed to pin 000099D215D4B1714314BB1C60BE5EADF899: Pin request TTL exceeded [10006] utilityDomain.log:12 Nov 2021 06:25:00 (PinManager) [toQ:24472:srm2:bringOnline:-2146816274:-2146816273 SrmManager PinManagerPin] Failed to pin 000099D215D4B1714314BB1C60BE5EADF899: Pin request TTL exceeded [10006] utilityDomain.log:12 Nov 2021 07:24:39 (PinManager) [toQ:24472:srm2:bringOnline:-2146816274:-2146816273 SrmManager PinManagerPin] Failed to pin 000099D215D4B1714314BB1C60BE5EADF899: Pin request TTL exceeded [10006] utilityDomain.log:12 Nov 2021 08:24:26 (PinManager) [toQ:24472:srm2:bringOnline:-2146816274:-2146816273 SrmManager PinManagerPin] Failed to pin 000099D215D4B1714314BB1C60BE5EADF899: Pin request TTL exceeded [10006]

The files failed to stage with this issue.

qiulan2021 avatar Nov 12 '21 22:11 qiulan2021

There are different entries on the Postgres database associated with this:

2021-11-14 21:30:28.948 EST [27376] STATEMENT:  UPDATE pins SET state = $1,request_id = $2 WHERE id = $3
2021-11-14 23:05:53.802 EST [11690] ERROR:  could not serialize access due to concurrent delete
2021-11-14 23:05:53.802 EST [11690] STATEMENT:  UPDATE pins SET state = $1,request_id = $2 WHERE id = $3
2021-11-14 23:05:53.850 EST [11118] ERROR:  could not serialize access due to concurrent update
2021-11-14 23:05:53.850 EST [11118] STATEMENT:  UPDATE pins SET state = $1,request_id = $2 WHERE id = $3

cfgamboa avatar Nov 15 '21 13:11 cfgamboa

@paulmillar is there any parameter that alleviates the contention observed to be set at the pin manager side?

cfgamboa avatar Nov 15 '21 13:11 cfgamboa

Hi. Which dCache version is this?

lemora avatar Nov 15 '21 15:11 lemora

It's dcache 7.2.3

qiulan2021 avatar Nov 15 '21 15:11 qiulan2021

Okay. Do you use a single PinManager or several in this instance?

lemora avatar Nov 15 '21 16:11 lemora

Single one.

On Nov 15, 2021, at 11:10 AM, Lea @.***> wrote:

Okay. Do you use a single PinManager or several in this instance?

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/dCache/dcache/issues/6251#issuecomment-969065750, or unsubscribe https://github.com/notifications/unsubscribe-auth/AHIHMO53VSMWPQGIUP4VKF3UMEWFZANCNFSM5H5Y7TQA. Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

cfgamboa avatar Nov 15 '21 16:11 cfgamboa

Could you please post your PinManager config -- did you change any values of the new fields (pinmanager.max-unpins-per-run and pinmanager.reset-failed-unpins-period) or other ones?

lemora avatar Nov 15 '21 16:11 lemora

Also, how large is your pins database table?

lemora avatar Nov 15 '21 16:11 lemora

No those parameter were not changed,

The parameters enabled last Friday are

pinmanager.cell.max-message-threads = 100 pinmanager.cell.max-messages-queued = 10000

Currently these are the parameters explicitly set:

pinmanager.db.connections.max=150 pinmanager.cell.max-message-threads = 100 pinmanager.cell.max-messages-queued = 10000 pnfsmanager.limits.threads=12 pinmanager.limits.pin-duration=259200000

Other will follow up the defaults

The pins table has about 10K rows.

Carlos

On Nov 15, 2021, at 11:13 AM, Lea @.***> wrote:

Could you please post your PinManager config -- did you change any values of the new fields (pinmanager.max-unpins-per-run and pinmanager.reset-failed-unpins-period) or other ones?

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/dCache/dcache/issues/6251#issuecomment-969069737, or unsubscribe https://github.com/notifications/unsubscribe-auth/AHIHMO637MHFHTT7T74ITFLUMEWT5ANCNFSM5H5Y7TQA. Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

cfgamboa avatar Nov 15 '21 16:11 cfgamboa

I see the same error.

14 Nov 2021 09:50:50 [PinManager-64499] [ZiA:1937339:srm2:bringOnline:-2051307565:-2051307515 SrmManager PinManagerPin] Failed to pin 00004F17E23E66F441299D26E1711B13E286: Pin request TTL exceeded [10006]

7.2

DmitryLitvintsev avatar Nov 15 '21 21:11 DmitryLitvintsev

srm=# select count(*), state from pins group by state;
 count  |     state      
--------+----------------
 272176 | PINNED
  23315 | PINNING
     55 | READY_TO_UNPIN
(3 rows)

No HA, each service is singlet no special setting other than:

pinmanager.db.connections.max = 128
pinmanager.limits.pin-duration = 3
pinmanager.limits.pin-duration.unit = DAYS

bring-online fail rate:

 srm=# select count(*), js.state from bringonlinerequests bo, srmjobstate js where js.id = bo.state group by js.state;
 count | state  
-------+--------
  5939 | Failed
  2474 | Done
(2 rows)

distribution by error:

srm=# select count(*), bo.errormessage from bringonlinerequests bo, srmjobstate js where js.id = bo.state and js.state = 'Failed' group by bo.errormessage;
 count |             errormessage             
-------+--------------------------------------
  2932 | File requests have failed.
  3007 | Request lifetime (12 hours) expired.
(2 rows)

So as you can see bring-online hardly works.

DmitryLitvintsev avatar Nov 15 '21 21:11 DmitryLitvintsev

Hi all,

The error message Pin request TTL exceeded [10006] indicates a problem when pinmanager is talking with other dCache components. Unfortunately, the message is not so precise: it doesn't say with which component was the problem and what exactly went wrong. To give you some examples: the problem could come from talking with PnfsManager, PoolManager or the pool. The problem could be that the remote dCache service (pnfsmanager/poolmanager/pool) took too long to reply, there's no network connection to that component, or pinmanager "didn't like" the response.

To put this in some context, when pinmanager receives a request to pin a file (typically from a door, like SRM), that request contains a lifetime. This lifetime is set by the door. It's the time the door will wait for pinmanager to pin the file before "giving up" and adopting a recovery strategy (e.g., reporting an error back to client).

When there's a problem with pinmanager communicating with another dCache service, pinmanager will wait "a bit" (either 30 seconds or 10 ms, depending) and have another go. However, before "sleeping" for the retrying time, pinmanager will check that the request is still valid (lifetime won't have expired) after the delay. If the the request will not be valid after the delay then pinmanager reports the above error message.

The lifetime for a pin request is typically quite large (e.g., Dmitry's logs show 12 hours). Therefore, the problem appears to be non-transient.

So, (all in all) the problem (likely) lies outside of pinmanager itself, but more information is needed to understand exactly what is going wrong.

Aside: the message could not serialize access due to concurrent delete is likely harmless. It's a consequence of how pinmanager uses the database and simply indicates that the database is somewhat struggling to keep up. It's an error that's anticipated in pinmanager's design and should simply trigger an internal retry when writing data to the database.

paulmillar avatar Nov 16 '21 09:11 paulmillar

Patch: https://rb.dcache.org/r/13266/

paulmillar avatar Nov 16 '21 12:11 paulmillar

Here is an example:

16 Nov 2021 09:13:14 [PinManager-70088] [ZiA:1984379:srm2:bringOnline:-2051264169:-2051264135 SrmManager PinManagerPin] Failed to pin 0000DEFEACE9743A4AD6A9D673EF7EF38038: Pin request TTL exceeded [10006]

I see a queued restore request for this file :

   e6303a13-007f-4179-b8bc-e50d2cb1d997 QUEUED Tue Nov 16 02:13:08 CST 2021 0000DEFEACE9743A4AD6A9D673EF7EF38038 dune.protodune-sp-full-reconstructed

Well... SRM error associated with this request looks interesting:

srm=# select id, UNIX2TIMESTAMP(creationtime),  UNIX2TIMESTAMP(laststatetransitiontime), errormessage from bringonlinefilerequests where requestid = -2051264169  and id = -2051264135;
     id      |     unix2timestamp     |     unix2timestamp     |                                                                                                            
               errormessage                                                                                                                            
-------------+------------------------+------------------------+------------------------------------------------------------------------------------------------------------
-------------------------------------------------------------------------------------------------------------------------------------------------------
 -2051264135 | 2021-11-16 02:13:08-06 | 2021-11-16 09:44:31-06 | Failed to pin file [rc=10011,msg=org.springframework.transaction.CannotCreateTransactionException: Could no
t open JDBC Connection for transaction; nested exception is java.sql.SQLException: HikariDataSource HikariDataSource (HikariPool-1) has been closed.].
(1 row)

DmitryLitvintsev avatar Nov 16 '21 15:11 DmitryLitvintsev

uses the database and simply indicates that the database is somewhat struggling to keep up

Looking at the database instance average execution time for query response is about .1 sec. So it does not seem to be struggling at least nothing obvious.

cfgamboa avatar Nov 17 '21 16:11 cfgamboa

I have seen a bunch of these log messages Failed to pin 0000360C77B27FE841068DC55777FE877644: Pin request TTL exceeded [10006] before; already in 6.2. It seems to be related to srm specifically.

Have you checked whether this also occurs in previous versions under higher srm request load?

lemora avatar Nov 18 '21 09:11 lemora

Since this seems to only affect bring-online request, my current suspicion is that the PoolManager (who in turn triggers staging the requested files to appropriate pools) requests run into the default timeouts of 1h repeatedly. If the staging load is high enough, it eventually ends up failing because the general pin request lifetime is exceeded. Fetching data from tape can be expected to take a longer time.

Paul's patch will come in handy to understand if that could indeed be the case. Otherwise one might also try increasing the value for the following property (as well as the max pin request lifetime):

pinmanager.service.poolmanager.timeout=3600
pinmanager.service.poolmanager.timeout.unit=SECONDS

lemora avatar Nov 18 '21 10:11 lemora

Ok what are the next steps to further debug this? Are we waiting on a new dCache release.

dougbenjamin avatar Nov 23 '21 16:11 dougbenjamin

Have you tried tweaking your configuration according to our suggestions? If you have, does it help? If you haven't, it would be good if you tried. There is currently no change to how these components work, Paul only introduced additional logging that may provide additional insights.

As I said, this does not seem to be new behaviour as far as I could see, and may need to be handled via configuration if access pattern intensity changes. Additional data points that point to something else are of course welcome. I will continue looking into this though.

lemora avatar Nov 23 '21 17:11 lemora

Just an FYI, at Fermilab we have set:

pinmanager.service.poolmanager.timeout=604800

(one week). Have not seen these errors since, not have seen any adverse behavior.

DmitryLitvintsev avatar Nov 23 '21 18:11 DmitryLitvintsev

We (ie me) have update the timeout setting to 604800 like FNAL. restarted dCache and asked for another 100 TB transfer from tape (SRM+https). Will post an update when we have it.

dougbenjamin avatar Nov 23 '21 19:11 dougbenjamin