dcache icon indicating copy to clipboard operation
dcache copied to clipboard

TRANSFER ERROR: Copy failed (3rd push). Last attempt: Transfer failure: Failed to select pool

Open cfgamboa opened this issue 3 years ago • 7 comments
trafficstars

Hello

The following is an extract from the FTS log for a transfer failing attempting a TCP (push mode) via RTM

INFO    Wed, 07 Sep 2022 23:42:41 -0400; Source url: davs://dcbldoor.sdcc.bnl.gov:443/pnfs/sdcc.bnl.gov/data/bellediskdata/TMP/belle/test/TPC/1G-test?copy_mode=push
INFO    Wed, 07 Sep 2022 23:42:41 -0400; Dest url: davs://dcachewebdav-kit.gridka.de:2880/pnfs/gridka.de/belle/disk-only/TMP/belle/test/TPC/1G-test-push-BNL-TMP-SE


INFO    Wed, 07 Sep 2022 23:47:42 -0400; [1662608862261] BOTH http_plugin	TRANSFER:EXIT	ERROR: Copy failed (3rd push). Last attempt: Transfer failure: Failed to select pool: Request to [>PoolManager@dcblcore03Domain] timed out.\n
ERR     Wed, 07 Sep 2022 23:47:42 -0400; Recoverable error: [5] TRANSFER ERROR: Copy failed (3rd push). Last attempt: Transfer failure: Failed to select pool: Request to [>PoolManager@dcblcore03Domain] timed out.\n

I was able to reproduced similar transfer

gfal-copy davs://dcbldoor04.sdcc.bnl.gov:443/pnfs/sdcc.bnl.gov/data/bellediskdata/TMP/belle/test/TPC/1G-test?copy_mode=push davs://dcdoor01.usatlas.bnl.gov:443/pnfs/sdcc.bnl.gov/users/cgamboa/testb2tcp-push

At the RTM cell

We are waiting for the following messages
<-8950831661387595737:-8712667021042240632> r=38 sec;msg=diskCacheV111.vehicles.PoolMgrSelectReadPoolMsg/PnfsId=000085E35223830B4955A88D93687AEA93AB;StorageInfo=size=0;new=false;stored=false;sClass=bellediskdata:BELLEDISKDATA;cClass=-;hsm=osm;accessLatency=NEARLINE;retentionPolicy=CUSTODIAL;path=/pnfs/sdcc.bnl.gov/data/bellediskdata/TMP/belle/test/TPC/1G-test;uid=54013;writeToken=2;gid=69960;StoreName=bellediskdata;links=00002E68395D653F483C81E24A2A4BB894C4 1G-test;store=bellediskdata;group=BELLEDISKDATA;bfid=<Unknown>;;
<-1940767579387507220:-5646637552986661129> r=53 sec;msg=org.dcache.poolmanager.PoolMgrGetUpdatedHandler/

At the pool manager

08 Sep 2022 16:04:38 [pool-7-thread-43] [door:WebDAV2-dcbldoor04-2@webdav2-dcbldoor04_httpsDomain:AAXoL3Vi2qg RemoteTransferManager PoolMgrSelectReadPool 000085E35223830B4955A88D93687AEA93AB] ACTIVATING STATE ENGINE 000085E35223830B4955A88D93687AEA93AB 2064423

Looking at the database for RTM there not information of the file (pool is not available)

transfermanagers=# select * from "REMOTETRANSFERS_B" where "REMOTEURL" like '%https://dcdoor01.usatlas.bnl.gov:443/pnfs/sdcc.bnl.gov/users/cgamboa/testb2tcp-push%';
        ID        | CREATED | CREATIONTIME  | CREDENTIALID | LIFETIME | MOOVERID |                PNFSID                |                            LOCALPATH                             | POOLNAME |                                      REMOTEURL                                      | STATE | WRITE 
------------------+---------+---------------+--------------+----------+----------+--------------------------------------+------------------------------------------------------------------+----------+-------------------------------------------------------------------------------------+-------+-------
 1662667285234000 | f       | 1662667285234 |            0 |        0 |          | 000085E35223830B4955A88D93687AEA93AB | /pnfs/sdcc.bnl.gov/data/bellediskdata/TMP/belle/test/TPC/1G-test |          | https://dcdoor01.usatlas.bnl.gov:443/pnfs/sdcc.bnl.gov/users/cgamboa/testb2tcp-push |    -1 | f
(1 row)

So I could not find an evident error trace/excemption on the logs to point out why the RTM fails to get the pool information

I can get the file with a simple transfer.

Carlos

cfgamboa avatar Sep 08 '22 20:09 cfgamboa

This sounds like the underlying cause is that there are no pools available with that file.

PoolManager will wait (until a timeout) under these circumstances, even if there are no locations available.

If so, the transfer failure is reasonable, but the error message is confusing.

paulmillar avatar Sep 09 '22 07:09 paulmillar

Unfortunately the problem is even more confusing as the pool is available and it holds the file. The file can be retrieved via a simple transfer

gfal-copy davs://dcbldoor04.sdcc.bnl.gov:443/pnfs/usatlas.bnl.gov/data/bellediskdata/TMP/belle/test/TPC/1G-test .

cfgamboa avatar Sep 09 '22 14:09 cfgamboa

The same transfer succeeded via PULL mode

 gfal-copy -f davs://dcbldoor03.sdcc.bnl.gov:2881/pnfs/sdcc.bnl.gov/data/bellediskdata/TMP/belle/test/TPC/1G-test?copy_mode=pull davs://dcdoor01.usatlas.bnl.gov:443/pnfs/usatlas.bnl.gov/users/cgamboa/testb2tcp-push
Copying davs://dcbldoor03.sdcc.bnl.gov:2881/pnfs/sdcc.bnl.gov/data/bellediskdata/TMP/belle/test/TPC/1G-test?copy_mode=pull   [DONE]  after 68s

cfgamboa avatar Sep 09 '22 20:09 cfgamboa

Looking at the log the pool manager on log all I see entries for

09 Sep 2022 15:30:02 (PoolManager) [SW0:O70 admin] addToEventQueue : lock found for : <CM: S=[>PoolManager@dcblcore03Domain];D=[>admin@dcblcore03Domain];C=java.lang.String;[email protected]/0-RemoteHttpsDataTransfer/1 m=0 r=0 [<unknown>] [Suspended: Pool to pool not permitted 09.09 14:39:06] {0,}

The I see the pool manager is attempting a restore request

 (PoolManager@dcblcore03Domain) admin > rc ls
[email protected]/0-RemoteHttpsDataTransfer/1 m=23 r=0 [<unknown>] [Suspended: Pool to pool not permitted 09.09 15:54:04] {0,}

This is a disk only storage so should not attempt to do this.

cfgamboa avatar Sep 09 '22 20:09 cfgamboa

OK, so there's a few things going on here.

First, the file is on a pool, but pool-manager didn't select that pool. Unfortunately, we don't know why. Possible reasons include:

  • pool-manager thinking (for whatever reason) that the pool is offline,
  • the pool is overloaded,
  • [...]

Does this make sense? Was the pool with this file "overloaded" at that time?

Having decided that the pool isn't available for the transfer, pool-manager then ruled out a pool-to-pool transfer. Seemingly, this is because (apparently) pool-to-pool transfers are not allowed to any pools that would be allowed to make the third-party transfer.

Does this make sense? Is pool-manager configured so that pool-to-pool transfers are not allowed?

Finally, you say "I see the pool manager is attempting a restore request"; however, you didn't showed evidence that pool-manager made a stage request. Instead, it looks like pool-manager simply suspended the request. In effect, pool-manager is waiting just in case another pool boots up and happens to have a replica of the file.

I'm not 100% certain, but it seems that the pool with the file was (in some sense) overloaded and therefore dCache failed the request.

Unfortunately, none of the error messages make it clear what's going on. This is a long-standing problem with pool-manager. Unfortunately, there's no quick fixes for this.

Assuming the above is correct, to resolve the problem you can do one of the following:

  1. update configuration to allow more concurrent transfers on a single pool,
  2. add more hardware (more pools) to spread the load over more pools,
  3. update configuration to allow pool-to-pool transfers,
  4. update FTS configuration to reduce the number of concurrent transfers.

Each decision carries consequences, which should be understood before going ahead.

HTH, Paul.

paulmillar avatar Sep 12 '22 08:09 paulmillar

Hello,

No the pool is not offline and it is not overloaded.

All the best, Carlos

cfgamboa avatar Sep 12 '22 13:09 cfgamboa

Separate from the underlying problem (which we currently don't understand), there's a compounding problem in that pool-manager only records part of its decision-making progress.

The log entry says Pool to pool not permitted as a reason why the transfer was suspended; however, (unfortunately) this log entry doesn't say why it wanted to initiate a pool-to-pool transfer. The log entry is silent on why pool-manager didn't simply select the pool with this file's replica.

We may need to fix this lack of information (in the log entry) problem, so the underlying problem may be understood and fixed.

paulmillar avatar Sep 18 '22 14:09 paulmillar

FYI, this issue is being trace on the dcache support RT system, 10374.

cfgamboa avatar Sep 26 '22 14:09 cfgamboa

older duplicate, now tracked on RT

lemora avatar Oct 17 '22 07:10 lemora