dcache
dcache copied to clipboard
TRANSFER ERROR: Copy failed (3rd push). Last attempt: Transfer failure: Failed to select pool
Hello
The following is an extract from the FTS log for a transfer failing attempting a TCP (push mode) via RTM
INFO Wed, 07 Sep 2022 23:42:41 -0400; Source url: davs://dcbldoor.sdcc.bnl.gov:443/pnfs/sdcc.bnl.gov/data/bellediskdata/TMP/belle/test/TPC/1G-test?copy_mode=push
INFO Wed, 07 Sep 2022 23:42:41 -0400; Dest url: davs://dcachewebdav-kit.gridka.de:2880/pnfs/gridka.de/belle/disk-only/TMP/belle/test/TPC/1G-test-push-BNL-TMP-SE
INFO Wed, 07 Sep 2022 23:47:42 -0400; [1662608862261] BOTH http_plugin TRANSFER:EXIT ERROR: Copy failed (3rd push). Last attempt: Transfer failure: Failed to select pool: Request to [>PoolManager@dcblcore03Domain] timed out.\n
ERR Wed, 07 Sep 2022 23:47:42 -0400; Recoverable error: [5] TRANSFER ERROR: Copy failed (3rd push). Last attempt: Transfer failure: Failed to select pool: Request to [>PoolManager@dcblcore03Domain] timed out.\n
I was able to reproduced similar transfer
gfal-copy davs://dcbldoor04.sdcc.bnl.gov:443/pnfs/sdcc.bnl.gov/data/bellediskdata/TMP/belle/test/TPC/1G-test?copy_mode=push davs://dcdoor01.usatlas.bnl.gov:443/pnfs/sdcc.bnl.gov/users/cgamboa/testb2tcp-push
At the RTM cell
We are waiting for the following messages
<-8950831661387595737:-8712667021042240632> r=38 sec;msg=diskCacheV111.vehicles.PoolMgrSelectReadPoolMsg/PnfsId=000085E35223830B4955A88D93687AEA93AB;StorageInfo=size=0;new=false;stored=false;sClass=bellediskdata:BELLEDISKDATA;cClass=-;hsm=osm;accessLatency=NEARLINE;retentionPolicy=CUSTODIAL;path=/pnfs/sdcc.bnl.gov/data/bellediskdata/TMP/belle/test/TPC/1G-test;uid=54013;writeToken=2;gid=69960;StoreName=bellediskdata;links=00002E68395D653F483C81E24A2A4BB894C4 1G-test;store=bellediskdata;group=BELLEDISKDATA;bfid=<Unknown>;;
<-1940767579387507220:-5646637552986661129> r=53 sec;msg=org.dcache.poolmanager.PoolMgrGetUpdatedHandler/
At the pool manager
08 Sep 2022 16:04:38 [pool-7-thread-43] [door:WebDAV2-dcbldoor04-2@webdav2-dcbldoor04_httpsDomain:AAXoL3Vi2qg RemoteTransferManager PoolMgrSelectReadPool 000085E35223830B4955A88D93687AEA93AB] ACTIVATING STATE ENGINE 000085E35223830B4955A88D93687AEA93AB 2064423
Looking at the database for RTM there not information of the file (pool is not available)
transfermanagers=# select * from "REMOTETRANSFERS_B" where "REMOTEURL" like '%https://dcdoor01.usatlas.bnl.gov:443/pnfs/sdcc.bnl.gov/users/cgamboa/testb2tcp-push%';
ID | CREATED | CREATIONTIME | CREDENTIALID | LIFETIME | MOOVERID | PNFSID | LOCALPATH | POOLNAME | REMOTEURL | STATE | WRITE
------------------+---------+---------------+--------------+----------+----------+--------------------------------------+------------------------------------------------------------------+----------+-------------------------------------------------------------------------------------+-------+-------
1662667285234000 | f | 1662667285234 | 0 | 0 | | 000085E35223830B4955A88D93687AEA93AB | /pnfs/sdcc.bnl.gov/data/bellediskdata/TMP/belle/test/TPC/1G-test | | https://dcdoor01.usatlas.bnl.gov:443/pnfs/sdcc.bnl.gov/users/cgamboa/testb2tcp-push | -1 | f
(1 row)
So I could not find an evident error trace/excemption on the logs to point out why the RTM fails to get the pool information
I can get the file with a simple transfer.
Carlos
This sounds like the underlying cause is that there are no pools available with that file.
PoolManager will wait (until a timeout) under these circumstances, even if there are no locations available.
If so, the transfer failure is reasonable, but the error message is confusing.
Unfortunately the problem is even more confusing as the pool is available and it holds the file. The file can be retrieved via a simple transfer
gfal-copy davs://dcbldoor04.sdcc.bnl.gov:443/pnfs/usatlas.bnl.gov/data/bellediskdata/TMP/belle/test/TPC/1G-test .
The same transfer succeeded via PULL mode
gfal-copy -f davs://dcbldoor03.sdcc.bnl.gov:2881/pnfs/sdcc.bnl.gov/data/bellediskdata/TMP/belle/test/TPC/1G-test?copy_mode=pull davs://dcdoor01.usatlas.bnl.gov:443/pnfs/usatlas.bnl.gov/users/cgamboa/testb2tcp-push
Copying davs://dcbldoor03.sdcc.bnl.gov:2881/pnfs/sdcc.bnl.gov/data/bellediskdata/TMP/belle/test/TPC/1G-test?copy_mode=pull [DONE] after 68s
Looking at the log the pool manager on log all I see entries for
09 Sep 2022 15:30:02 (PoolManager) [SW0:O70 admin] addToEventQueue : lock found for : <CM: S=[>PoolManager@dcblcore03Domain];D=[>admin@dcblcore03Domain];C=java.lang.String;[email protected]/0-RemoteHttpsDataTransfer/1 m=0 r=0 [<unknown>] [Suspended: Pool to pool not permitted 09.09 14:39:06] {0,}
The I see the pool manager is attempting a restore request
(PoolManager@dcblcore03Domain) admin > rc ls
[email protected]/0-RemoteHttpsDataTransfer/1 m=23 r=0 [<unknown>] [Suspended: Pool to pool not permitted 09.09 15:54:04] {0,}
This is a disk only storage so should not attempt to do this.
OK, so there's a few things going on here.
First, the file is on a pool, but pool-manager didn't select that pool. Unfortunately, we don't know why. Possible reasons include:
- pool-manager thinking (for whatever reason) that the pool is offline,
- the pool is overloaded,
- [...]
Does this make sense? Was the pool with this file "overloaded" at that time?
Having decided that the pool isn't available for the transfer, pool-manager then ruled out a pool-to-pool transfer. Seemingly, this is because (apparently) pool-to-pool transfers are not allowed to any pools that would be allowed to make the third-party transfer.
Does this make sense? Is pool-manager configured so that pool-to-pool transfers are not allowed?
Finally, you say "I see the pool manager is attempting a restore request"; however, you didn't showed evidence that pool-manager made a stage request. Instead, it looks like pool-manager simply suspended the request. In effect, pool-manager is waiting just in case another pool boots up and happens to have a replica of the file.
I'm not 100% certain, but it seems that the pool with the file was (in some sense) overloaded and therefore dCache failed the request.
Unfortunately, none of the error messages make it clear what's going on. This is a long-standing problem with pool-manager. Unfortunately, there's no quick fixes for this.
Assuming the above is correct, to resolve the problem you can do one of the following:
- update configuration to allow more concurrent transfers on a single pool,
- add more hardware (more pools) to spread the load over more pools,
- update configuration to allow pool-to-pool transfers,
- update FTS configuration to reduce the number of concurrent transfers.
Each decision carries consequences, which should be understood before going ahead.
HTH, Paul.
Hello,
No the pool is not offline and it is not overloaded.
All the best, Carlos
Separate from the underlying problem (which we currently don't understand), there's a compounding problem in that pool-manager only records part of its decision-making progress.
The log entry says Pool to pool not permitted as a reason why the transfer was suspended; however, (unfortunately) this log entry doesn't say why it wanted to initiate a pool-to-pool transfer. The log entry is silent on why pool-manager didn't simply select the pool with this file's replica.
We may need to fix this lack of information (in the log entry) problem, so the underlying problem may be understood and fixed.
FYI, this issue is being trace on the dcache support RT system, 10374.
older duplicate, now tracked on RT