TPC pull transfer fails with Transfer failure: Failed to select pool: No pool candidates available/configured/left for stage\n

Open cfgamboa opened this issue 2 years ago • 1 comments

Dear all

We have observed that a file staged from tape to disk was not healthy. When verifying the file's checksum, dCache marked it as BAD.

(dc224_38@dc224thirtyeightDomain) admin > csm status
FullScan Idle  0 files: 0 corrupt, 0 unable to check
SingleScan Idle   00003FE114D521614F2F8700BCF3DC07EF81 BAD File = [1:84d1fb0a] Expected = [1:6d208d8d]
Scrubber Idle  processed 0 of 0 files: 0 corrupt, 0 unable to check
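
For anyone who wants to cross-check such a replica outside dCache, here is a minimal Python sketch. It assumes the `1:` prefix in the output above denotes Adler-32 (the output itself does not name the algorithm), and the helper names are hypothetical:

```python
import zlib

def adler32_hex(path, chunk_size=1 << 20):
    """Return the Adler-32 of a file as 8 lowercase hex digits."""
    value = 1  # Adler-32 starting value
    with open(path, "rb") as f:
        # Read in chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            value = zlib.adler32(chunk, value)
    return f"{value & 0xFFFFFFFF:08x}"

def matches_expected(path, expected_hex):
    """Compare the on-disk checksum with the expected (catalogue) value."""
    return adler32_hex(path) == expected_hex.lower()
```

Calling `matches_expected(path, "6d208d8d")` on the replica's data file would return False for a file in the state shown above, under the Adler-32 assumption.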

The TPC attempt to retrieve this file failed with the following message:

TRANSFER_FAILED:TRANSFER ERROR: Copy failed (3rd pull, 3rd push). Last attempt: Transfer failure: Failed to select pool: No pool candidates available/configured/left for stage\n

After cleaning the file and re-staging it the transfer succeeded.

Is it possible to make the TPC error indicate that the underlying problem is with the staged file?

All the best, Carlos

cfgamboa avatar May 01 '23 13:05 cfgamboa

Just to add some info here.

What's happening is that pool-manager was asked to identify a pool from which the transfer would be possible, and it failed to do so.

Pool-manager has a number of different strategies for identifying the transfer pool. Sometimes it just involves identifying the pool with that file's data. In other cases it might involve interacting with the pools, possibly multiple times. For example, pool-manager might need to stage the file back from tape, and then (after the file is staged) transfer the data from a "staging" pool to a "transfer/read" pool.
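
To make the idea of successive strategies concrete, here is a rough Python sketch. It is a simplified model of the description above, not pool-manager's actual code (dCache is written in Java); `FileState`, `select_read_pool` and the pool names are all hypothetical:

```python
from dataclasses import dataclass, field

@dataclass
class FileState:
    """Hypothetical view of where a file's data currently lives."""
    online_pools: list = field(default_factory=list)  # pools holding a replica
    on_tape: bool = False

def select_read_pool(state, read_pools, stage_pools):
    """Return (pool, steps) for a read, or (None, steps) if nothing works."""
    steps = []
    # Strategy 1: a replica already sits on a pool that may serve the read.
    for pool in state.online_pools:
        if pool in read_pools:
            steps.append(f"read directly from {pool}")
            return pool, steps
    # Strategy 2: a replica exists elsewhere; copy it to a read pool.
    if state.online_pools and read_pools:
        steps.append(f"pool-to-pool copy {state.online_pools[0]} -> {read_pools[0]}")
        return read_pools[0], steps
    # Strategy 3: stage from tape, then copy to a read pool if needed.
    if state.on_tape and stage_pools:
        staged = stage_pools[0]
        steps.append(f"stage from tape onto {staged}")
        if staged in read_pools:
            return staged, steps
        if read_pools:
            steps.append(f"pool-to-pool copy {staged} -> {read_pools[0]}")
            return read_pools[0], steps
    # Every strategy was exhausted without finding a usable pool.
    return None, steps
```

In this model, a file that exists only on tape with no staging pool available falls through all three strategies, which is the kind of situation the error message reports.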

The error message No pool candidates available/configured/left for stage indicates that pool-manager has given up on its job of identifying a pool from which the file may be read. The wording is rather cryptic, and (somewhat deliberately) it doesn't provide a detailed list of all the things pool-manager tried that didn't work.

Of course, it would be technically possible to give a more precise error message, but in order to cover all the edge-cases, I suspect that precise error message would need to list all the things pool-manager tried (the different strategies) and why each of them failed: simply listing the last failure might not be sufficient to understand what went wrong.

That would make for a rather long error message (which is already quite long!).

So, while I agree that there's not enough information in the message to diagnose the problem, I'm also wondering whether a longer error message (returned to the user) would really be the best approach, or whether providing the information through some other reporting channel might be better. For example, pool-manager could write a log entry for such failures that provides a detailed description of what it tried and why each strategy failed.
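
A rough sketch of that logging idea, in Python purely for illustration. Nothing here is a dCache API: the logger name, `NoPoolAvailable` and `select_pool` are hypothetical, and each strategy is assumed to be a function that returns a pool name or raises an exception:

```python
import logging

log = logging.getLogger("poolmanager.selection")  # hypothetical logger name

class NoPoolAvailable(Exception):
    """Raised with a short, user-facing message only."""

def select_pool(pnfsid, strategies):
    """Try each (name, func) strategy in order; return the first pool found."""
    failures = []
    for name, func in strategies:
        try:
            return func(pnfsid)
        except Exception as exc:
            # Remember why this strategy failed instead of discarding it.
            failures.append((name, str(exc)))
    # Operators get the full trail of attempts in the log...
    log.warning("pool selection failed for %s: %s", pnfsid,
                "; ".join(f"{name}: {reason}" for name, reason in failures))
    # ...while the client still sees only the short message.
    raise NoPoolAvailable("Failed to select pool")
```

The point of the split is that the message returned to the client stays short, while the log entry lists every strategy and its failure reason, so a bad checksum on a staged replica would at least be visible to the site admin.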

paulmillar avatar May 04 '23 08:05 paulmillar