Webdav response when pool is unavailable

Open ageorget opened this issue 2 years ago • 2 comments

Hi,

When a dCache pool is unavailable (offline for maintenance), the WebDAV door log fills with error messages indicating that the door timed out trying to contact PoolManager:

07 Mar 2022 11:55:17 (webdav-ccdcatli367) [door:webdav-ccdcatli367@webdav-ccdcatli367Domain:AAXZno41EYA] Internal server error: org.dcache.webdav.WebDavException: Request to [>PoolManager@dCacheDomain] timed out.

level=ERROR ts=2022-03-07T11:55:17.132+0100 event=org.dcache.webdav.request request.method=GET request.url=https://ccdavatlas.in2p3.fr:2880/atlasdatadisk/rucio/mc21_13p6TeV/fa/90/HITS.27681140._039474.pool.root.1 response.code=500 socket.remote=[2605:9a00:10:200a:a236:9fff:feb5:1a4]:45394 user-agent=dCache/6.2.33 user.mapped=3327:124[oG7WnQ/8:0ynDS5xo] transaction=door:webdav-ccdcatli367@webdav-ccdcatli367Domain:AAXZno41EYA:1646649897467000

This is confusing because I first thought there was a communication issue between the door and PoolManager. But these messages all correspond to requests from clients trying to access files on the unavailable pool. Is it expected that PoolManager doesn't tell the WebDAV door that the file is unavailable, instead of letting the WebDAV request fail with a timeout?

Another thing: isn't HTTP response code 503 more appropriate than 500 for this kind of error? "503 Service Unavailable: The server is currently unable to handle the request due to a temporary overloading or maintenance of the server."

Adrien

ageorget, Mar 07 '22 12:03

Hi Adrien,

I agree, this is rather annoying.

To clarify what's happening here: the door asks pool-manager "give me a pool from which my user can read this file". Currently, pool-manager will only[*] reply to this message when it has satisfied the request, that is, when there's a pool from which the file's data may be read. Depending on the circumstances, this might trigger staging data back from tape and/or pool-to-pool transfers, so it could take some time. The door simply waits for the response.

[*] -- there are some odd-ball very obscure situations where pool-manager will return with an error, but those don't apply in this case.

Pool manager knows when a pool is down. If that pool has the only replica of the file then it will wait for that pool to come back up again before replying.

For some protocols (e.g., dcap) that works fine. The client will wait indefinitely for dCache to provide the pool.

However, for other protocols (e.g., WebDAV, xroot) the client won't wait forever and will time out if the door doesn't reply in time. The door must reply to the client before that client timeout, so the door imposes its own timeout on the pool-manager response. If there isn't a response in time, the door behaves (as you describe above) as if the message was lost.
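The door-side timeout described above can be illustrated with a minimal Java sketch. It is not dCache's actual code: the class `DoorTimeoutSketch`, the `Outcome` enum and the `awaitPool` method are hypothetical names, and the pool-manager reply is simulated with a plain `CompletableFuture`.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch: none of these names are dCache classes.
class DoorTimeoutSketch {

    /** What the door can conclude after waiting for the reply. */
    enum Outcome { POOL_SELECTED, TIMED_OUT, FAILED }

    /**
     * Waits for the (simulated) pool-manager reply, but never longer than
     * the door's own deadline, which must be shorter than the client's.
     */
    static Outcome awaitPool(CompletableFuture<String> poolManagerReply,
                             long timeout, TimeUnit unit) {
        try {
            poolManagerReply.get(timeout, unit);
            return Outcome.POOL_SELECTED;
        } catch (TimeoutException e) {
            // No reply at all: the door cannot tell "pool offline"
            // apart from a lost message.
            return Outcome.TIMED_OUT;
        } catch (ExecutionException e) {
            return Outcome.FAILED;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return Outcome.FAILED;
        }
    }

    public static void main(String[] args) {
        // A pool is available: the reply arrives immediately.
        CompletableFuture<String> ready =
                CompletableFuture.completedFuture("pool-a");
        System.out.println(awaitPool(ready, 100, TimeUnit.MILLISECONDS));

        // The only replica is on an offline pool: the reply never arrives.
        CompletableFuture<String> never = new CompletableFuture<>();
        System.out.println(awaitPool(never, 100, TimeUnit.MILLISECONDS));
    }
}
```

In the second case the only information the door has is that the deadline passed, which is why the log entry reads like a communication failure.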

One way to solve this problem would be for pool-manager to reply to the door that it couldn't identify a pool with the file's data in time, perhaps also giving a reason why not. The door could then handle this better: it might refrain from logging it or, if it does log it, the entry would contain more helpful information, and so on.
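A rough sketch of what such a reply could look like on the door side is below. Everything in it is an assumption made for illustration: the `Reply` type, the `Reason` values and the rule for which reasons get logged are hypothetical and do not reflect dCache's real message classes or logging policy.

```java
import java.util.Optional;

// Hypothetical sketch: these types are illustrative, not dCache messages.
class PoolSelectionReplySketch {

    /** Illustrative reasons why no pool could be selected in time. */
    enum Reason { POOL_OFFLINE, STAGING_IN_PROGRESS }

    /** Either a selected pool or the reason none was available in time. */
    static final class Reply {
        final String pool;    // non-null when a pool was selected
        final Reason reason;  // non-null when no pool was available

        private Reply(String pool, Reason reason) {
            this.pool = pool;
            this.reason = reason;
        }

        static Reply selected(String pool) {
            return new Reply(pool, null);
        }

        static Reply unavailable(Reason reason) {
            return new Reply(null, reason);
        }
    }

    /**
     * Builds the door's log entry. An empty result means the door
     * chooses not to log anything (an assumed policy, for illustration).
     */
    static Optional<String> logMessage(String path, Reply reply) {
        if (reply.pool != null) {
            return Optional.empty();
        }
        if (reply.reason == Reason.POOL_OFFLINE) {
            return Optional.of("No online pool holds a replica of " + path);
        }
        // Staging still running is expected behaviour, so stay quiet.
        return Optional.empty();
    }

    public static void main(String[] args) {
        Reply reply = Reply.unavailable(Reason.POOL_OFFLINE);
        logMessage("/data/file.root", reply).ifPresent(System.out::println);
    }
}
```

With an explicit reason, the log entry names the actual cause instead of reporting a timeout towards pool-manager.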

dCache actually has most of the pieces needed to support this mode of operation. I started working on this, but pool-manager is rather old code and we have to tread very carefully, to make sure we don't break things. Unfortunately, I'm focusing on other things currently, so cannot say when this will be fixed.

On 503 vs 500 status code, I don't have a strong opinion.

Certainly the 503 status code supports the 'Retry-After' response header, which might be useful if we knew when a subsequent request was likely to succeed. I'm not sure that we do. The RFC says that, in the absence of a Retry-After header, the client should treat a 503 as if it were a 500.

Also, I'm not sure if 503 indicates the problem is with an individual resource (a file) or the entire server. Would a client, on receiving 503 and Retry-After 5 minutes, refrain from sending any requests (for any file) for five minutes?
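To make the 503 option concrete, here is a minimal sketch of a response that carries 'Retry-After' only when an estimate exists. The `UnavailableResponseSketch` class and its `fileUnavailable` method are hypothetical, not part of dCache or any HTTP library; the only fact relied on is that 'Retry-After' accepts a delay as a non-negative number of seconds.

```java
import java.time.Duration;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch: not dCache code and not a real HTTP library API.
class UnavailableResponseSketch {

    /** Minimal stand-in for an HTTP response: status plus headers. */
    static final class Response {
        final int status;
        final Map<String, String> headers;

        Response(int status, Map<String, String> headers) {
            this.status = status;
            this.headers = headers;
        }
    }

    /**
     * Builds a 503 response. Retry-After is added only if the caller
     * has an estimate of when a retry is likely to succeed.
     */
    static Response fileUnavailable(Optional<Duration> retryAfter) {
        Map<String, String> headers = new LinkedHashMap<>();
        retryAfter.ifPresent(d -> headers.put(
                "Retry-After",
                // delay-seconds form: a non-negative integer
                Long.toString(Math.max(0, d.getSeconds()))));
        return new Response(503, headers);
    }

    public static void main(String[] args) {
        Response known = fileUnavailable(Optional.of(Duration.ofMinutes(5)));
        System.out.println(known.status + " " + known.headers);

        Response unknown = fileUnavailable(Optional.empty());
        System.out.println(unknown.status + " " + unknown.headers);
    }
}
```

The sketch says nothing about how a client interprets the scope of the 503; that open question is unaffected by how the header is built.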

Cheers, Paul.

paulmillar, Mar 07 '22 13:03

Hi Paul,

Thanks a lot for your exhaustive answer, as always! Good to know that you have already started working on this. Since it only leads to confusion for a dCache administrator, it's not an urgent issue.

ageorget, Mar 07 '22 13:03