neon In progress attach/detach operations are lost during pageserver restart

Can be solved with additional flags in timeline metadata.

The problem happens when pageserver answers with ok to detach or attach and then restarts before the operation succeeds

Aug 09 '22 17:08 LizardWizzard

see https://github.com/neondatabase/neon/pull/2239

Aug 09 '22 19:08 hlinnaka

Attach is no longer lost, still relevant for detach

Feb 27 '23 22:02 LizardWizzard

@ololobus do you retry these API calls ? what is the policy from the control plane on such failures ?

Mar 14 '23 12:03 shanyp

do you retry these API calls ? what is the policy from the control plane on such failures ?

Detach tenant is used now when we delete project in the console. And actually we even ignore 500s now (see [1]), as previously pageserver was responding with 500 if tenant is absent locally. But this could be a retry of tenant deletion operation, so we had no option to distinguish whether it's tenant not found or some actual internal error. BTW, is it still the case? If pageserver can respond with 404 now if there is no tenant locally, then we can change and retry all non 200 and 404 only
Attach is only used in the migration code, and as far as I see it will retry everything non-202. See [2]. Existing code for migration is mostly broken and doesn't take into account many corner-cases like presence of many active computes. I have an issue for refactoring this control-plane code https://github.com/neondatabase/cloud/issues/3113. I think we can start working on it in parallel with planed Dmitry's changes on the pageserver, but it seems that we should sync with you and Dmitry first

[1] https://github.com/neondatabase/cloud/blob/5de42edae8adf572985268d82b39919b1360a012/goapp/internal/client/psclient/httppageserver/httppageserver.go#L259-L273 [2] https://github.com/neondatabase/cloud/blob/5de42edae8adf572985268d82b39919b1360a012/goapp/internal/client/psclient/httppageserver/httppageserver.go#L329-L333

Mar 14 '23 13:03 ololobus

Yep, this is still the case:

> curl -v -XPOST http://localhost:9898/v1/tenant/1acf99ebac07d01ea7abf3bd104e8e58/detach
*   Trying 127.0.0.1:9898...
* Connected to localhost (127.0.0.1) port 9898 (#0)
> POST /v1/tenant/1acf99ebac07d01ea7abf3bd104e8e58/detach HTTP/1.1
> Host: localhost:9898
> User-Agent: curl/7.85.0
> Accept: */*
> 
* Mark bundle as not supporting multiuse
< HTTP/1.1 500 Internal Server Error
< content-type: application/json
< x-request-id: 4d355c67-035e-46fb-9602-6225e41e4ecf
< pageserver_launch_timestamp: 2023-03-14 15:15:56.572595838 UTC
< content-length: 66
< date: Tue, 14 Mar 2023 15:16:21 GMT
< 
* Connection #0 to host localhost left intact
{"msg":"Tenant not found for id 1acf99ebac07d01ea7abf3bd104e8e58"}

So we need to fix this first.

Attach wont be lost, this is already done. Removing attach from the title to reflect that

Mar 14 '23 15:03 LizardWizzard

Attach wont be lost, this is already done.

How pageserver reacts on the repeated attach? Is it the same 202 or 500 or something else?

Mar 14 '23 15:03 ololobus

It will be 409 Conflict: Tenant X already exists

Mar 14 '23 15:03 LizardWizzard

@ololobus this is now fixed, so for a non existent tenant the response will be 404 which shouldn't be retried, but other 5xx should

Mar 20 '23 12:03 shanyp

@shanyp thanks! Created https://github.com/neondatabase/cloud/issues/4368 to do not forget

Mar 20 '23 16:03 ololobus

Summary of what happened:

#3818 added distinguished response for detach & delete
https://github.com/neondatabase/cloud/pull/4391 uses these distinguished responses
- This closes the issue that @ololobus created on Mar 20

So, the following race conditions around delete and detach have been fixed since then

sequenceDiagram
Console -->> PS: detach
activate PS
Note right of PS: deletes all on-disk state
Note right of PS: crashes / restarts 
deactivate PS
Console -->> PS: retries detach
activate PS
PS -->> Console: 500 (now: 404)
deactivate PS

sequenceDiagram
Console -->> PS: detach
activate PS
Note right of PS: does all the detach work
PS --x Console: 200
deactivate PS
Note right of Console: response gets lost
Note left of Console: console retries

Console -->> PS: retries detach
activate PS
PS -->> Console: 500 (now: 404)
deactivate PS

May 19 '23 15:05 problame

I agree with the analysis. It seems that the main issue was the absence of retried.

However, I'm not sure that we can safely continue from interrupted detach operation. I e:

Console calls delete, pageserver starts to delete files, then crashes and restarts. Because file deletion order is not specified and there no special detach marker which can prevent pageserver from loading this tenant during startup we can have half broken tenant that wont be deleted by following retry from console.

We;ll have mark file for deletion of a tenant as described in the RFC. See https://github.com/zenithdb/zenith/blob/4158e24e60d294e0f039395ea95dd87f8ab317d9/docs/rfcs/022-pageserver-delete-from-s3.md#L76

I think we can close this issue, I can create another one in deletion project that mentions specifically out of order deletions. WDYT?

May 19 '23 15:05 LizardWizzard

Console calls delete, pageserver starts to delete files, then crashes and restarts. Because file deletion order is not specified and there no special detach marker which can prevent pageserver from loading this tenant during startup we can have half broken tenant that wont be deleted by following retry from console.

We;ll have mark file for deletion of a tenant as described in the RFC. See https://github.com/zenithdb/zenith/blob/4158e24e60d294e0f039395ea95dd87f8ab317d9/docs/rfcs/022-pageserver-delete-from-s3.md#L76

Yeah, that's an issue but it's orthogonal to this one.

Closing this one. Please create a new issue and add it to your deletion project.

May 19 '23 16:05 problame

neon neon copied to clipboard

In progress attach/detach operations are lost during pageserver restart

neon
neon copied to clipboard