neon
neon copied to clipboard
In progress attach/detach operations are lost during pageserver restart
Can be solved with additional flags in timeline metadata.
The problem happens when pageserver answers with ok to detach or attach and then restarts before the operation succeeds
see https://github.com/neondatabase/neon/pull/2239
Attach is no longer lost, still relevant for detach
@ololobus do you retry these API calls ? what is the policy from the control plane on such failures ?
do you retry these API calls ? what is the policy from the control plane on such failures ?
- Detach tenant is used now when we delete project in the console. And actually we even ignore 500s now (see [1]), as previously pageserver was responding with 500 if tenant is absent locally. But this could be a retry of tenant deletion operation, so we had no option to distinguish whether it's tenant not found or some actual internal error. BTW, is it still the case? If pageserver can respond with 404 now if there is no tenant locally, then we can change and retry all non 200 and 404 only
- Attach is only used in the migration code, and as far as I see it will retry everything non-202. See [2]. Existing code for migration is mostly broken and doesn't take into account many corner-cases like presence of many active computes. I have an issue for refactoring this control-plane code https://github.com/neondatabase/cloud/issues/3113. I think we can start working on it in parallel with planed Dmitry's changes on the pageserver, but it seems that we should sync with you and Dmitry first
[1] https://github.com/neondatabase/cloud/blob/5de42edae8adf572985268d82b39919b1360a012/goapp/internal/client/psclient/httppageserver/httppageserver.go#L259-L273 [2] https://github.com/neondatabase/cloud/blob/5de42edae8adf572985268d82b39919b1360a012/goapp/internal/client/psclient/httppageserver/httppageserver.go#L329-L333
- Yep, this is still the case:
> curl -v -XPOST http://localhost:9898/v1/tenant/1acf99ebac07d01ea7abf3bd104e8e58/detach
* Trying 127.0.0.1:9898...
* Connected to localhost (127.0.0.1) port 9898 (#0)
> POST /v1/tenant/1acf99ebac07d01ea7abf3bd104e8e58/detach HTTP/1.1
> Host: localhost:9898
> User-Agent: curl/7.85.0
> Accept: */*
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 500 Internal Server Error
< content-type: application/json
< x-request-id: 4d355c67-035e-46fb-9602-6225e41e4ecf
< pageserver_launch_timestamp: 2023-03-14 15:15:56.572595838 UTC
< content-length: 66
< date: Tue, 14 Mar 2023 15:16:21 GMT
<
* Connection #0 to host localhost left intact
{"msg":"Tenant not found for id 1acf99ebac07d01ea7abf3bd104e8e58"}
So we need to fix this first.
- Attach wont be lost, this is already done. Removing attach from the title to reflect that
- Attach wont be lost, this is already done.
How pageserver reacts on the repeated attach? Is it the same 202 or 500 or something else?
It will be 409 Conflict: Tenant X already exists
@ololobus this is now fixed, so for a non existent tenant the response will be 404 which shouldn't be retried, but other 5xx should
@shanyp thanks! Created https://github.com/neondatabase/cloud/issues/4368 to do not forget
Summary of what happened:
- #3818 added distinguished response for detach & delete
- https://github.com/neondatabase/cloud/pull/4391 uses these distinguished responses
- This closes the issue that @ololobus created on Mar 20
So, the following race conditions around delete and detach have been fixed since then
sequenceDiagram
Console -->> PS: detach
activate PS
Note right of PS: deletes all on-disk state
Note right of PS: crashes / restarts
deactivate PS
Console -->> PS: retries detach
activate PS
PS -->> Console: 500 (now: 404)
deactivate PS
sequenceDiagram
Console -->> PS: detach
activate PS
Note right of PS: does all the detach work
PS --x Console: 200
deactivate PS
Note right of Console: response gets lost
Note left of Console: console retries
Console -->> PS: retries detach
activate PS
PS -->> Console: 500 (now: 404)
deactivate PS
I agree with the analysis. It seems that the main issue was the absence of retried.
However, I'm not sure that we can safely continue from interrupted detach operation. I e:
Console calls delete, pageserver starts to delete files, then crashes and restarts. Because file deletion order is not specified and there no special detach marker which can prevent pageserver from loading this tenant during startup we can have half broken tenant that wont be deleted by following retry from console.
We;ll have mark file for deletion of a tenant as described in the RFC. See https://github.com/zenithdb/zenith/blob/4158e24e60d294e0f039395ea95dd87f8ab317d9/docs/rfcs/022-pageserver-delete-from-s3.md#L76
I think we can close this issue, I can create another one in deletion project that mentions specifically out of order deletions. WDYT?
Console calls delete, pageserver starts to delete files, then crashes and restarts. Because file deletion order is not specified and there no special detach marker which can prevent pageserver from loading this tenant during startup we can have half broken tenant that wont be deleted by following retry from console.
We;ll have mark file for deletion of a tenant as described in the RFC. See https://github.com/zenithdb/zenith/blob/4158e24e60d294e0f039395ea95dd87f8ab317d9/docs/rfcs/022-pageserver-delete-from-s3.md#L76
Yeah, that's an issue but it's orthogonal to this one.
Closing this one. Please create a new issue and add it to your deletion project.