rippled icon indicating copy to clipboard operation
rippled copied to clipboard

RPC tooBusy response now has 503 HTTP status code:

Open scottschurr opened this issue 3 years ago • 19 comments

High Level Overview of Change

Fixes #4005

Makes it possible for internal RPC Error Codes to associate themselves with a non-OK (200) HTTP status code. This commit also adds non-OK HTTP responses to a few additional RPC error responses.

Context of Change

Issue #4005 notes that returning an HTTP 200 (OK) status code when the server is overloaded is misleading for load balancers. This proposed change allows each RPC error response to be associated with its own HTTP status code.

If an explicit HTTP status code is not specified, then 200 (OK) is assumed. This preserves much of the status quo since prior to this change a 200 (OK) status code was returned in all cases.

The pull request includes a best guess at other cases where a non-200 status code should be returned. Reviewers are requested to look those over carefully and recommend changes. Thanks.

Type of Change

  • [x] Potentially breaking change (fix or feature that could cause existing functionality to not work as expected).

The change is potentially breaking in cases where servers are expecting to get a 200 (OK) status code back since they will no longer get that status code in certain error cases. It's unknown whether that problem would occur in practice.

There is no amendment associated with the change, since it does not change transaction results. However the change does affect the API. So care is justified.

Before / After

I hacked the server_info command to return an rpc_TOO_BUSY error, then I used the following curl command:

#!/bin/bash
# Use curl to make a server_info request
curl -i -d '{ "method": "server_info", "params": [ {"ripplerpc": "1.0"} ] }' -H 'Content-Type: application/json' -X POST http://127.0.0.1:5005/

Without the change I get the following response:

HTTP/1.1 200 OK
Date: Mon, 11 Apr 2022 17:53:55 +0000
Connection: Keep-AliveContent-Length: 195
Content-Type: application/json; charset=UTF-8
Server: ripple-json-rpc/rippled-1.9.0+7c66747d27869f9f3c96617bd4227038f1fa92b8.DEBUG

{"result":{"error":"tooBusy","error_code":9,"error_message":"The server is too busy to help you now.","request":{"command":"server_info","ripplerpc":"1.0"},"status":"error"},"ripplerpc":"1.0"}

However with the change I get the following response (note the changed status code):

HTTP/1.1 503 Server is overloaded
Date: Mon, 11 Apr 2022 18:03:41 +0000
Connection: Keep-AliveContent-Length: 195
Content-Type: application/json; charset=UTF-8
Server: ripple-json-rpc/rippled-1.9.0+7c66747d27869f9f3c96617bd4227038f1fa92b8.DEBUG

{"result":{"error":"tooBusy","error_code":9,"error_message":"The server is too busy to help you now.","request":{"command":"server_info","ripplerpc":"1.0"},"status":"error"},"ripplerpc":"1.0"}

Test Plan

The rippled testing infrastructure currently ignores the returned HTTP status code. I could figure out how to fix that. But I'd like to suggest that we content ourselves with integration tests. Potentially the integration tests could use curl commands similar to what I used for manual testing. However we'd need to actually overload the server, rather than hacking a command to return an inaccurate result.

More thoughts are appreciated regarding testing.

Reviewers

I'd appreciate review and comments from @WietseWind who wrote the initial issue.

scottschurr avatar Apr 11 '22 18:04 scottschurr

Thank you @scottschurr! Great solution, being able to configure alternative error codes per error.

I see that you already added error codes for the most important errors/codes. Going through the list I feel there are many more eligible for (mostly) 400 / 401 / 403 / 500 HTTP status codes (e.g. rpcFORBIDDEN, rpcNOT_READY, rpcNO_NETWORK, rpcJSON_RPC, rpcNO_PERMISSION).

Question is where to draw the line between something the client asked for (response still OK, 200, to be expected) and where it's a response that should explicitly inform the client that something is off.

What originally caused me to post #4005 is the load balancer auto-failover use case. I feel that's where we should draw the line: if a query fails in the 4xx/5xx range and the exact same query is likely to have a positive (200) outcome on another node, it should return a non-200 status code.

If you agree I'll go through the list and add some more 4xx/5xx status codes. But other than adding some status codes: perfect. Thanks again!

WietseWind avatar Apr 11 '22 23:04 WietseWind

@WietseWind, thanks for your comment. I'm not really a load balancer kind of guy. I mostly keep my nose in the C++. So I'm happy to use your suggestions for which status codes should be returned. Then I'll rely on @cjcobb23 and @intelliot to verify that they look right from their perspectives.

So, yes, please go through the list and add some more 4xx/5xx status codes. I'll also add a comment regarding how the line was drawn between 200 and other. Thanks!

scottschurr avatar Apr 12 '22 01:04 scottschurr

@WietseWind, I'm hoping you'll have time to indicate where you'd like additional HTTP status codes added. Is there anything you need from me to push that forward? Thanks.

scottschurr avatar May 11 '22 17:05 scottschurr

Hey @WietseWind, on April 11th you noted:

If you agree I'll go through the list and add some more 4xx/5xx status codes.

That's great and would be appreciated. But you've had five weeks now to do that. I know you're really busy. If you can get back to me within the next week to at least let me know when you can get around to identifying those additional status codes I'd really appreciate it.

If I don't hear back from you within the week I'll assume you're too busy to make the assessment right now and we can commit the HTTP status code changes that are currently in the pull request. If that should happen, then you can make a later pull request to put in the additional status codes.

Thanks for the help.

scottschurr avatar May 18 '22 20:05 scottschurr

Hi @scottschurr, sorry for that. Will do right now, lost track of this thread. Thanks for the reminder.

I propose the codes below, where a non-standard HTTP code is only returned if one can expect (e.g. when submitting against a pool) another node to respond differently (first call: error, second call, other machine: OK) (except for calls needing auth).

https://gist.github.com/WietseWind/661776d0b4c849bfa5f237101104e02a

WietseWind avatar May 19 '22 11:05 WietseWind

Thanks for the help @WietseWind. I've added a commit with the extra status codes and a bit more code that was required to make the new status codes work. I think the status codes are in their final state for this pull request. It would be great if the reviewers would look it over and let me know if there are other things that need fixing. Thanks.

scottschurr avatar May 20 '22 03:05 scottschurr

@scottschurr Thanks, appreciate it :D

WietseWind avatar May 20 '22 07:05 WietseWind

Ping @cjcobb23 and @mDuo13. How do you folks feel about this pull request in its current state? Do you want any changes? Thanks.

scottschurr avatar Jun 15 '22 22:06 scottschurr

Makes sense @mDuo13, thanks for chiming in.

Thoughts regarding the version flag: if that same flag would also change the behaviour of other version related options, one would never be able to use the previous (current, default) version of the API but with the response codes that allows for failover in reverse proxy environments (for example). Not ideal from a consistency standpoint, but could it be worth triggering this behaviour based on a HTTP request header?

WietseWind avatar Jul 01 '22 19:07 WietseWind

Thoughts regarding the version flag: if that same flag would also change the behaviour of other version related options, one would never be able to use the previous (current, default) version of the API but with the response codes that allows for failover in reverse proxy environments (for example). Not ideal from a consistency standpoint, but could it be worth triggering this behaviour based on a HTTP request header?

I recommend having HTTP status code changes be the the only change in "api_version": 3.

Using a special HTTP request header isn't worth the hassle in code, documentation, or increased number of special cases. Version numbers are free* and we will not run out. Let's use them.

I have updated XLS-22d API Versioning to note that changing status codes is a breaking change.

*(Thanks to @thejohnfreeman for this link.)

intelliot avatar Jul 02 '22 00:07 intelliot

Note, I'm not saying that this is a perfect solution or avoids all potential problems. We are just trying to make the best trade-off given our goals and constraints. Using "api_version": 3 will require clients to adapt to the change that was made for "api_version": 2:

https://github.com/XRPLF/rippled/pull/3770 - this is also an example of how a change is gated behind api_version / apiVersion in code

I think clients will have to adapt anyway, because clients may have existing code that expects certain status codes. When they are ready to gain the benefits of automatic failover at the reverse proxy layer, then they should audit / test their code to ensure that they comply with the latest API, including both the new status codes and the signer_list location.

intelliot avatar Jul 02 '22 00:07 intelliot

I'm not too sure if HTTP return codes should reflect the underlying JSON-RPC calls, but the spec there is surprisingly sparse...

MarkusTeufelberger avatar Jul 03 '22 20:07 MarkusTeufelberger

I've pushed a commit that adds the HTTP status codes suggested by @mDuo13. It also gates returning this status codes based on the "ripplerpc" field of the original request being "3.0" or greater.

Let me know what y'all think.

scottschurr avatar Jul 08 '22 22:07 scottschurr

I think it's fine to have this use "ripplerpc": "3.0". In fact, that might actually be preferable; I hadn't thought of it as an option before. ripplerpc affects HTTP, but not WebSockets. And that's good since HTTP status codes don't apply to WebSockets anyway.

intelliot avatar Jul 08 '22 23:07 intelliot

I would wait for confirmation from @WietseWind before merging, though :)

intelliot avatar Jul 08 '22 23:07 intelliot

Also, it's important that this gets documented on xrpl.org. We should also document the difference between ripplerpc 1.0 and 2.0, if we haven't already.

intelliot avatar Jul 08 '22 23:07 intelliot

Rebased to 1.9.2. @mDuo13 and @WietseWind, are you good with this version of the code? Thanks.

scottschurr avatar Aug 04 '22 23:08 scottschurr

Rebased to 1.9.2. @mDuo13 and @WietseWind, are you good with this version of the code? Thanks.

Very happy with it :) Thank you!

WietseWind avatar Aug 05 '22 23:08 WietseWind

Thanks @mDuo13. If you think these changes are ready to be committed, could you please mark the pull request as approved? But, since you want more time to research the "2.0" case, then you may want to withhold approval until that research is complete. That's fine also.

@cjcobb23, would you like to weigh in on these changes as well? They may have some consequences in Clio.

scottschurr avatar Aug 22 '22 17:08 scottschurr

Thanks @mDuo13! @cjcobb23, it's now your call.

scottschurr avatar Nov 18 '22 21:11 scottschurr

Thanks for the review @cjcobb23! I've squashed and rebased the commit and marked it as passed.

scottschurr avatar Nov 29 '22 19:11 scottschurr