rippled
rippled copied to clipboard
RPC tooBusy response now has 503 HTTP status code:
High Level Overview of Change
Fixes #4005
Makes it possible for internal RPC Error Codes to associate themselves with a non-OK (200) HTTP status code. This commit also adds non-OK HTTP responses to a few additional RPC error responses.
Context of Change
Issue #4005 notes that returning an HTTP 200 (OK) status code when the server is overloaded is misleading for load balancers. This proposed change allows each RPC error response to be associated with its own HTTP status code.
If an explicit HTTP status code is not specified, then 200 (OK) is assumed. This preserves much of the status quo since prior to this change a 200 (OK) status code was returned in all cases.
The pull request includes a best guess at other cases where a non-200 status code should be returned. Reviewers are requested to look those over carefully and recommend changes. Thanks.
Type of Change
- [x] Potentially breaking change (fix or feature that could cause existing functionality to not work as expected).
The change is potentially breaking in cases where servers are expecting to get a 200 (OK) status code back since they will no longer get that status code in certain error cases. It's unknown whether that problem would occur in practice.
There is no amendment associated with the change, since it does not change transaction results. However the change does affect the API. So care is justified.
Before / After
I hacked the server_info command to return an rpc_TOO_BUSY error, then I used the following curl command:
#!/bin/bash
# Use curl to make a server_info request
curl -i -d '{ "method": "server_info", "params": [ {"ripplerpc": "1.0"} ] }' -H 'Content-Type: application/json' -X POST http://127.0.0.1:5005/
Without the change I get the following response:
HTTP/1.1 200 OK
Date: Mon, 11 Apr 2022 17:53:55 +0000
Connection: Keep-AliveContent-Length: 195
Content-Type: application/json; charset=UTF-8
Server: ripple-json-rpc/rippled-1.9.0+7c66747d27869f9f3c96617bd4227038f1fa92b8.DEBUG
{"result":{"error":"tooBusy","error_code":9,"error_message":"The server is too busy to help you now.","request":{"command":"server_info","ripplerpc":"1.0"},"status":"error"},"ripplerpc":"1.0"}
However with the change I get the following response (note the changed status code):
HTTP/1.1 503 Server is overloaded
Date: Mon, 11 Apr 2022 18:03:41 +0000
Connection: Keep-AliveContent-Length: 195
Content-Type: application/json; charset=UTF-8
Server: ripple-json-rpc/rippled-1.9.0+7c66747d27869f9f3c96617bd4227038f1fa92b8.DEBUG
{"result":{"error":"tooBusy","error_code":9,"error_message":"The server is too busy to help you now.","request":{"command":"server_info","ripplerpc":"1.0"},"status":"error"},"ripplerpc":"1.0"}
Test Plan
The rippled testing infrastructure currently ignores the returned HTTP status code. I could figure out how to fix that. But I'd like to suggest that we content ourselves with integration tests. Potentially the integration tests could use curl commands similar to what I used for manual testing. However we'd need to actually overload the server, rather than hacking a command to return an inaccurate result.
More thoughts are appreciated regarding testing.
Reviewers
I'd appreciate review and comments from @WietseWind who wrote the initial issue.
Thank you @scottschurr! Great solution, being able to configure alternative error codes per error.
I see that you already added error codes for the most important errors/codes. Going through the list I feel there are many more eligible for (mostly) 400 / 401 / 403 / 500 HTTP status codes (e.g. rpcFORBIDDEN, rpcNOT_READY, rpcNO_NETWORK, rpcJSON_RPC, rpcNO_PERMISSION).
Question is where to draw the line between something the client asked for (response still OK, 200, to be expected) and where it's a response that should explicitly inform the client that something is off.
What originally caused me to post #4005 is the load balancer auto-failover use case. I feel that's where we should draw the line: if a query fails in the 4xx/5xx range and the exact same query is likely to have a positive (200) outcome on another node, it should return a non-200 status code.
If you agree I'll go through the list and add some more 4xx/5xx status codes. But other than adding some status codes: perfect. Thanks again!
@WietseWind, thanks for your comment. I'm not really a load balancer kind of guy. I mostly keep my nose in the C++. So I'm happy to use your suggestions for which status codes should be returned. Then I'll rely on @cjcobb23 and @intelliot to verify that they look right from their perspectives.
So, yes, please go through the list and add some more 4xx/5xx status codes. I'll also add a comment regarding how the line was drawn between 200 and other. Thanks!
@WietseWind, I'm hoping you'll have time to indicate where you'd like additional HTTP status codes added. Is there anything you need from me to push that forward? Thanks.
Hey @WietseWind, on April 11th you noted:
If you agree I'll go through the list and add some more 4xx/5xx status codes.
That's great and would be appreciated. But you've had five weeks now to do that. I know you're really busy. If you can get back to me within the next week to at least let me know when you can get around to identifying those additional status codes I'd really appreciate it.
If I don't hear back from you within the week I'll assume you're too busy to make the assessment right now and we can commit the HTTP status code changes that are currently in the pull request. If that should happen, then you can make a later pull request to put in the additional status codes.
Thanks for the help.
Hi @scottschurr, sorry for that. Will do right now, lost track of this thread. Thanks for the reminder.
I propose the codes below, where a non-standard HTTP code is only returned if one can expect (e.g. when submitting against a pool) another node to respond differently (first call: error, second call, other machine: OK) (except for calls needing auth).
https://gist.github.com/WietseWind/661776d0b4c849bfa5f237101104e02a
Thanks for the help @WietseWind. I've added a commit with the extra status codes and a bit more code that was required to make the new status codes work. I think the status codes are in their final state for this pull request. It would be great if the reviewers would look it over and let me know if there are other things that need fixing. Thanks.
@scottschurr Thanks, appreciate it :D
Ping @cjcobb23 and @mDuo13. How do you folks feel about this pull request in its current state? Do you want any changes? Thanks.
Makes sense @mDuo13, thanks for chiming in.
Thoughts regarding the version flag: if that same flag would also change the behaviour of other version related options, one would never be able to use the previous (current, default) version of the API but with the response codes that allows for failover in reverse proxy environments (for example). Not ideal from a consistency standpoint, but could it be worth triggering this behaviour based on a HTTP request header?
Thoughts regarding the version flag: if that same flag would also change the behaviour of other version related options, one would never be able to use the previous (current, default) version of the API but with the response codes that allows for failover in reverse proxy environments (for example). Not ideal from a consistency standpoint, but could it be worth triggering this behaviour based on a HTTP request header?
I recommend having HTTP status code changes be the the only change in "api_version": 3.
Using a special HTTP request header isn't worth the hassle in code, documentation, or increased number of special cases. Version numbers are free* and we will not run out. Let's use them.
I have updated XLS-22d API Versioning to note that changing status codes is a breaking change.
*(Thanks to @thejohnfreeman for this link.)
Note, I'm not saying that this is a perfect solution or avoids all potential problems. We are just trying to make the best trade-off given our goals and constraints. Using "api_version": 3 will require clients to adapt to the change that was made for "api_version": 2:
https://github.com/XRPLF/rippled/pull/3770 - this is also an example of how a change is gated behind api_version / apiVersion in code
I think clients will have to adapt anyway, because clients may have existing code that expects certain status codes. When they are ready to gain the benefits of automatic failover at the reverse proxy layer, then they should audit / test their code to ensure that they comply with the latest API, including both the new status codes and the signer_list location.
I'm not too sure if HTTP return codes should reflect the underlying JSON-RPC calls, but the spec there is surprisingly sparse...
I've pushed a commit that adds the HTTP status codes suggested by @mDuo13. It also gates returning this status codes based on the "ripplerpc" field of the original request being "3.0" or greater.
Let me know what y'all think.
I think it's fine to have this use "ripplerpc": "3.0". In fact, that might actually be preferable; I hadn't thought of it as an option before. ripplerpc affects HTTP, but not WebSockets. And that's good since HTTP status codes don't apply to WebSockets anyway.
I would wait for confirmation from @WietseWind before merging, though :)
Also, it's important that this gets documented on xrpl.org. We should also document the difference between ripplerpc 1.0 and 2.0, if we haven't already.
Rebased to 1.9.2. @mDuo13 and @WietseWind, are you good with this version of the code? Thanks.
Rebased to 1.9.2. @mDuo13 and @WietseWind, are you good with this version of the code? Thanks.
Very happy with it :) Thank you!
Thanks @mDuo13. If you think these changes are ready to be committed, could you please mark the pull request as approved? But, since you want more time to research the "2.0" case, then you may want to withhold approval until that research is complete. That's fine also.
@cjcobb23, would you like to weigh in on these changes as well? They may have some consequences in Clio.
Thanks @mDuo13! @cjcobb23, it's now your call.
Thanks for the review @cjcobb23! I've squashed and rebased the commit and marked it as passed.