gorums icon indicating copy to clipboard operation
gorums copied to clipboard

Gorums cannot handle nil for non-PerNodeArg Gorums RPC calls.

Open wruiwr opened this issue 8 years ago • 10 comments
trafficstars

For my read-write distributed storage implemented by Gorums, if I send a nil message through either read or write quorum call, then the quorum call returns a nil reply with error nil.

The reason for this issue is the callGRPC methods generated by Gorums.

For example in method: callGRPCWrite:

if arg == nil {
    // send a nil reply to the for-select-loop
    replyChan <- internalWriteResponse{node.id, nil, nil}
    return
}

This allows the write quorum call to send a nil write request and reply a nil as the write response with error message nil. (the quorum function can receive enough replies to obtain a quorum)

So, in my quorum functions, I currently wrote some code to handle the nil replies, otherwise, it can cause panic, since there's no value and timestamp to get if replies are nils.

wruiwr avatar Sep 24 '17 08:09 wruiwr

I'm not sure I understand the issue here. gRPC doesn't allow passing nil, as the marshalling step will panic (as it should). It seems that we now protect against passing nil values on all methods generated by Gorums ( #12).

I think that passing nil values should panic, as this is the developer's mistake. There is however one special case where I think we should handle nil values: when a per-node-arg function returns nil, we should drop that request and reduce the expected number of replies for that call. This would allow us to do some cool things with the per-node-arg functions. What do you think @meling?

s111 avatar Sep 24 '17 10:09 s111

The quorum call actually can invoke gRPC calls with nil message and the quorum functions can also receive enough replies (also nil) to obtain a quorum.

So, the returned reply for this quorum call in this case is nil, however if the server failure happens, then no quorum is obtain, the returned reply is also nil. Therefore, I think in my tests, this two cases return the same values (both nils) (one is not an error (the err is nil) according to Hein, another is caused by the server failure), so I can not check the difference.

wruiwr avatar Sep 24 '17 11:09 wruiwr

I think Hein has a solution about this issue. ;)

wruiwr avatar Sep 24 '17 11:09 wruiwr

A quorum call works with nil values as we circumvent the actual gRPC call:

if arg == nil {
	       // send a nil reply to the for-select-loop
	      replyChan <- internalWriteResponse{node.id, nil, nil}
	      return
}

An actual gRPC call would panic with:

panic: rpc error: code = Internal desc = grpc: error while marshaling: proto: Marshal called with nil

@meling: If you look at config_qc_test.go#L928, a request is ignored but the nil values is still added to the reply slice, i.e., we are allowing a write request which needs a quorum of at least 3 with only 2 actual responses. The call succeeds as len(replies) is > 2 (as the nil values is counted).

See PR #36 for a potential fix.

s111 avatar Sep 24 '17 11:09 s111

It seems my quick and dirty initial solution was a bit too simplistic; I was hoping to avoid the extra noise around the {{if .PerNodeArg}} conditional in the template. I've made some more changes to the PR, but there is one outstanding issue that needs to be discussed. It turns out that providing a nil argument does not panic in all cases. Take a look at the test in config_qc_test.go#L140. This does not panic when its argument is nil, whereas a similar test with the WritePerNode method does panic; see config_qc_test.go#L956. This is presumably because the Read method takes a ReadRequest, which is actually never used on the server-side. So it seems that it is allowed by gRPC to invoke with a nil argument, e.g., when the server side does not need anything from the client. Will think some more about this. Thoughts anyone?

meling avatar Sep 25 '17 01:09 meling

Added another commit to the fix-nil-in-replies branch. Turns out that it is not easy to check for panic from the Write call since the panic actually happens in another goroutine separate from the calling goroutine running the tests, and so the standard recover() call does not work. One solution could be to check for nil argument to the quorum call and panic immediately, but then also the Read method will fail, even though it works now.

meling avatar Sep 25 '17 02:09 meling

The routine that caused the panic needs to invoke recover, or we might never learn about the panic (expect if we sync with the routine at some point). Maybe we can pass an err chan error to the function invoked as a go routine, that must be read before checking the actual response value?

s111 avatar Sep 27 '17 17:09 s111

My initial thoughts on this is that it seems to be a bit much hassle for handling nil which nobody really should be passing anyway, even though it works for the Read call. I'm continuing to ponder this.

meling avatar Sep 27 '17 17:09 meling

Yes, that's what I'm thinking as well, we should not try to code ourselves around developer error, let it panic. Though, we should recover a crashing go routine and communicate that back to the calling quorum call, so that the panic is actually caught. I.e., the panic needs to be invoked on the main thread so that the program actually panics.

s111 avatar Sep 27 '17 18:09 s111

Don't you think it is a bit of an overkill to complicate things by recovering individual goroutines in order to pass that on to the main goroutine. Not sure if that helps debug the actual problem, especially if the panic is caused by another issue that we haven't thought about yet. I think I prefer to let it panic as it does now. Though we cannot write a test for nil and recover from it.

meling avatar Sep 28 '17 18:09 meling