metrictank
msgp: too few bytes left to read object
2018/03/16 10:29:00 [dataprocessor.go:221 func1()] [E] DP getTargetsRemote: error unmarshaling body from mt-read00-12574-medium-ops-b-2445396050-vdcg2/getdata: "msgp: too few bytes left to r>
2018/03/16 10:29:00 [graphite.go:766 executePlan()] [E] HTTP Render msgp: too few bytes left to read object
[Macaron] 2018-03-16 10:29:00: Completed /render 500 Internal Server Error in 129.391075ms
I'm seeing these fairly frequently. Any idea what the issue is?
I deployed a “silent node” (carbon in, partition 9999) and added some debug statements.
It turns out the buffers are coming back as nil:
2018/05/09 21:05:26 [dataprocessor.go:223 func1()] [E] DEBUG len(buf)=0, is nil:true
It seems like we are getting nil buffers back from the peers when the request gets canceled. After adding more logging I see:
2018/05/09 21:30:24 [dataprocessor.go:216 func1()] [E] DP getTargetsRemote: error with POST to metrictank-read-046-1/getdata: "500 Internal Server Error"
Looking at the logs for metrictank-read-046-1 at that time, I see:
2018/05/09 21:30:24 [cluster.go:191 getData()] [E] HTTP getData() start must be before end.
That comes from the Cassandra store. Likely something to do with this logic: https://github.com/grafana/metrictank/blob/master/api/dataprocessor.go#L537
I think this is ccache corruption. For this particular repro request it was always the same instance that was breaking things. I sent a ccache/delete request and now the error is gone for this repro.
This occurred for me during a schema update and was not related to the ccache at all on version 0.9.0. Once schemas were the same on all servers this went away.
Sorry, to clarify:
- The main issue of "msgp: too few bytes left to read object" is coming from here. This happens when the request to the peer is canceled because another peer has returned an error (so the buffer is nil and not eligible for unmarshaling). The fix for this is probably to just check if the request was canceled before unmarshaling.
- This means that there is another problem that is causing the error to be returned. In my specific case it is some ccache corruption.