msgp: too few bytes left to read object

Open Dieterbe opened this issue 6 years ago • 5 comments

2018/03/16 10:29:00 [dataprocessor.go:221 func1()] [E] DP getTargetsRemote: error unmarshaling body from mt-read00-12574-medium-ops-b-2445396050-vdcg2/getdata: "msgp: too few bytes left to r>
2018/03/16 10:29:00 [graphite.go:766 executePlan()] [E] HTTP Render msgp: too few bytes left to read object
[Macaron] 2018-03-16 10:29:00: Completed /render 500 Internal Server Error in 129.391075ms

Dieterbe avatar Mar 16 '18 11:03 Dieterbe

I'm seeing these fairly frequently. Any idea what the issue is?

shanson7 avatar Mar 22 '18 15:03 shanson7

I deployed a “silent node” (carbon in, partition 9999) and added some debug statements.

It turns out the buffers are coming back as nil:

2018/05/09 21:05:26 [dataprocessor.go:223 func1()] [E] DEBUG len(buf)=0, is nil:true

It seems like we are getting nil buffers back from the peers when the request gets canceled. Adding more logging, I see:

2018/05/09 21:30:24 [dataprocessor.go:216 func1()] [E] DP getTargetsRemote: error with POST to metrictank-read-046-1/getdata: "500 Internal Server Error"

Looking at that time for metrictank-read-046-1, I see:

2018/05/09 21:30:24 [cluster.go:191 getData()] [E] HTTP getData() start must be before end.

That comes from the cassandra store. It likely has something to do with this logic: https://github.com/grafana/metrictank/blob/master/api/dataprocessor.go#L537

shanson7 avatar May 09 '18 22:05 shanson7

I think this is ccache corruption. For this particular repro request, it was always the same instance that was breaking things. I sent a ccache/delete request and the error is now gone for this repro.

shanson7 avatar May 09 '18 22:05 shanson7

On version 0.9.0, this occurred for me during a schema update and was not related to the ccache at all. Once the schemas were the same on all servers, it went away.

tehlers320 avatar Jun 19 '18 15:06 tehlers320

Sorry, to clarify:

  1. The main issue, "msgp: too few bytes left to read object", is coming from here. It happens when the request to a peer is canceled because another peer has returned an error, so the buffer is nil and cannot be unmarshaled. The fix is probably to just check whether the request was canceled before unmarshaling.

  2. This means there is a separate, underlying problem that causes a peer to return an error in the first place. In my specific case it is some ccache corruption.

shanson7 avatar Jun 27 '18 16:06 shanson7