Failed to upload profile with error: EOF of trees cache

Open koolay opened this issue 3 years ago • 8 comments

Agent can't upload profile.

upload profile: do http request: Post "http://xxx-pyroscope.xxx.com/ingest?aggregationType=&from=1644953850&name=xxx-xx-xx-test-499981905626664960%7B%7D&sampleRate=100&spyName=gospy&units=&until=1644953860": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

There are also errors on the Pyroscope server:

time="2022-02-16T07:12:04.462396" level=error msg="trees cache for pyroscope.server.cpu{}:7:1564403200: EOF" file=" storage/storage_put.go:76"
time="2022-02-16T07:12:04.462490" level=error msg="trees cache for pyroscope.server.cpu{}:5:1644403200: EOF" file=" storage/storage_put.go:76"
time="2022-02-16T07:12:04.551946" level=error msg="trees cache for pyroscope.server.alloc_objects{}:5:1644403200: EOF" file=" storage/storage_put.go:76"
time="2022-02-16T07:12:04.554790" level=error msg="trees cache for pyroscope.server.alloc_space{}:7:1564403200: EOF" file=" storage/storage_put.go:76"
time="2022-02-16T07:12:04.554874" level=error msg="trees cache for pyroscope.server.alloc_space{}:5:1644403200: EOF" file=" storage/storage_put.go:76"
time="2022-02-16T07:12:04.557736" level=error msg="trees cache for pyroscope.server.inuse_objects{}:7:1564403200: EOF" file=" storage/storage_put.go:76"
time="2022-02-16T07:12:04.557839" level=error msg="trees cache for pyroscope.server.inuse_objects{}:5:1644403200: EOF" file=" storage/storage_put.go:76"
time="2022-02-16T07:12:04.647913" level=error msg="trees cache for pyroscope.server.inuse_space{}:5:1644403200: EOF" file=" storage/storage_put.go:76"

koolay avatar Feb 16 '22 07:02 koolay

Hi @koolay, thanks for reporting, we'll look into this!

  • Can you let us know what version of Pyroscope you're using?
  • Also are you using the push or the pull integration?
  • Does this happen all the time or just occasionally?

Rperry2174 avatar Feb 16 '22 08:02 Rperry2174

@Rperry2174
The Pyroscope image is pyroscope/pyroscope:0.8.0, and I'm using push mode.

koolay avatar Feb 16 '22 10:02 koolay

Looks like the data has been corrupted. Are these messages repeating with the same numbers in the key name (for example, pyroscope.server.cpu{}:5:1644403200), or are they changing? Do you see similar messages for applications other than pyroscope.server?

Could you please export and send us one of the affected chunks that cause the problem?

curl --fail -G -o /tmp/tree-eof --data-urlencode "k=t:pyroscope.server.cpu{}:5:1644403200" http://localhost:4040/debug/storage/export/trees

(It targets localhost, you may need to adjust the URL.)

The data is raw profile bytes; it does not contain any sensitive info like function names.
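
If curl is not available on the host, the same request can be sketched in Go. This is only an illustration of the curl command above: the endpoint path and the k parameter are taken from that command, while the helper names exportURL and exportTree are made up for this example.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"os"
)

// exportURL builds the export URL for one tree key and percent-encodes the
// key, similar to what curl does with -G and --data-urlencode.
// (Hypothetical helper, not part of Pyroscope.)
func exportURL(base, key string) (string, error) {
	u, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	u.Path = "/debug/storage/export/trees"
	u.RawQuery = url.Values{"k": {key}}.Encode()
	return u.String(), nil
}

// exportTree downloads the raw tree bytes into dst and treats HTTP error
// statuses as failures, like curl --fail. (Hypothetical helper.)
func exportTree(base, key, dst string) error {
	target, err := exportURL(base, key)
	if err != nil {
		return err
	}
	resp, err := http.Get(target)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 400 {
		return fmt.Errorf("export failed: %s", resp.Status)
	}
	f, err := os.Create(dst)
	if err != nil {
		return err
	}
	defer f.Close()
	_, err = io.Copy(f, resp.Body)
	return err
}

func main() {
	// Adjust the base URL if the server is not on localhost.
	err := exportTree("http://localhost:4040", "t:pyroscope.server.cpu{}:5:1644403200", "/tmp/tree-eof")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```

The key contains ':' and '{}', so it has to be percent-encoded; building the query with url.Values takes care of that.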

kolesnikovae avatar Feb 16 '22 11:02 kolesnikovae

Also, could you please clarify which Go client you are using:

  • github.com/pyroscope-io/client/pyroscope
  • github.com/pyroscope-io/pyroscope/pkg/agent/profiler

kolesnikovae avatar Feb 16 '22 12:02 kolesnikovae

It'd be interesting to know what the read timeout is, and to establish what the causality is between the timeout and the ingestion error, i.e.:

  • is the timeout caused by the ingestion error that somehow doesn't complete the request?
  • is the ingestion error caused by the timeout?
  • are they unrelated?
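
To look at the client side in isolation, here is a minimal Go sketch using only the standard library. It assumes nothing about the Pyroscope agent's internals: a local test server simply holds back its response headers, and an http.Client with a short Timeout posts to it. The names reproduceTimeout and isTimeout are made up for this example.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"net/http"
	"net/http/httptest"
	"strings"
	"time"
)

// reproduceTimeout posts to a test server that does not answer until the
// client has already given up, and returns the client's error.
func reproduceTimeout() error {
	release := make(chan struct{})
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		<-release // hold the response headers back
	}))
	defer srv.Close()
	defer close(release) // runs before srv.Close, so the handler can return

	client := &http.Client{Timeout: 50 * time.Millisecond}
	resp, err := client.Post(srv.URL+"/ingest", "application/octet-stream", strings.NewReader("profile"))
	if err == nil {
		resp.Body.Close()
	}
	return err
}

// isTimeout reports whether err is a network timeout, as opposed to a
// refused connection or an HTTP-level failure.
func isTimeout(err error) bool {
	var ne net.Error
	return errors.As(err, &ne) && ne.Timeout()
}

func main() {
	err := reproduceTimeout()
	fmt.Println("error:", err)
	fmt.Println("timeout:", isTimeout(err))
}
```

The client reports a timeout whenever the headers do not arrive before the deadline, regardless of what the server eventually does with the request, so the client-side message alone does not tell which of the three cases applies.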

abeaumont avatar Feb 16 '22 12:02 abeaumont

@kolesnikovae The client is github.com/pyroscope-io/client v0.2.0.

@abeaumont I'm not sure that they are related.

koolay avatar Feb 16 '22 14:02 koolay

Is it related to the size of the tree nodes? @kolesnikovae

koolay avatar Feb 23 '22 11:02 koolay

I'd say it should work fine unless the tree size reaches hundreds of megabytes. The relation between the error message and the HTTP timeout is pretty indirect - the code that causes the error does not depend on the HTTP connection (if the client refuses it, processing won't be interrupted), but, apparently, the server was unable to put the data into the storage in time.

My best guess is that the whole tree got corrupted due to an unexpected shutdown or because of a bug. Unfortunately, the EOF error message (no more data can be read) does not allow us to unambiguously identify the exact reason, therefore I'm asking you to provide us with the data sample.

Trees (profiles) are stored in the underlying KV database (BadgerDB), where the key looks like pyroscope.server.alloc_space{}:5:1644403200 and the value is the tree itself. The error message states that the tree could not be fetched from the intermediate cache, which in turn means that the tree is found in the DB but it either:

  • can't be read from the database. To me it says that the DB layer is affected, which is quite a rare occasion because of BadgerDB's MVCC model - data is written in transactions, therefore in case of a failure we'll either end up with a previous tree version, or nothing. I can't say I've seen anything similar.
  • can't be deserialised. With 99.9% certainty it's a bug.
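
To illustrate why a bare EOF says so little in the deserialisation case, here is a small Go sketch. The record layout (length-prefixed name followed by a value) is invented for this example and is NOT Pyroscope's actual tree format; encodeNode and decodeNode are hypothetical helpers.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"errors"
	"fmt"
	"io"
)

// encodeNode writes a made-up record: uvarint name length, name bytes,
// uvarint value. Not the real Pyroscope format.
func encodeNode(name string, value uint64) []byte {
	var buf bytes.Buffer
	tmp := make([]byte, binary.MaxVarintLen64)
	n := binary.PutUvarint(tmp, uint64(len(name)))
	buf.Write(tmp[:n])
	buf.WriteString(name)
	n = binary.PutUvarint(tmp, value)
	buf.Write(tmp[:n])
	return buf.Bytes()
}

// decodeNode reads the record back and wraps any read error with the step
// that failed.
func decodeNode(data []byte) (string, uint64, error) {
	r := bytes.NewReader(data)
	size, err := binary.ReadUvarint(r)
	if err != nil {
		return "", 0, fmt.Errorf("read name length: %w", err)
	}
	if size > uint64(r.Len()) {
		// The declared length exceeds what is left in the buffer.
		return "", 0, fmt.Errorf("read name: %w", io.ErrUnexpectedEOF)
	}
	name := make([]byte, size)
	if _, err := io.ReadFull(r, name); err != nil {
		return "", 0, fmt.Errorf("read name: %w", err)
	}
	value, err := binary.ReadUvarint(r)
	if err != nil {
		return "", 0, fmt.Errorf("read value: %w", err)
	}
	return string(name), value, nil
}

func main() {
	full := encodeNode("main", 42)

	// Empty value and a record cut off before its last field both end in EOF.
	for _, data := range [][]byte{nil, full[:len(full)-1]} {
		_, _, err := decodeNode(data)
		fmt.Printf("%d bytes: %v (is EOF: %t)\n", len(data), err, errors.Is(err, io.EOF))
	}
}
```

In this sketch an empty value and a truncated record both surface as EOF, which is why the raw bytes of an affected tree are needed to tell the cases apart.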

kolesnikovae avatar Feb 23 '22 12:02 kolesnikovae