ghc-events icon indicating copy to clipboard operation
ghc-events copied to clipboard

Index out of range in `ghc-events show` and dependencies of `ghc-events`

Open MangoIV opened this issue 1 year ago • 24 comments

Hi! I have had a couple of problems with eventlog2html and hs-speedscope recently and they seem to be a problem either with the library or the eventlog that the ghc RTS emits. I can more or less (sometimes it doesn't happen) reliably reproduce this error with the following program:

module Main where

import Data.ByteString (ByteString)
import qualified Data.ByteString as BS
import Data.Foldable (for_)
import Data.Monoid (Endo (..))

largeEndo :: Endo ByteString
largeEndo = Endo \x -> BS.replicate 100 97 <> x

manyEndos :: Endo ByteString -> Endo ByteString
manyEndos = mconcat . replicate 10_000

main :: IO ()
main = for_ [1 .. 30] \_ ->
  BS.putStr $ appEndo (manyEndos largeEndo) mempty 

compile with ghc -rtsopts -prof -fprof-late -O0 ./bla.hs run with ./bla +RTS -hc -p -l-au

I have not tried to further reduce the example.

This happens on ghc 9.6, 9.8 and 9.10 at least, according to @TeofilC on ghc-events HEAD, and then at least on ghc-events 0.19.0.1.

It affects not only ghc-events but also hs-speedscope and eventlog2html.

related issue on eventlog2html issue tracker: https://github.com/mpickering/eventlog2html/issues/136

MangoIV avatar Aug 13 '24 14:08 MangoIV

I think we're failing in GHC.RTS.Events.Binary.getEvent :: EventParsers -> Get (Maybe Event) where we try to index into the parsers and instead of getting an EventTypeNum within range, we get a really big one.

MangoIV avatar Aug 13 '24 14:08 MangoIV

Running ghc-events with an eventlog from the reproducer triggers an assertion failure at this line: https://github.com/haskell/ghc-events/blob/2168f610bf3580a3a4dca7b0582c6384d5433413/src/GHC/RTS/Events/Binary.hs#L758-L765 It seems like the payloadLen is wrong for certain EVENT_PROF_SAMPLE_COST_CENTRE events

TeofilC avatar Aug 13 '24 15:08 TeofilC

jup, the last four parses before it blows up look like this:

evSpec: HeapProfSampleCostCentre {heapProfId = 0, heapProfResidency = 912, heapProfStackDepth = 1, heapProfStack = [78]}, etRef: 163
evSpec: ProfSampleCostCentre {profCapset = 369098752, profTicks = 4333765376, profStackDepth = 0, profCcsStack = []}, etRef: 167
evSpec: HeapProfSampleString {heapProfId = 0, heapProfResidency = 2029359963648, heapProfLabel = "\SO"}, etRef: 164
evSpec: CreateThread {thread = 22784}, etRef: 0

It looks like the parse of ProfSampleCostCentre introducest the corruption.

MangoIV avatar Aug 13 '24 15:08 MangoIV

@TeofilC is it possible that this is the Ccs stack being too deep for the profStackDepth Word8? And then it overflows and we don't read the Vector to end?

MangoIV avatar Aug 13 '24 15:08 MangoIV

That doesn't seem to be it, you're right, the payloadLen before that is already completely off.

MangoIV avatar Aug 13 '24 15:08 MangoIV

Yes on the GHC side we truncate to 255, which should be fine

TeofilC avatar Aug 13 '24 15:08 TeofilC

So afaiu this means that either the parsing of the header on the ghc-events side or the writing of the header on ghc RTS side goes wrong. (At least the previous events look fine, so I'm not assuming that they're already introducing the corruption)

MangoIV avatar Aug 13 '24 16:08 MangoIV

Strangely enough I can't seem to reproduce this with -threaded. I get the impression that an EVENT_PROF_SAMPLE_COST_CENTRE and an EVENT_HEAP_PROF_SAMPLE_COST_CENTRE event being written at the same time is the cause of this. Yet, I'm not sure how that would be possible with the non-threaded runtime!

TeofilC avatar Aug 13 '24 17:08 TeofilC

Also don’t they acquire the global eventBuf lock before writing?

MangoIV avatar Aug 13 '24 17:08 MangoIV

I also can't reproduce if I only do heap profiling or time profiling with the non-threaded RTS.

So it seems like we need to be doing both with the non-threaded RTS.

I think it's highly likely that somehow we try to write both a heap sample and a time sample at the same time to the eventlog

TeofilC avatar Aug 13 '24 17:08 TeofilC

Also don’t they acquire the global eventBuf lock before writing?

Ah but that only exists for the threaded RTS

TeofilC avatar Aug 13 '24 17:08 TeofilC

I think it's highly likely that somehow we try to write both a heap sample and a time sample at the same time to the eventlog

and the -threaded safe guards against that because it does proper locking while the non-threaded RTS doesn’t but is still somehow concurrent? That’s weird

MangoIV avatar Aug 13 '24 17:08 MangoIV

What seems to be happening is:

  1. we are running a heap profile
  2. we write half of a heap sample event
  3. the time profile timer triggers
  4. we pause what we are doing, and write a time sample event in the middle of our heap event
  5. we unpause and finish writing our heap event

So we end up with something garbled.

This story is backed up by putting a bunch of traces inside the eventlog printing functions in the RTS. This is the order of events they suggest

TeofilC avatar Aug 13 '24 17:08 TeofilC

So there must be a context switch somewhere in dumpCensus, even with the non-threaded RTS?

MangoIV avatar Aug 13 '24 17:08 MangoIV

Is it possible that this is happening because the time profile is running asynchronously? (see initTimer -> initTicker -> createAttachedOSThread with handle_tick -> handleProfTick -> traceProfSampleCostCentre) maybe there should be a lock for writing to the eventlog?

MangoIV avatar Aug 13 '24 18:08 MangoIV

so maybe it would work if we'd just keep the ACQUIRE_LOCK stuff in existence, even in the non-threaded runtime? :eyes:

MangoIV avatar Aug 13 '24 18:08 MangoIV

@TeofilC I don't think that you observation is generally right - I was going to try if I can completely circumvent the issue by using -threaded or using only one of the two traces but I cannot - with a more elaborate example, which is probably a bit too bulky to share, I still really often (~30% of the time) get this problem.

MangoIV avatar Aug 14 '24 12:08 MangoIV

Is there perhaps some similar issue where the initialisation events (cost centre definitions) are being posted to the output, and that is interupted by a heap profile event before all of the definitions are dumped.

Does it happen if you are not using a profiled executable? (ie, don't compile with -prof and using -hT -l)

mpickering avatar Aug 14 '24 13:08 mpickering

Another thing to try is a longer profiling interval (-i10), does that fix the issue?

mpickering avatar Aug 14 '24 13:08 mpickering

Interesting @MangoIV . It sounds like there's potentially multiple bugs. In your larger example, maybe you could try to find the last few events before the eventlog gets corrupted. That might help suggest which event is going wrong

TeofilC avatar Aug 14 '24 16:08 TeofilC

I've written up the bug we found here: https://gitlab.haskell.org/ghc/ghc/-/issues/25165 We should keep looking for the other issues, but I wanted to make sure we didn't forget about this one

TeofilC avatar Aug 15 '24 11:08 TeofilC

@mpickering

  • i10 doesn't fix the issue, this made it so that I got one of the eventlogs through, so I guess it decreases the probability for failure but most still fail
  • another thing to note is that I'm killing the application in question with keyboard signal, so that may be something to consider, as well
  • profiling-detail is late
  • it seems to happen at least a lot less often, if not not at all with -hT -l

MangoIV avatar Aug 20 '24 08:08 MangoIV

This seems to confirm what @TeofilC suggested about time profiling events interrupting the writing of other events and leading to corruption

mpickering avatar Aug 20 '24 09:08 mpickering

it also happens when only using -hy, no time profiling. Is time profiling still doing things when we only request heap profiling with -l?

MangoIV avatar Aug 20 '24 09:08 MangoIV