ghc-events
ghc-events copied to clipboard
Event parsers and serializers should have roundtrip property
By "roundtrip property" I mean for any event if I serialize it with putEvent
,
getEvent
on the resulting ByteString
should give me the original event. #41
shows that we currently don't have this property.
Motivation
I'm trying to improve large eventlog handling of ThreadScope [1]. For this I need some kind of filesystem-backed data structure to be able to serialize unused events [2] to a file and read the ones that are needed. As shown in #41 we currently can't do this.
Problem
The reason why we currently don't have this property is GHC expects that event
parsers should be able to parse an event using only a prefix of the
serialization of the event, and ghc-events
is implemented accordingly. This
means that GHC can add more fields to an event as long as (1) it updates the
event size in the .eventlog file header (2) it puts the new field at the end of
the event. [3] is an example where this happens. Event parsers can then check
size of an event using the .eventlog file header, and skip any space between
the assumed size of the event and what the header says the event's size is [4].
The problem occurs because the serializer (putEvent
and friends) do not take
these extra fields into account and serializes events based on their assumed
size by ghc-events
[5]. So if we try to parse the serialized event using the
original parser used to read the .eventlog file the parse fails.
Proposed fix
I propose fixing this by adding a new "extras" field to events:
data Event = Event
{ evTime :: {-# UNPACK #-}!Timestamp
, evSpec :: EventInfo
, evCap :: Maybe Int
, evExtras :: ByteString -- NEW FIELD
} deriving Show
Parsers can then put the skipped data in this field and serializers can directly serialize that field after serializing the event.
Because this is a breaking change a major version bump will be needed. A major
user is ThreadScope, which already has an upper bound <0.9.0
so I don't think
this will cause major breakage, and I'll also update ThreadScope so that the
next version will be compatible with new ghc-events.
[1]: https://github.com/haskell/ThreadScope/issues/29
[2]: e.g. with the help of an LRU cache
[3]: https://phabricator.haskell.org/D3658
[4]: See padParser
and its call site for the implementation of this.
[5]: In putEventSpec
How does that sound? @maoe I think these days you're maintaining the library, is there anyone else I should ping about this? The implementation should be simple but I just wanted to get some opinions before investing more into it.
I implemented this and it works nicely. I can parse a 225M eventlog file and for every event check if I can serialize and deserialize it. I'll submit a PR soon.