ghc-events
ghc-events copied to clipboard
WIP: Constant space sort
This introduces an interface providing constant-space sorting of eventlogs via on-disk merge-sort. I have confirmed that the implementation indeed maintains constant-space behavior, requiring about 79s seconds to sort a 200MB eventlog while not exceeding 40 MBytes residency (using a chunk size of 1e5 events).
This is in contrast to the old in-memory codepath which requires 35 seconds and nearly 5GBytes of residency to sort this same eventlog.
If I increase the chunk size by an order of magnitude (to 1e6 events) then the constant-space sort has essentially the same runtime as the in-memory codepath but runs in merely 300MBytes.
Note: In testing this I realized that several of the encoding paths treated strings incorrectly. Consequently, this depends upon (and contains commits from) #62.
To-do
- [ ] add a test
- [x] ensure that temporary files are cleaned up on failure
- [x] documentation
- [x] optimise throughput
- [ ] sort out how this should be exposed in the interface; should it supercede the existing
sortEvents
? - [x] split out string encoding fixes into new MR
- [x] decide on a sensible chunk size (decided on 5e5 as a reasonable compromise between memory and throughput)
Is it necessary to sort all the events as if they were completely unordered? I was assuming, at the time I wrote my patch, that the events of each capability would come in order within each capability, so all that was necessary was merging the capability streams back again.
Perhaps there is some inconvenience in splitting the capability streams that I'm missing?
I think facundominguez's point is valid. Also it seems that #14 gets in the way if we're going to replace the existing sortEvents with this constant space sorting.