ouroboros-network icon indicating copy to clipboard operation
ouroboros-network copied to clipboard

Draft ledger ops benchmark in db-analyser

Open nfrisby opened this issue 2 years ago • 9 comments

This PR is draft because

  • it's experimental
  • it targets an old commit; I used all upstream commits that the master tip of cardano-node refers to

Fixes #4052

nfrisby avatar Sep 19 '22 17:09 nfrisby

I invoke the tool like this:

$(cabal list-bin exe:db-analyser) --db state-node-mainnet/db-mainnet/ --only-immutable-db cardano --config mainnet-config.json --analyse-from 42942407 --benchmark-ledger-ops 2>/dev/null 1>db-analyser.log

and then feed it to this Gnuplot script

set terminal pngcairo size 1600,800
set border back
set output "bench-ledger-ops.png"

set xlabel "slot"
set ylabel "dur (μs)"

set multiplot layout 2,1

plot 'db-analyser.log' using 1:3 with impulses title "both",\
     'db-analyser.log' using 1:4 title "mut",\
     'db-analyser.log' using 1:5 title "gc"

# draw a vertical line at each every Shelley boundary
do for [sl in "4492800 4924800 5356800 5788800 6220800 6652800 7084800 7516800 7948800 8380800 8812800 9244800 9676800 10108800 10540800 10972800 11404800 11836800 12268800 12700800 13132800 13564800 13996800 14428800 14860800 15292800 15724800 16156800 16588800 17020800 17452800 17884800 18316800 18748800 19180800 19612800 20044800 20476800 20908800 21340800 21772800 22204800 22636800 23068800 23500800 23932800 24364800 24796800 25228800 25660800 26092800 26524800 26956800 27388800 27820800 28252800 28684800 29116800 29548800 29980800 30412800 30844800 31276800 31708800 32140800 32572800 33004800 33436800 33868800 34300800 34732800 35164800 35596800 36028800 36460800 36892800 37324800 37756800 38188800 38620800 39052800 39484800 39916800 40348800 40780800 41212800 41644800 42076800 42508800 42940800 43372800 43804800 44236800 44668800 45100800 45532800 45964800 46396800 46828800 47260800 47692800 48124800 48556800 48988800 49420800 49852800 50284800 50716800 51148800 51580800 52012800 52444800 52876800 53308800 53740800 54172800 54604800 55036800 55468800 55900800 56332800 56764800 57196800 57628800 58060800 58492800 58924800 59356800 59788800 60220800 60652800 61084800 61516800 61948800 62380800 62812800 63244800 63676800 64108800 64540800 64972800 65404800 65836800 66268800 66700800 67132800 67564800 67996800 68428800 68860800 69292800 69724800 70156800 70588800 71020800 71452800 71884800 72316800 72748800 73180800 73612800 74044800 74476800 74908800 75340800 75772800 76204800 76636800 77068800 77500800 77932800 78364800 78796800 79228800 79660800 80092800 80524800 80956800 81388800 81820800 82252800 82684800 83116800 83548800 83980800 84412800 84844800 85276800 85708800 86140800 86572800 87004800 87436800"] {
  set arrow from sl, screen 0.5 to sl, screen 0.6 nohead ls 28
  set arrow from sl, screen 0 to sl, screen 0.1 nohead ls 28
}

# draw a heavier line for the epoch transitions that are also HFC era transitions
do for [sl in "4492800 16588800 23068800 39916800"] {
  set arrow from sl, screen 0.5 to sl, screen 0.575 nohead ls 28 linewidth 5
  set arrow from sl, screen 0.0 to sl, screen 0.075 nohead ls 28 linewidth 5
}

plot 'db-analyser.log' using 1:4 with impulses title "mut",\
     'db-analyser.log' using 1:7 title "forcst",\
     'db-analyser.log' using 1:8 title "hdrTik",\
     'db-analyser.log' using 1:9 title "hdrApp",\
     'db-analyser.log' using 1:10 title "blkTik",\
     'db-analyser.log' using 1:11 title "blkApp" 

unset multiplot


This time, I interrupted the tool after a few minutes, so the tool didn't get through very many slots. It yielded the following chart. I used the commits from the cabal.project of cardano-node as of commit 36b5f661e (a recent master).

The x-axis is the slot number of the block, the y-axis is duration in microseconds. The extra-tall orange ticks demarcate the first slot of each epoch. Legend semantics:

  • both means elapsed_ns from https://hackage.haskell.org/package/base-4.17.0.0/docs/GHC-Stats.html
  • mut means mutator_elapsed_ns
  • gc means gc_elapsed_ns
  • the second plot series corresponds to the steps in code being instrumented:

https://github.com/input-output-hk/ouroboros-network/blob/3961b4d6bce2b90a95df22fd0226e2873638a758/ouroboros-consensus-cardano/tools/db-analyser/Analysis.hs#L515-L548

bench-ledger-ops


That top chart shows both significant mutator and GC time elapsed at the first block of each epoch (these epochs were still d>0, so the first slot was always filled by an overlay schedule block).

The second chart shows how that mutator time decomposes: mostly block ticking (not surprising) and header application (surprising!). I haven't checked yet, but strongly anticipate that this bang pattern isn't sufficient to force the entire forecast operation, and so significant portions of the duration of the forecast computation are instead being forced by (and so attributed to) the header application computation. (Edit: see PR input-output-hk/ouroboros-network#4028 -- my suspicion was correct and the plots in later comments on this PR used that patch)

https://github.com/input-output-hk/ouroboros-network/blob/3961b4d6bce2b90a95df22fd0226e2873638a758/ouroboros-consensus-cardano/tools/db-analyser/Analysis.hs#L519

nfrisby avatar Sep 19 '22 17:09 nfrisby

bench-ledger-ops

nfrisby avatar Sep 20 '22 05:09 nfrisby

Here's the entire history. The only thing you might add to this plot is the Byron epoch delimiters---we chose not too because they'd be very dense.

bench-ledger-ops

nfrisby avatar Sep 21 '22 23:09 nfrisby

And the same plot, for just the youngest 100k blocks. bench-ledger-ops-most-recent-100k-blocks

nfrisby avatar Sep 21 '22 23:09 nfrisby

And finally for just the youngest 500 blocks -- note that these happen to not include a spike, so the y-axis is range is much smaller.

bench-ledger-ops-most-recent-500-blocks

nfrisby avatar Sep 21 '22 23:09 nfrisby

Here's my latest gnuplot script, which I invoke like this nix-shell -p gnuplot --run "gnuplot -e \"benchmarkLedgerOpsLogFilePath='db-analyser7.log'\" gp.scr"

set terminal pngcairo size 1600,800
set border back
set output "bench-ledger-ops.png"

set xlabel "slot"
set ylabel "dur (μs)"

set multiplot layout 2,1

plot benchmarkLedgerOpsLogFilePath using 1:3 with impulses title "both",\
     benchmarkLedgerOpsLogFilePath using 1:4 title "mut",\
     benchmarkLedgerOpsLogFilePath using 1:5 title "gc"

# draw a vertical line at each every Shelley boundary
do for [sl in "4492800 4924800 5356800 5788800 6220800 6652800 7084800 7516800 7948800 8380800 8812800 9244800 9676800 10108800 10540800 10972800 11404800 11836800 12268800 12700800 13132800 13564800 13996800 14428800 14860800 15292800 15724800 16156800 16588800 17020800 17452800 17884800 18316800 18748800 19180800 19612800 20044800 20476800 20908800 21340800 21772800 22204800 22636800 23068800 23500800 23932800 24364800 24796800 25228800 25660800 26092800 26524800 26956800 27388800 27820800 28252800 28684800 29116800 29548800 29980800 30412800 30844800 31276800 31708800 32140800 32572800 33004800 33436800 33868800 34300800 34732800 35164800 35596800 36028800 36460800 36892800 37324800 37756800 38188800 38620800 39052800 39484800 39916800 40348800 40780800 41212800 41644800 42076800 42508800 42940800 43372800 43804800 44236800 44668800 45100800 45532800 45964800 46396800 46828800 47260800 47692800 48124800 48556800 48988800 49420800 49852800 50284800 50716800 51148800 51580800 52012800 52444800 52876800 53308800 53740800 54172800 54604800 55036800 55468800 55900800 56332800 56764800 57196800 57628800 58060800 58492800 58924800 59356800 59788800 60220800 60652800 61084800 61516800 61948800 62380800 62812800 63244800 63676800 64108800 64540800 64972800 65404800 65836800 66268800 66700800 67132800 67564800 67996800 68428800 68860800 69292800 69724800 70156800 70588800 71020800 71452800 71884800 72316800 72748800 73180800 73612800 74044800 74476800 74908800 75340800 75772800 76204800 76636800 77068800 77500800 77932800 78364800 78796800 79228800 79660800 80092800 80524800 80956800 81388800 81820800 82252800 82684800 83116800 83548800 83980800 84412800 84844800 85276800 85708800 86140800 86572800 87004800 87436800"] {
  set arrow from sl, screen 0.5 to sl, screen 0.6 nohead ls 28
  set arrow from sl, screen 0.0 to sl, screen 0.1 nohead ls 28
}

# draw a heavier line for the epoch transitions that are also HFC era transitions
do for [sl in "4492800 16588800 23068800 39916800"] {
  set arrow from sl, screen 0.5 to sl, screen 0.575 nohead ls 28 linewidth 5
  set arrow from sl, screen 0.0 to sl, screen 0.075 nohead ls 28 linewidth 5
}

plot benchmarkLedgerOpsLogFilePath using 1:4 with impulses title "mut",\
     benchmarkLedgerOpsLogFilePath using 1:7 title "forcst",\
     benchmarkLedgerOpsLogFilePath using 1:8 title "hdrTik",\
     benchmarkLedgerOpsLogFilePath using 1:9 title "hdrApp",\
     benchmarkLedgerOpsLogFilePath using 1:10 title "blkTik",\
     benchmarkLedgerOpsLogFilePath using 1:11 title "blkApp" 

unset multiplot

# the first shelley epoch was 208, and it began in slot 4492800
# there are 10k/f = 2160 * 10 * 20 = 432000 slots in each Shelley epoch

# these are the first epoches of each ledger era
# (see https://github.com/input-output-hk/cardano-ledger/wiki/First-Block-of-Each-Era)
# Byron		0
# Shelley	208
# Allegra 	236
# Mary	 	251
# Alonzo 	290
# Alonzo' 	298

# these are the first slots of each Shelley epoch
# 208 4492800
# 209 4924800
# 210 5356800
# 211 5788800
# 212 6220800
# 213 6652800
# 214 7084800
# 215 7516800
# 216 7948800
# 217 8380800
# 218 8812800
# 219 9244800
# 220 9676800
# 221 10108800
# 222 10540800
# 223 10972800
# 224 11404800
# 225 11836800
# 226 12268800
# 227 12700800
# 228 13132800
# 229 13564800
# 230 13996800
# 231 14428800
# 232 14860800
# 233 15292800
# 234 15724800
# 235 16156800
# 236 16588800
# 237 17020800
# 238 17452800
# 239 17884800
# 240 18316800
# 241 18748800
# 242 19180800
# 243 19612800
# 244 20044800
# 245 20476800
# 246 20908800
# 247 21340800
# 248 21772800
# 249 22204800
# 250 22636800
# 251 23068800
# 252 23500800
# 253 23932800
# 254 24364800
# 255 24796800
# 256 25228800
# 257 25660800
# 258 26092800
# 259 26524800
# 260 26956800
# 261 27388800
# 262 27820800
# 263 28252800
# 264 28684800
# 265 29116800
# 266 29548800
# 267 29980800
# 268 30412800
# 269 30844800
# 270 31276800
# 271 31708800
# 272 32140800
# 273 32572800
# 274 33004800
# 275 33436800
# 276 33868800
# 277 34300800
# 278 34732800
# 279 35164800
# 280 35596800
# 281 36028800
# 282 36460800
# 283 36892800
# 284 37324800
# 285 37756800
# 286 38188800
# 287 38620800
# 288 39052800
# 289 39484800
# 290 39916800
# 291 40348800
# 292 40780800
# 293 41212800
# 294 41644800
# 295 42076800
# 296 42508800
# 297 42940800
# 298 43372800
# 299 43804800
# 300 44236800
# 301 44668800
# 302 45100800
# 303 45532800
# 304 45964800
# 305 46396800
# 306 46828800
# 307 47260800
# 308 47692800
# 309 48124800
# 310 48556800
# 311 48988800
# 312 49420800
# 313 49852800
# 314 50284800
# 315 50716800
# 316 51148800
# 317 51580800
# 318 52012800
# 319 52444800
# 320 52876800
# 321 53308800
# 322 53740800
# 323 54172800
# 324 54604800
# 325 55036800
# 326 55468800
# 327 55900800
# 328 56332800
# 329 56764800
# 330 57196800
# 331 57628800
# 332 58060800
# 333 58492800
# 334 58924800
# 335 59356800
# 336 59788800
# 337 60220800
# 338 60652800
# 339 61084800
# 340 61516800
# 341 61948800
# 342 62380800
# 343 62812800
# 344 63244800
# 345 63676800
# 346 64108800
# 347 64540800
# 348 64972800
# 349 65404800
# 350 65836800
# 351 66268800
# 352 66700800
# 353 67132800
# 354 67564800
# 355 67996800
# 356 68428800
# 357 68860800
# 358 69292800
# 359 69724800
# 360 70156800
# 361 70588800
# 362 71020800
# 363 71452800
# 364 71884800
# 365 72316800
# 366 72748800
# 367 73180800
# 368 73612800
# 369 74044800
# 370 74476800
# 371 74908800
# 372 75340800
# 373 75772800
# 374 76204800
# 375 76636800
# 376 77068800
# 377 77500800
# 378 77932800
# 379 78364800
# 380 78796800
# 381 79228800
# 382 79660800
# 383 80092800
# 384 80524800
# 385 80956800
# 386 81388800
# 387 81820800
# 388 82252800
# 389 82684800
# 390 83116800
# 391 83548800
# 392 83980800
# 393 84412800
# 394 84844800
# 395 85276800
# 396 85708800
# 397 86140800
# 398 86572800
# 399 87004800
# 400 87436800

nfrisby avatar Sep 21 '22 23:09 nfrisby

Some observations from these results:

  • We cannot conclude anything about ledger vs consensus overhead from these benchmarks. We are not measuring the consensus overhead. I think measuring this requires more careful analysis. Maybe we should create a new issue.
  • We seem to have a strong indication that HFC does not introduce significant overhead, at least for the operations benchmarked here. Why? Because we do not see a discrete increase in the total time these operations take as we cross the epoch boundaries.

dnadales avatar Oct 03 '22 12:10 dnadales

Possible ways to drive this PR home:

  • [x] Incorporate the Gnuplot scripts into the repository. This requires that we adjust the plot boundaries based on the range of the input data.
  • [x] Improve structure (if applicable).
  • [x] Add more comments.
  • [x] Use more descriptive variable names.
  • [x] Document this feature (eg what the benchmark does, how to run it).
  • [x] Decide where to merge this PR: we should avoid developing on a release branch, however merging this to master will probably render db-analyser unusable since we won't have ledger or consensus version compatible with the Chain DB that was created using a previous version. So we should re-consider moving this tool somewhere else (eg cardano-node or to a separate repository).

Example of a place were we could add more comments:

        -- why do we return this:
        pure $ St rtsStats5 $ ExtLedgerState ldgrSt' headerState'

We might consider adding some ASCII version of these drawings to the documentation

image image

dnadales avatar Oct 03 '22 12:10 dnadales

FWIW, my initial reaction was the same as Arnaud's:

None beside sticking to gnuplot 🙂 Yes the doc is not straightforward but no graphing library in Haskell I am aware of is on par anyway. I would be interested if there were an alternative in Haskell though 🙂 Alternatively, python as some great plotting libraries

I've also had pleasant experiences with some JS libraries (plop your data in a JSON array, copy it in to a HTML file, and boom you have interactive plots).

I would hesitate to commit to any plotting library in Haskell -- I expect it all to inevitably break 😢.

But that's easy enough: just also commit the gnuplot script as a back-up?

Are you liking the vega interface?

nfrisby avatar Oct 17 '22 17:10 nfrisby

@nfrisby one of the next steps I had in mind was to output each line as a JSON representation of a SlotDataPoint, eg:

data SlotDataPoint =
    SlotDataPoint
      { slot      :: Int
      , totalTime :: Int
      , mut       :: Int
      , gc        :: Int
       -- ... 
      }

and this is the format that the new plotting and analyzing tool will ingest. In this way, we make sure that if we change the output data in the db-analyser we will detect this in the tools that ingest this data. We can convert this data back to a CSV, but as a separate step (for instance when converting the file to something gnuplot can understand).

dnadales avatar Oct 18 '22 16:10 dnadales

I've also had pleasant experiences with some JS libraries (plop your data in a JSON array, copy it in to a HTML file, and boom you have interactive plots).

Yeah, we could ultimately do that with vega-lite.

I would hesitate to commit to any plotting library in Haskell -- I expect it all to inevitably break 😢.

What would "breaking" mean in this case? If for instance version 4, the currently supported version of vega-lite, is not distributed anymore by the CDN? In that case this could be a problem indeed. We could then spit out a vega-lite conforming JSON directly.

But that's easy enough: just also commit the gnuplot script as a back-up?

Sure thing.

Are you liking the vega interface?

I am. I looked at 4 different Haskell plotting libraries, and this is by far the easiest one to learn and use, albeit not being all the type-safe one would like, but I guess that's a trade off for usability?

@nfrisby how should I proceed:

  • Option 0: remove the vega-lite implementation and generate a gnu-plot script (where I would basically accommodate the epoch and era boundaries in the plot to cover only the data being plotted).
  • Option 1: keep the vega-lite implementation, and also generate the gnu-plot script.

dnadales avatar Oct 18 '22 16:10 dnadales

Option 0: remove the vega-lite implementation and generate a gnu-plot script (where I would basically accommodate the epoch and era boundaries in the plot to cover only the data being plotted).

Why do you have to do that? Can you plot them all (like my gnuplot script does) and then set the x-axis bounds explicitly to automatically crop out the "extra" epoch/era boundaries?

Option 1: keep the vega-lite implementation, and also generate the gnu-plot script.

Why generate the gnuplot script? Can't a static script suffice? Or maybe one that takes gnuplot parameters (you can set variable names on the command-line invocation)?


I don't have a preference.

Tangent: The JSON format is a lot of per-line overhead, so that'll slow things down for longer runs and swell file sizes. Neither of which really matters, but I personally prefer the X-separated values format for tabular data.

nfrisby avatar Oct 18 '22 17:10 nfrisby

Option 0: remove the vega-lite implementation and generate a gnu-plot script (where I would basically accommodate the epoch and era boundaries in the plot to cover only the data being plotted).

Why do you have to do that? Can you plot them all (like my gnuplot script does) and the set the x-axis bounds explicitly to automatically crop out the "extra" epoch/era boundaries?

Won't you end up with vertical lines outside the x-axis as we can see in some of the plots above? Is there a way to prevent this in gnuplot?

I don't have a preference.

In that case I'm going to keep the vega-lite backend just because it gives us the possibility of zooming in and the plots show a tooltip with the data values when the mouse pointer hovers over the data point.

Tangent: The JSON format is a lot of per-line overhead, so that'll slow things down for longer runs and swell file sizes. Neither of which really matters, but I personally prefer the X-separated values format for tabular data.

Good point. This is the problem I want to solve: I'd like to couple the plotting program with the benchmarking program in such a way that if the latter outputs the data in a different format (eg swapping the order of the columns), we can detect this discrepancy at compile time. I guess I could define a Read instance for SlotDataPoint, and in the benchmarking program collect the benchmarking data for a point as a SlotDataPoint value, and output it as a single line (as it is now).

dnadales avatar Oct 19 '22 16:10 dnadales

Related PR that removes the forecasting step in the forging logic input-output-hk/ouroboros-network#4099

jasagredo avatar Oct 24 '22 10:10 jasagredo

I ditched the vega-lite integration, maybe we're never gonna need it after all, and if so I can bring it back

dnadales avatar Oct 26 '22 09:10 dnadales

So I think to merge this PR we need:

  • [ ] An approval from @nfrisby
  • [x] Decide where to merge it. Just rebase on top of master?
  • [ ] Tidy up the commit history.
  • [ ] Make CI pass.

I don't think we're blocked by https://github.com/input-output-hk/ouroboros-network/pull/4028, but the benchmarks introduced in this PR will show different results from what we'd get applying the patch above, in particular: the cost of forecast.

dnadales avatar Oct 26 '22 09:10 dnadales

I'm anticipating that rebasing will let us drop the Use a relative path for era config files commit, since I'm pretty sure Karg already did that on master.

nfrisby avatar Oct 26 '22 17:10 nfrisby

Won't you end up with vertical lines outside the x-axis as we can see in some of the plots above? Is there a way to prevent this in gnuplot?

The gp script does. But i) I don't think it's a problem worth fixing and ii) if we did the math to plot the marks as a normal plot instead of as "arrow"s, then it would be cropped away by explicitly setting the x-axis. But that math (or maybe a condition in the loop that skips before xlo and after xhi) is just the kind of complexity that felt excessive 🤷

nfrisby avatar Oct 26 '22 17:10 nfrisby

This is the problem I want to solve: I'd like to couple the plotting program with the benchmarking program in such a way that if the latter outputs the data in a different format (eg swapping the order of the columns), we can detect this discrepancy at compile time

Yeah, I also feel that. Since this is a diagnostic, I think at run-time instead of at compile-time would be fine too (eg CI should fail even everything built).

.... aha! We should be using column() in the gnuplot scripts instead of the column numbers. I'm surprised I didn't know about this already (maybe I had forgotten it)? https://stackoverflow.com/a/70279186

nfrisby avatar Oct 26 '22 18:10 nfrisby

@nfrisby and I agree it's best to merge this into master, so I'll do that once I get approval from @nfrisby

dnadales avatar Nov 15 '22 10:11 dnadales

bors r+

dnadales avatar Dec 13 '22 17:12 dnadales

Timed out.

iohk-bors[bot] avatar Dec 13 '22 19:12 iohk-bors[bot]