
Petitboot in 60 seconds

Open ghost opened this issue 6 years ago • 6 comments

As an initial goal in reducing boot time, let's aim for power on to petitboot UI up in <60 seconds.

The aim is for this to apply to a two-socket system without too much memory.

The casual agreement has been to split this roughly equally between hostboot and OPAL for timing.

ghost avatar Jun 06 '18 04:06 ghost

Last I heard it was taking many seconds after the poweron command for the SBE to even get poked by the BMC. Whose quota is that taken from? ;-)

dcrowell77 avatar Jun 06 '18 19:06 dcrowell77

Dan writes:

Last I heard it was taking many seconds after the poweron command for the SBE to even get poked by the BMC. Whose quota is that taken from? ;-)

Only Hostboot and OPAL get measurable time, everyone else has to be close enough to 0 seconds :)

Or, we could have a boot time trading scheme, where we sell seconds of boot time to the highest bidder :)

-- Stewart Smith OPAL Architect, IBM.

ghost avatar Jun 07 '18 00:06 ghost

Giving this a gentle bump. We're better than before but still a long way from reaching petitboot within 60 seconds of the power button being hit. With the new ColdFire FSI driver, I'm going to recommend we include the BMC-side platform twiddling in the 60 seconds.

A quick eyeball look over the IPL process shows a few isteps that are candidates for optimization. Many of the early routines (mss_config etc.) take significant time, while others (centaur configuration) should be able to be bypassed entirely on at least the direct attach platforms.

madscientist159 avatar Jun 11 '19 19:06 madscientist159

Goals are great but they don't really get us anywhere... Some specific thoughts:

  • The centaur steps you mention are essentially skipped; there is no centaur logic running other than a loop through an empty list.
  • Remember that until istep15 we are constrained to only 8MB of memory. This causes us to do a considerable amount of code swapping, which is bad because...
  • PNOR access was the biggest piece of time the last time I heard. This actually got a little worse (I think) with the hiomap changes. There are some very interesting proposals to optimize some of the PNOR usage patterns, but none of them are simple.
  • Before anyone suggests it: running istep logic in parallel can help some, but in our experience we very quickly hit a wall where we actually get slower when we try to do too much parallelization, due to memory thrashing.
  • The boot time is very configuration dependent. Can you post the console output from your box that shows the long isteps? I didn't think mss_eff_config was one of the longer ones.
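
To make the code-swapping point above concrete, here is a toy model (not hostboot code) that counts page loads for a small cache. It assumes 4k pages, an LRU eviction policy, and that the full 8MB is available as a page cache; all three are simplifications for illustration, and the 12MB workload is hypothetical.

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # bytes; assumed page size for this toy model


def count_page_loads(accesses, cache_pages):
    """Count how many page loads an LRU cache of `cache_pages` pages needs."""
    cache = OrderedDict()
    loads = 0
    for page in accesses:
        if page in cache:
            cache.move_to_end(page)  # hit: mark as most recently used
            continue
        loads += 1  # miss: in this model the page must be fetched from PNOR
        cache[page] = True
        if len(cache) > cache_pages:
            cache.popitem(last=False)  # evict the least recently used page
    return loads


# Hypothetical workload: code that walks 12 MB of pages three times in a row.
working_set = (12 * 1024 * 1024) // PAGE_SIZE
accesses = list(range(working_set)) * 3

small = count_page_loads(accesses, (8 * 1024 * 1024) // PAGE_SIZE)
large = count_page_loads(accesses, (16 * 1024 * 1024) // PAGE_SIZE)
print(f"8 MB cache:  {small} page loads")
print(f"16 MB cache: {large} page loads")
```

In this model the 8MB cache reloads every page on every pass, because the working set is larger than the cache, while the 16MB cache loads each page only once. Real access patterns are less regular, so the effect on an actual boot will differ.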

dcrowell77 avatar Jun 11 '19 21:06 dcrowell77

Also this presentation from @stewart-ibm from linux.conf.au 2019

mikey avatar Jun 12 '19 00:06 mikey

Dan writes:

Goals are great but they don't really get us anywhere... Some specific thoughts:

  • The centaur steps you mention are essentially skipped; there is no centaur logic running other than a loop through an empty list.
  • Remember that until istep15 we are constrained to only 8MB of memory. This causes us to do a considerable amount of code swapping, which is bad because...
  • PNOR access was the biggest piece of time the last time I heard. This actually got a little worse (I think) with the hiomap changes. There are some very interesting proposals to optimize some of the PNOR usage patterns, but none of them are simple.

Yeah, it's pretty bad.

https://github.com/stewart-ibm/hostboot/commit/40214e00d98fc4ba45d5067a45f68652ad4fbf68 will address part of it and gain around 6 seconds. I still need to write a subsequent patch that, instead of reading whole 4k blocks from PNOR, reads only the size of the compressed page. That should save us a bunch more.
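
A rough sketch of the arithmetic behind that follow-up idea, not the actual patch: it compares bytes fetched when every compressed page costs a whole 4k block against fetching only the compressed bytes. The page sizes are made up, and the one-page-per-block layout is an assumption.

```python
BLOCK_SIZE = 4096  # bytes; the "whole 4k blocks" mentioned above


def bytes_read(compressed_sizes, whole_blocks):
    """Total bytes fetched from PNOR for a sequence of compressed pages."""
    if whole_blocks:
        # Assumed layout: each compressed page costs one full block read.
        return BLOCK_SIZE * len(compressed_sizes)
    # Read only as many bytes as each compressed page actually occupies.
    return sum(compressed_sizes)


# Hypothetical compressed page sizes in bytes (made up for illustration).
sizes = [1200, 2048, 900, 4096, 1500]

before = bytes_read(sizes, whole_blocks=True)
after = bytes_read(sizes, whole_blocks=False)
print(f"whole blocks: {before} bytes, compressed only: {after} bytes")
```

How much this helps in practice depends on the real compression ratio and on any per-read overhead of the PNOR access path, neither of which is modelled here.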

-- Stewart Smith OPAL Architect, IBM.

ghost avatar Jun 12 '19 01:06 ghost