firecracker icon indicating copy to clipboard operation
firecracker copied to clipboard

CPUID (3/3): CPUID template expansion

Open JonathanWoollett-Light opened this issue 1 year ago • 1 comments

https://github.com/firecracker-microvm/firecracker/pull/3105 must be merged first

Reason for This PR

Many properties of the CPU are not considered when saving and restoring snapshots which could lead to potential instability.

Description of Changes

Expands coverage of CPUID to the full Intel specification for better stability on resuming snapshots.

  • Creating VMs using full CPUID templates: Replacing the old approach of masking the results of get_cpuid2.

  • Validating feature support on snapshot restore: On snapshot restore this validates the full CPUID specification. Certain exclusions are present for complex cases such as cache and memory topology.

  • Efficient serialization/de-serialization of CPUID templates: In attempting to do this most efficiently, we can loose some flexibility e.g. we only support up to a certain number of sub-leaves for a given leaf. This is an unfortunate trade-off, although it offers a massive and I believe necessary performance improvement (from <100ms to <1ms). We can mitigate the risk of security or compatibility issues by implementing generous limits here (this functionality is the static feature under the cpuid crate).

  • Updated templates: Under src/cpuid/templates/ are the c3, t2 and t2s templates. The .json versions are present to offer readable examples. These templates are created with the feature flags static and leaf_18 (which indicate using ArrayVecs over Vecs and including CPUID leaf 18 respectively).

  • [ ] This functionality can be added in rust-vmm.

Future work

There are 3 clear areas where I beleive future work would be useful:

  1. Propagating RawCpuid to rust-vmm.(to replace CpuId/FamWrapper)
  2. Expanding support to cover cache and memory topology.
  3. Considering the areas where checking both MSRs and CPUID in parallel is required for correctness.

License Acceptance

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

PR Checklist

  • [x] All commits in this PR are signed (git commit -s).
  • [x] The issue which led to this PR has a clear conclusion.
  • [x] This PR follows the solution outlined in the related issue.
  • [ ] The description of changes is clear and encompassing.
  • [ ] Any required documentation changes (code and docs) are included in this PR.
  • [ ] Any newly added unsafe code is properly documented.
  • [x] Any API changes follow the Runbook for Firecracker API changes.
  • [x] Any user-facing changes are mentioned in CHANGELOG.md.
  • [ ] All added/changed functionality is tested.

JonathanWoollett-Light avatar Aug 02 '22 18:08 JonathanWoollett-Light

I think this PR can be split into at least 3 PRs each with multiple commits to make it more manageable to review:

  • One PR that refactors error handling and propagation within Firecracker. This can be even opened agains the main branch since its not too coupled to the CPUID crate.
  • One PR adding the bit-fields-macros and bit-fields crates.
  • One PR for the new implementation of the cpuid crate and changes to Firecracker to make it work.

These make sense, but I think the second one might be a bit odd. Given it would be a PR adding libraries that are not being used. It would require some extra description noting it is linked to other PRs.

JonathanWoollett-Light avatar Aug 23 '22 17:08 JonathanWoollett-Light

Some miscellaneous feedback:

  • There is a test build error in the vmm crate that may be related. It'll be worth checking that all double checking the tests once upstream Rust upgrades are complete:
error[E0599]: no variant or associated item named `Leaf0NotFoundInCpuid` found for enum `cpuid::common::Error` in the current scope
   --> src/vmm/tests/integration_tests.rs:373:35
    |
373 |             cpuid::common::Error::Leaf0NotFoundInCpuid
    |                                   ^^^^^^^^^^^^^^^^^^^^ variant or associated item not found in `cpuid::common::Error`

error: aborting due to previous error
  • On an i3.metal (Intel), I am not able to launch a guest VM using only a kernel and rootfs in configuration.
[anonymous-instance:main:ERROR:src/firecracker/src/main.rs:90] Firecracker panicked at 'called `Option::unwrap()` on a `None` value', src/cpuid/src/intel/mod.rs:1072:60
  • On an AMD host specifically, using the kvm_get_supported_cpuid and serializing that Cpuid struct, and then deserializing it results in a serde error (I will attach the JSON file failing deserialization, but at a glance it looks like a null value in a padding field that is tripping up):
Error("expected,or}", line: 11, column: 18))

mattschlebusch avatar Oct 05 '22 08:10 mattschlebusch

  • There is a test build error in the vmm crate that may be related. It'll be worth checking that all double checking the tests once upstream Rust upgrades are complete:
error[E0599]: no variant or associated item named `Leaf0NotFoundInCpuid` found for enum `cpuid::common::Error` in the current scope
   --> src/vmm/tests/integration_tests.rs:373:35
    |
373 |             cpuid::common::Error::Leaf0NotFoundInCpuid
    |                                   ^^^^^^^^^^^^^^^^^^^^ variant or associated item not found in `cpuid::common::Error`

error: aborting due to previous error

This was a quick fix, changing cpuid::common::Error::Leaf0NotFoundInCpuid to cpuid::common::Leaf0NotFoundInCpuid (you can rebase to get this fix).

I'm looking into the other errors.

JonathanWoollett-Light avatar Oct 05 '22 10:10 JonathanWoollett-Light

  • Could the crates construct and construct-macros be made into one crate? Do they have discrete purposes from each other?

Due to how Rust handles procedural macros, they unfortunately need to be in separate crates.

  • cpuid-templates is quite a heavy-handed approach to codifying the hardcoded templates. I don't see a clean way to remove this crate within scope of the PR, so for consistency and sensical sake here are my suggestions:

    • Could the hardcoded templates be split into separate files outside of lib.rs?
    • Ideally the JSON files shouldn't be in the Rust folder structure. Either they should be moved to the resources folder in the firecracker workspace, or deleted. My suggestion here is only for this PR, longer term for these JSON files we should be generating them via the vmm/build.rs path, similar to how the seccomp defaults are being built up, but this is scope creep for this PR and should not be done now.

For the 1st point I could do something like: lib.rs:

lazy_static:lazy_static! {
    static ref C3: Cpuid = include!("c3.in");
    static ref T2: Cpuid = include!("t2.in");
    // ...
}

c3.in etc.:

Cpuid :: Intel (IntelCpuid { leaf_0 : Leaf { eax : 22u32 , ebx : FixedString  ...

On the 2nd point, I like containing this functionality in 1 crate. This means we can move it and delete it as 1 block without needing to move other files in the workspace.

JonathanWoollett-Light avatar Oct 18 '22 09:10 JonathanWoollett-Light

As a manual sanity test on the cpuid-template-expansion branch, I tried to launch a guest configuring the T2S via /machine-config. When I try to start the guest I get a panic as follows:

[anonymous-instance:main:ERROR:src/firecracker/src/main.rs:90] Firecracker panicked at 'called `Option::unwrap()` on a `None` value', src/cpuid/src/intel/mod.rs:907:60

The CI pipeline can be a meaty pipeline to wait for results on. I recommend manually running the CPU feature integ tests (./tools/devtool test -- integration_tests/functional/test_cpu_features.py). The T2S use-case I tried manually is covered in that test suite.

mattschlebusch avatar Oct 26 '22 13:10 mattschlebusch

@bchalios @dianpopa Could you please re-approve (needed to force push to fix formatting).

JonathanWoollett-Light avatar Jan 17 '23 16:01 JonathanWoollett-Light