GetResources is returning unexpected data when it is executed on VMs with Passthrough CPU
Issue description
In MAAS we use some utilities from the LXD library to extract the hardware info of a machine (see here). When this is run on a VM with a passthrough CPU, it reports that the machine has N sockets, where N is the number of CPU threads. The problem is that the socket name changes from time to time, which causes MAAS not to display the CPU model name. See this for the full output and this for the output of /proc/cpuinfo. In this specific case it flips between Intel(R) Xeon(R) CPU Max 9462 and pc-q35-6.2 (with vendor QEMU).
Is this a known limitation or can this be considered as a bug?
Associated MM thread https://chat.canonical.com/canonical/pl/wzmwbz7ixbbk9rh4847wni7kua
With the pastebin JSON data:
$ jq '.resources.cpu.sockets[] | .name' < /tmp/lxc-info.json
"Intel(R) Xeon(R) CPU Max 9462"
"Intel(R) Xeon(R) CPU Max 9462"
"pc-q35-6.2"
"pc-q35-6.2"
"pc-q35-6.2"
"pc-q35-6.2"
"pc-q35-6.2"
"pc-q35-6.2"
"pc-q35-6.2"
"pc-q35-6.2"
"Intel(R) Xeon(R) CPU Max 9462"
"Intel(R) Xeon(R) CPU Max 9462"
"Intel(R) Xeon(R) CPU Max 9462"
"Intel(R) Xeon(R) CPU Max 9462"
"Intel(R) Xeon(R) CPU Max 9462"
"Intel(R) Xeon(R) CPU Max 9462"
"Intel(R) Xeon(R) CPU Max 9462"
"Intel(R) Xeon(R) CPU Max 9462"
"Intel(R) Xeon(R) CPU Max 9462"
"Intel(R) Xeon(R) CPU Max 9462"
"pc-q35-6.2"
"pc-q35-6.2"
"pc-q35-6.2"
"pc-q35-6.2"
"pc-q35-6.2"
"pc-q35-6.2"
"pc-q35-6.2"
"pc-q35-6.2"
"pc-q35-6.2"
"pc-q35-6.2"
"pc-q35-6.2"
"pc-q35-6.2"
I initially thought it could be a core vs hyperthread issue but the numbers would be equal and they are not:
$ jq '.resources.cpu.sockets[] | .name' < /tmp/lxc-info.json | sort | uniq -c
12 "Intel(R) Xeon(R) CPU Max 9462"
20 "pc-q35-6.2"
Since that's a VM, I'm wondering if the host QEMU process is just not done writing the proper data for all the CPUs when it's first looked up.
@r00ta is there any special timing around the CPU model name flip-flopping? Is that just during boot or also after?
Hi @r00ta,
please can you give me a bit of context in addition to what Simon asked.
Are you using CPU hotplug for that VM? Is this reliably reproducible or not?
GetResources generates its output based on /proc/cpuinfo and /sys/devices/system/cpu.
See this for the full output and this for the output of /proc/cpuinfo
These two outputs don't appear to have been collected from the same machine. Even the CPUs are different: https://www.intel.com/content/www/us/en/products/sku/232597/intel-xeon-cpu-max-9462-processor-75m-cache-2-70-ghz/specifications.html https://www.intel.com/content/www/us/en/products/sku/192478/intel-xeon-platinum-8280-processor-38-5m-cache-2-70-ghz/specifications.html
I can imagine an inconsistency such as a mixture of pc-q35-6.2 and Intel(R) Xeon(R) CPU Max 9462, but seeing a completely different CPU, Intel(R) Xeon(R) Platinum 8280, on the same machine looks suspicious.
When this is run on a VM with a passthrough CPU, it reports that the machine has N sockets, where N is the number of CPU threads.
This thing is running on OpenStack/libvirt, right? Please, can you:
- share full XML configuration of that VM
- recollect /proc/cpuinfo and lxc query -X GET /1.0/resources | jq '.cpu' from inside the VM
Hello @mihalicyn. Regarding the topology of the setup, these are KVM virtual machines with passthrough CPU running on top of physical machines.
Full XML output from virsh dumpxml: https://pastebin.ubuntu.com/p/7MpkxBq2T7/
Output from cat /proc/cpuinfo: https://pastebin.ubuntu.com/p/2wKxPHn3BX/
Output from lscpu: https://pastebin.ubuntu.com/p/2kbNbMtHsJ/
Output from lxc query: https://pastebin.ubuntu.com/p/838nZHSKxG/
Machine view from MAAS:
Wow! Thanks, @alanbach! Now the data looks consistent and I think I know what's going on there.
We have this to extract some CPU data from DMI for some architectures, such as arm64, and we use it as a fallback code path when we can't get CPU data from /proc/cpuinfo. I still don't get why we take this fallback path, but it's already a step forward.
Part of dmidecode output from inside the VM:
Handle 0x0400, DMI type 4, 48 bytes
Processor Information
Socket Designation: CPU 0
Type: Central Processor
Family: Other
Manufacturer: QEMU
ID: 41 0F A4 00 FF FB 8B 07
Version: pc-q35-8.2
The Manufacturer and Version fields are what gets mixed in with the real CPU vendor/model from cpuinfo.
Upd: It is not surprising that I can't reproduce it on my own machine, as every value in the cat /sys/devices/system/cpu/cpu*/topology/physical_package_id output is 0.
We had a debugging session with Alan and, thanks to him, I now have some debug logs and understand what's going on there. Our code implicitly relies on the ordering of cat /sys/devices/system/cpu/cpu*/topology/physical_package_id being the same as the ordering of the blocks in /proc/cpuinfo. This works perfectly well for a single-processor machine, or even for dual-processor machines where physical_package_id is either 0 or 1. But inside a VM with a certain configuration you may have, for instance, 32 "physical" processors, each with a different physical_package_id. Moreover, the cpu* directories won't be sorted the same way as the blocks in /proc/cpuinfo. For example:
drwxr-xr-x 9 root root 0 Dec 18 09:27 cpu0
drwxr-xr-x 9 root root 0 Dec 18 09:27 cpu1
drwxr-xr-x 9 root root 0 Dec 18 09:27 cpu10
drwxr-xr-x 9 root root 0 Dec 18 09:27 cpu11
drwxr-xr-x 9 root root 0 Dec 18 09:27 cpu12
drwxr-xr-x 9 root root 0 Dec 18 09:27 cpu13
drwxr-xr-x 9 root root 0 Dec 18 09:27 cpu14
drwxr-xr-x 9 root root 0 Dec 18 09:27 cpu15
drwxr-xr-x 9 root root 0 Dec 18 09:27 cpu2
drwxr-xr-x 9 root root 0 Dec 18 09:27 cpu3
drwxr-xr-x 9 root root 0 Dec 18 09:27 cpu4
drwxr-xr-x 9 root root 0 Dec 18 09:27 cpu5
drwxr-xr-x 9 root root 0 Dec 18 09:27 cpu6
drwxr-xr-x 9 root root 0 Dec 18 09:27 cpu7
drwxr-xr-x 9 root root 0 Dec 18 09:27 cpu8
drwxr-xr-x 9 root root 0 Dec 18 09:27 cpu9
So, when you iterate over it, you get cpu0, then cpu1, then cpu10. This means a bunch of lines in cpuinfo are skipped in order to find cpu10. Because we use the same scanner object for the entire thing, when we eventually get to cpu2 we fail to find it in the /proc/cpuinfo scanner, as we have already skipped past it. I'll put up a PR to fix this tomorrow.
amazing, thanks both.