linux-insides
Why does the kernel need the page table `early_dynamic_pgts`?
Hi,
I have finished Kernel initialization Part 1, but I still have some questions. Could you please give me some hints? Many Thanks.
In arch/x86/kernel/head_64.S, several page tables are defined. After reading this part, I think early paging is handled by three tables:

`(PGD) early_level4_pgt -> (PUD) level3_kernel_pgt -> (PMD) level2_kernel_pgt`
The PMD table `level2_kernel_pgt` is filled with 256 entries, so it can map 512 MB of physical address space, `[0, 512MB)`.
If a virtual address is `0xffffffff81000000`, these page tables map it to physical address `0x1000000`. This is very straightforward. (I hope my understanding is correct.)
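This arithmetic can be checked with a few lines (a standalone sketch of the 4-level index split, not kernel code): each level consumes a 9-bit slice of the virtual address, and with 2 MB pages the low 21 bits are the page offset.

```python
# Sketch: split an x86_64 virtual address into its 4-level paging
# indices, assuming 2 MB pages (9-bit indices at shifts 39/30/21,
# 21-bit page offset).

def decompose(va):
    return {
        "pgd": (va >> 39) & 0x1FF,
        "pud": (va >> 30) & 0x1FF,
        "pmd": (va >> 21) & 0x1FF,
        "offset": va & 0x1FFFFF,
    }

idx = decompose(0xFFFFFFFF81000000)
print(idx)  # {'pgd': 511, 'pud': 510, 'pmd': 8, 'offset': 0}

# level2_kernel_pgt maps [0, 512MB) with 2 MB pages, so PMD entry 8
# resolves to physical address 8 * 2MB = 0x1000000.
pa = idx["pmd"] * 0x200000 + idx["offset"]
print(hex(pa))  # 0x1000000
```

So the walk is `early_level4_pgt[511]` → `level3_kernel_pgt[510]` → `level2_kernel_pgt[8]` → physical `0x1000000`, matching the mapping above.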
However, I noticed that two more tables, starting at the label `early_dynamic_pgts`, are also filled. I think they are a PUD and a PMD too, used to map the kernel from `_text` to `_end`. I don't know why these two tables are needed; after all, we already have three tables that can map 512 MB of physical space.
On x86_64:
At the early (not very first) stages, `early_dynamic_pgts` is used as a PUD (the first 512 entries) and a PMD (the remaining entries) for mapping `__PAGE_OFFSET`:

`va = 0xffff880000000000`, mode = ia32e, 2M page

| entry | shift | size | offset | decimal |
|---|---|---|---|---|
| pgof | 0 | 0x200000 | 0x0 | 0 |
| L2(pmd) | 21 | 0x200 | 0x0 | 0 |
| L3(pud) | 30 | 0x200 | 0x0 | 0 |
| L4(pgd) | 39 | 0x200 | 0x110 | 272 |
So `__PAGE_OFFSET` is mapped through `early_top_pgt[272]`, which points into `early_dynamic_pgts`. If you debug with gdb, you can verify this:

```
(gdb) x/zg &early_top_pgt[272]
```

This entry should point to `early_dynamic_pgts`, and if you follow the paging mechanism you'll get to the PMD level mapping the 2 MB pages.
The kernel code is mapped through:
`va = 0xffffffff80000000`, mode = ia32e, 2M page

| entry | shift | size | offset | decimal |
|---|---|---|---|---|
| pgof | 0 | 0x200000 | 0x0 | 0 |
| L2(pmd) | 21 | 0x200 | 0x0 | 0 |
| L3(pud) | 30 | 0x200 | 0x1FE | 510 |
| L4(pgd) | 39 | 0x200 | 0x1FF | 511 |
Verify:

```
(gdb) x/zg &early_top_pgt[511]
```

This should point to `level3_kernel_pgt`, and

```
(gdb) x/zg &level3_kernel_pgt[510]
```

should point to `level2_kernel_pgt`. These are the 2 MB PMD pages.
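The index columns in the two tables above follow from the same shift arithmetic; a quick sketch to double-check them:

```python
# Sketch: recompute the L4(pgd)/L3(pud) indices from the two tables above.

def pgd_index(va):
    return (va >> 39) & 0x1FF

def pud_index(va):
    return (va >> 30) & 0x1FF

# __PAGE_OFFSET goes through early_top_pgt[272]:
print(pgd_index(0xFFFF880000000000))   # 272 (0x110)

# The kernel mapping goes through early_top_pgt[511]
# and level3_kernel_pgt[510]:
print(pgd_index(0xFFFFFFFF80000000))   # 511 (0x1FF)
print(pud_index(0xFFFFFFFF80000000))   # 510 (0x1FE)
```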
@danix800
Thanks for your reply, but you may have misunderstood my question. `early_dynamic_pgts` is used to map `__PAGE_OFFSET` only after the early page fault handler is set up. What I want to know about is the identity mapping.

(`early_level4_pgt` was renamed to `early_top_pgt` in recent kernels, but I will still use the former name to illustrate.)
In the identity mapping setup, the kernel uses the first two entries in `early_level4_pgt` and uses two tables starting from `early_dynamic_pgts` as the PUD and the PMD. As a result, these three tables map the kernel from `_text` to `_end`:
```
 PGD                     PUD                     PMD
 (early_level4_pgt)      (early_dynamic_pgts,    (early_dynamic_pgts,
                          1st table)              2nd table)
+--------------+        +--------------+        +--------------+
|   entry 0    +------->|   entry 0    +------->|     ...      |
+--------------+        +--------------+        +--------------+        +--------+ _end
|     ...      |        |     ...      |        |   entry 8    +------->| kernel |
+--------------+        +--------------+        +--------------+        |  text  |
                                                |     ...      |        +--------+ _text
                                                +--------------+
```
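The entry numbers in the diagram come out of the same index arithmetic (a sketch, assuming the kernel is loaded at physical address 0x1000000, so va == pa for the identity mapping):

```python
# Sketch: page-table indices for the early identity mapping, where
# the virtual address equals the kernel's physical load address.

def indices(va):
    return ((va >> 39) & 0x1FF,   # PGD index
            (va >> 30) & 0x1FF,   # PUD index
            (va >> 21) & 0x1FF)   # PMD index

# _text identity-mapped at its (assumed) physical address 0x1000000:
print(indices(0x1000000))  # (0, 0, 8)
```

So the identity mapping only needs PGD entry 0 and PUD entry 0, with the kernel text starting at PMD entry 8.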
I don't know why this mapping is needed. After deleting this code and recompiling my kernel, everything is OK: I can still boot my system normally.
This identity mapping is for the page-table switch. Without it, the instruction pointer would be invalid right after `cr3` is set. Can you re-verify your test and report back to us?
Hi, @danix800 Thank you very much! I didn't realize that running the following two instructions needs a temporary mapping:

```assembly
	/* Ensure I am executing from virtual addresses */
	movq	$1f, %rax
	jmp	*%rax
```
Thanks for your help! I have understood why this mapping is necessary.
I have debugged my kernel step by step in Bochs and found some strange behavior.

As I said above, I deleted this code, recompiled my kernel, and ran it in Bochs. After `cr3` is set to point to `early_level4_pgt`, Bochs warns that it can't display the physical address of the above two instructions because the page tables (i.e. the PUD and PMD) don't exist:

```
[333497746] ??? (physical address not available)
```

I ignored these warnings and let the kernel continue running. The kernel reaches `movl $0x80000001, %eax` successfully.
The following code is copied from here.
```assembly
	/* Setup early boot stage 4 level pagetables. */
	addq	phys_base(%rip), %rax
	movq	%rax, %cr3		/* pagetable switching */

	/* Ensure I am executing from virtual addresses */
	movq	$1f, %rax		/* Bochs prompts: physical address not available */
	jmp	*%rax			/* Bochs prompts: physical address not available */
1:
	/* Check if nx is implemented */
	movl	$0x80000001, %eax	/* Bochs reaches here successfully! Everything is OK! */
	cpuid
	movl	%edx, %edi
```
I guess that Bochs detects the error and continues fetching instructions from physical memory, even though it doesn't know what will happen. I have tested my kernel with VMware and QEMU: the former also boots successfully, but QEMU can't. I think this behavior may be related to the CPU emulation.
I'm investigating this too. For QEMU, when KVM is enabled (`--enable-kvm`) the kernel can also boot. So I think there's some page-fault handling under the hood in KVM.
`arch/x86/kvm/mmu.c` has page-fault handling; that might be where the real magic happens. I'm not sure.
My Bochs and VMware don't have any KVM mechanism. Things get a little more interesting.
Hi, @danix800 I happened to see your question on Stack Overflow. I also sent an email to the linux-mm mailing list, but nobody replied to me.
Yes, nobody seems to be interested. I think it's all on us now. Currently I'm studying GRUB; I'll dig into this when I have time.
I actually dug in a little a few days ago. I've already set up a debugging environment and can break in the KVM code on the exact faulting instruction.
But without a deep understanding of KVM it's difficult to unearth everything that's going on, so I gave up for now.
On the qemu-devel list nobody replied. On the linux-kernel list, here, there's also no useful info available.
Happy debugging!
I will also keep watching this question and hope that we can solve it in the future. :smiley:
Years after the last comments, I'm also running into this :-)

I think the behavior is related to the TLB, as the linux-kernel list indicates.

I made some tests on the kernel v6.2 source with QEMU, deleting (1) the identity-mapping setup and (2) the TLB flush. The results:

| | no `-enable-kvm` | `-enable-kvm` |
|---|---|---|
| delete 1 | boot fails | boot fails |
| delete 1 & 2 | boot fails | boot succeeds |
In the `-enable-kvm` case, if we don't set up the identity mapping and don't flush the TLB, the kernel boots successfully; if we do flush the TLB, the boot fails. This makes sense: if the TLB still caches translations from the identity-mapped page tables, no page fault occurs, and once the TLB is flushed, the page fault occurs.
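The TLB hypothesis can be illustrated with a toy model (a sketch only; the `ToyMMU` class and its 2 MB-granularity lookup are invented for illustration, not how KVM or QEMU actually work): a translation cached before the `cr3` switch keeps serving fetches until a flush forces a fresh walk, which then faults because the identity entries are gone.

```python
# Toy model: a cached TLB translation survives a page-table switch
# until the TLB is explicitly flushed.

class ToyMMU:
    def __init__(self, page_table):
        self.page_table = page_table   # maps 2 MB-aligned va -> pa
        self.tlb = {}

    def load_cr3(self, page_table, flush=True):
        self.page_table = page_table
        if flush:
            self.tlb.clear()

    def translate(self, va):
        page, off = va & ~0x1FFFFF, va & 0x1FFFFF
        if page in self.tlb:                  # TLB hit: no table walk
            return self.tlb[page] | off
        if page not in self.page_table:       # walk fails: page fault
            raise RuntimeError("page fault at %#x" % va)
        self.tlb[page] = self.page_table[page]
        return self.tlb[page] | off

identity = {0x1000000: 0x1000000}  # boot tables with identity mapping
new_tables = {}                    # new tables without it

mmu = ToyMMU(identity)
mmu.translate(0x1000000)           # translation now cached in the TLB

mmu.load_cr3(new_tables, flush=False)
print(hex(mmu.translate(0x1000000)))  # 0x1000000: stale TLB entry still hits

mmu.load_cr3(new_tables, flush=True)
try:
    mmu.translate(0x1000000)
except RuntimeError as e:
    print(e)                          # page fault at 0x1000000
```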
Without `-enable-kvm`, my guess is that QEMU's software emulation doesn't model the TLB the same way a hardware TLB works, so the page fault always occurs. However, this is more of an obscure guess than solid proof.