linux-insides
Why does the kernel need the page table `early_dynamic_pgts`?
Hi,
I have finished Kernel initialization Part 1, but I still have some questions. Could you please give me some hints? Many Thanks.
In arch/x86/kernel/head_64.S, several page tables are defined. After reading this part, I think early paging is handled by three tables:

`(PGD) early_level4_pgt -> (PUD) level3_kernel_pgt -> (PMD) level2_kernel_pgt`
The PMD table `level2_kernel_pgt` is filled with 256 entries, so it can map 512 MB of physical address space, `[0, 512MB)`.
If a virtual address is `0xffffffff81000000`, these page tables map it to physical address `0x1000000`. This is very straightforward. (I hope my understanding is correct.)
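This arithmetic can be checked with a few lines (a standalone sketch of the 4-level index split, not kernel code): each level consumes a 9-bit slice of the virtual address, and with 2 MB pages the low 21 bits are the page offset.

```python
# Sketch: split an x86_64 virtual address into its 4-level paging
# indices, assuming 2 MB pages (9-bit indices at shifts 39/30/21,
# 21-bit page offset).

def decompose(va):
    return {
        "pgd": (va >> 39) & 0x1FF,
        "pud": (va >> 30) & 0x1FF,
        "pmd": (va >> 21) & 0x1FF,
        "offset": va & 0x1FFFFF,
    }

idx = decompose(0xFFFFFFFF81000000)
print(idx)  # {'pgd': 511, 'pud': 510, 'pmd': 8, 'offset': 0}

# level2_kernel_pgt maps [0, 512MB) with 2 MB pages, so PMD entry 8
# resolves to physical address 8 * 2MB = 0x1000000.
pa = idx["pmd"] * 0x200000 + idx["offset"]
print(hex(pa))  # 0x1000000
```

So the walk is `early_level4_pgt[511]` → `level3_kernel_pgt[510]` → `level2_kernel_pgt[8]` → physical `0x1000000`, matching the mapping above.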
However, I noticed that two more tables, starting at the label `early_dynamic_pgts`, are also filled. I think they are a PUD and a PMD too, used to map the kernel from `_text` to `_end`. I don't know why these two tables are needed; after all, we already have three tables that can map 512 MB of physical space.
On x86_64:
At the early (not very first) stages, `early_dynamic_pgts` is used as a PUD (the first 512 entries) and a PMD (the remaining entries) for mapping `__PAGE_OFFSET`:

`va = 0xffff880000000000`, mode = ia32e, 2M page

| entry | shift | size | offset | decimal |
|---|---|---|---|---|
| pgof | 0 | 0x200000 | 0x0 | 0 |
| L2(pmd) | 21 | 0x200 | 0x0 | 0 |
| L3(pud) | 30 | 0x200 | 0x0 | 0 |
| L4(pgd) | 39 | 0x200 | 0x110 | 272 |
So `__PAGE_OFFSET` is mapped through `early_top_pgt[272]`, which points into `early_dynamic_pgts`. If you debug with gdb, you can verify this:

```
(gdb) x/zg &early_top_pgt[272]
```

This entry should point to `early_dynamic_pgts`, and if you follow the paging mechanism you'll get to the PMD level mapping the 2 MB pages.
The kernel code is mapped through:
`va = 0xffffffff80000000`, mode = ia32e, 2M page

| entry | shift | size | offset | decimal |
|---|---|---|---|---|
| pgof | 0 | 0x200000 | 0x0 | 0 |
| L2(pmd) | 21 | 0x200 | 0x0 | 0 |
| L3(pud) | 30 | 0x200 | 0x1FE | 510 |
| L4(pgd) | 39 | 0x200 | 0x1FF | 511 |
Verify:

```
(gdb) x/zg &early_top_pgt[511]
```

This should point to `level3_kernel_pgt`, and

```
(gdb) x/zg &level3_kernel_pgt[510]
```

should point to `level2_kernel_pgt`. These are the 2 MB PMD pages.
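The index columns in the two tables above follow from the same shift arithmetic; a quick sketch to double-check them:

```python
# Sketch: recompute the L4(pgd)/L3(pud) indices from the two tables above.

def pgd_index(va):
    return (va >> 39) & 0x1FF

def pud_index(va):
    return (va >> 30) & 0x1FF

# __PAGE_OFFSET goes through early_top_pgt[272]:
print(pgd_index(0xFFFF880000000000))   # 272 (0x110)

# The kernel mapping goes through early_top_pgt[511]
# and level3_kernel_pgt[510]:
print(pgd_index(0xFFFFFFFF80000000))   # 511 (0x1FF)
print(pud_index(0xFFFFFFFF80000000))   # 510 (0x1FE)
```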
@danix800
Thanks for your reply, but you may have misunderstood my question. `early_dynamic_pgts` is used to map `__PAGE_OFFSET` only after the early page fault handler is set up. What I want to know about is the identity mapping.

(`early_level4_pgt` was renamed to `early_top_pgt` in recent kernels, but I will still use the former name to illustrate.)
In the identity mapping setup, the kernel uses the first two entries in `early_level4_pgt` and uses two tables starting from `early_dynamic_pgts` as the PUD and the PMD. As a result, these three tables map the kernel from `_text` to `_end`:
```
 PGD                     PUD                     PMD
 (early_level4_pgt)      (early_dynamic_pgts,    (early_dynamic_pgts,
                          1st table)              2nd table)
+--------------+        +--------------+        +--------------+
|   entry 0    +------->|   entry 0    +------->|     ...      |
+--------------+        +--------------+        +--------------+        +--------+ _end
|     ...      |        |     ...      |        |   entry 8    +------->| kernel |
+--------------+        +--------------+        +--------------+        |  text  |
                                                |     ...      |        +--------+ _text
                                                +--------------+
```
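The entry numbers in the diagram come out of the same index arithmetic (a sketch, assuming the kernel is loaded at physical address 0x1000000, so va == pa for the identity mapping):

```python
# Sketch: page-table indices for the early identity mapping, where
# the virtual address equals the kernel's physical load address.

def indices(va):
    return ((va >> 39) & 0x1FF,   # PGD index
            (va >> 30) & 0x1FF,   # PUD index
            (va >> 21) & 0x1FF)   # PMD index

# _text identity-mapped at its (assumed) physical address 0x1000000:
print(indices(0x1000000))  # (0, 0, 8)
```

So the identity mapping only needs PGD entry 0 and PUD entry 0, with the kernel text starting at PMD entry 8.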
I don't know why this mapping is needed. After deleting this code and recompiling my kernel, everything is OK: I can still boot my system normally.
This identity mapping is for the page-table switch. Without it, the instruction pointer would be invalid right after `cr3` is set. Can you re-verify your test and report back to us?
Hi, @danix800 Thank you very much! I didn't realize that running the following two instructions needs a temporary mapping:

```assembly
	/* Ensure I am executing from virtual addresses */
	movq	$1f, %rax
	jmp	*%rax
```
Thanks for your help! I have understood why this mapping is necessary.
I have debugged my kernel step by step in Bochs and found some strange behavior.

As I said above, I deleted this code, recompiled my kernel, and ran it in Bochs. After `cr3` is set to point to `early_level4_pgt`, Bochs warns that it can't display the physical address of the above two instructions because the page tables (i.e. the PUD and PMD) don't exist:

```
[333497746] ??? (physical address not available)
```

I ignored these warnings and let the kernel continue running. The kernel reaches `movl $0x80000001, %eax` successfully.
The following code is copied from here.
```assembly
	/* Setup early boot stage 4 level pagetables. */
	addq	phys_base(%rip), %rax
	movq	%rax, %cr3		/* pagetable switching */

	/* Ensure I am executing from virtual addresses */
	movq	$1f, %rax		/* Bochs prompts: physical address not available */
	jmp	*%rax			/* Bochs prompts: physical address not available */
1:
	/* Check if nx is implemented */
	movl	$0x80000001, %eax	/* Bochs reaches here successfully! Everything is OK! */
	cpuid
	movl	%edx, %edi
```
I guess that Bochs detects the error and continues fetching instructions from physical memory, even though it doesn't know what will happen. I have tested my kernel with VMware and QEMU: the former also boots successfully, but QEMU can't. I think this behavior may be related to the CPU emulation.
I'm investigating this too. For QEMU, when KVM is enabled (`--enable-kvm`) the kernel can also boot. So I think there's some page-fault handling under the hood in KVM.
`arch/x86/kvm/mmu.c` has page-fault handling; that might be where the real magic happens. I'm not sure.
My Bochs and VMware don't have any KVM mechanism. Things get a little more interesting.
Hi, @danix800 I happened to see your question on Stack Overflow. I also sent an email to the linux-mm mailing list, but nobody replied to me.
Yes, nobody seems to be interested. I think it's all on us now. Currently I'm studying GRUB; I'll dig into this when I have time.
I actually dug in a little a few days ago. I've already set up a debugging environment and can break in the KVM code on the exact faulting instruction.
But without a deep understanding of KVM it's difficult to unearth everything that's going on, so I gave up for now.
On the qemu-devel list nobody replied. On the linux-kernel list, here, there's also no useful info available.
Happy debugging!
I will also keep watching this question and hope that we can solve it in the future. :smiley:
Years after the last comments, I'm also running into this :-)

I think the behavior is related to the TLB, as the linux-kernel list indicates.

I made some tests on the kernel v6.2 source with QEMU, deleting (1) the identity-mapping setup and (2) the TLB flush. The results:

| | no `-enable-kvm` | `-enable-kvm` |
|---|---|---|
| delete 1 | boot fails | boot fails |
| delete 1 & 2 | boot fails | boot succeeds |
In the `-enable-kvm` case, if we don't set up the identity mapping and don't flush the TLB, the kernel boots successfully; if we do flush the TLB, the boot fails. This makes sense: if the TLB still caches translations from the identity-mapped page tables, no page fault occurs, and once the TLB is flushed, the page fault occurs.
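The TLB hypothesis can be illustrated with a toy model (a sketch only; the `ToyMMU` class and its 2 MB-granularity lookup are invented for illustration, not how KVM or QEMU actually work): a translation cached before the `cr3` switch keeps serving fetches until a flush forces a fresh walk, which then faults because the identity entries are gone.

```python
# Toy model: a cached TLB translation survives a page-table switch
# until the TLB is explicitly flushed.

class ToyMMU:
    def __init__(self, page_table):
        self.page_table = page_table   # maps 2 MB-aligned va -> pa
        self.tlb = {}

    def load_cr3(self, page_table, flush=True):
        self.page_table = page_table
        if flush:
            self.tlb.clear()

    def translate(self, va):
        page, off = va & ~0x1FFFFF, va & 0x1FFFFF
        if page in self.tlb:                  # TLB hit: no table walk
            return self.tlb[page] | off
        if page not in self.page_table:       # walk fails: page fault
            raise RuntimeError("page fault at %#x" % va)
        self.tlb[page] = self.page_table[page]
        return self.tlb[page] | off

identity = {0x1000000: 0x1000000}  # boot tables with identity mapping
new_tables = {}                    # new tables without it

mmu = ToyMMU(identity)
mmu.translate(0x1000000)           # translation now cached in the TLB

mmu.load_cr3(new_tables, flush=False)
print(hex(mmu.translate(0x1000000)))  # 0x1000000: stale TLB entry still hits

mmu.load_cr3(new_tables, flush=True)
try:
    mmu.translate(0x1000000)
except RuntimeError as e:
    print(e)                          # page fault at 0x1000000
```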
Without `-enable-kvm`, my guess is that QEMU's software emulation doesn't model the TLB the same way a hardware TLB works, so the page fault always occurs. However, this is more of an obscure guess than solid proof.