Chapter 12 Memory Management
Pages
The kernel treats the physical page as the basic unit of memory management. A page is represented in the kernel by struct page; a simplified definition follows:
struct page {
	unsigned long flags;
	atomic_t _count;
	atomic_t _mapcount;
	unsigned long private;
	struct address_space *mapping;
	pgoff_t index;
	struct list_head lru;
	void *virtual;
};
- flags — stores the status of the page, such as whether it is dirty or locked. Each flag is a single bit, so at least 32 different flags are available on a 32-bit machine; they are defined in <linux/page-flags.h>.
- _count — stores the page's usage count, i.e., how many references it has. A value of -1 means no one is using the page and it can be reallocated. Kernel code normally reads this via page_count(), which returns 0 for a free page and a positive integer for a page in use.
- mapping — indicates that the page is associated with the page cache.
- private — holds data private to the page's owner.
- virtual — the page's virtual address. For high memory this field is NULL, and the page must be mapped dynamically.

The kernel uses struct page to keep track of every physical page in the system, because it needs to know whether a page is free and, if it is not free, who owns it.
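As a rough sketch of how these fields come together (hypothetical module code; the function name page_demo is illustrative):

```c
#include <linux/mm.h>
#include <linux/gfp.h>

/* Allocate a single page and inspect its reference count (sketch). */
static void page_demo(void)
{
	struct page *page = alloc_page(GFP_KERNEL);

	if (!page)
		return;		/* allocation failed */

	/* page_count() returns the usage count; nonzero means in use */
	pr_info("page in use: count=%d\n", page_count(page));

	__free_page(page);	/* drop our reference */
}
```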
Zones
Why do zones exist? Because of hardware limitations: some DMA devices, for example, can only address a portion of physical memory. The kernel therefore groups pages with similar properties into zones. Linux has four principal zones, defined in <linux/mmzone.h>:
- ZONE_DMA—This zone contains pages that can undergo DMA
- ZONE_DMA32—Like ZONE_DMA, this zone contains pages that can undergo DMA. Unlike ZONE_DMA, these pages are accessible only by 32-bit devices.
- ZONE_NORMAL—This zone contains normal, regularly mapped, pages.
- ZONE_HIGHMEM—This zone contains “high memory,” which are pages not permanently mapped into the kernel’s address space.
The actual memory layout is architecture-specific; some architectures have no ZONE_DMA at all. ZONE_HIGHMEM exists mainly on 32-bit machines, covering the memory the kernel cannot map directly; 64-bit architectures have no such zone. Note that zones are a software concept with no direct hardware counterpart. Memory that must be DMA-able has to come from ZONE_DMA, whereas a normal allocation can be satisfied from either the DMA zone or the normal zone, but never from both at once: an allocation cannot cross zone boundaries. Under normal circumstances the kernel avoids satisfying ordinary allocations from the DMA zone, but under memory pressure it may do so.
In the kernel, a zone is represented by struct zone, defined in <linux/mmzone.h>:
struct zone {
	unsigned long watermark[NR_WMARK];
	unsigned long lowmem_reserve[MAX_NR_ZONES];
	struct per_cpu_pageset pageset[NR_CPUS];
	spinlock_t lock;
	struct free_area free_area[MAX_ORDER];
	spinlock_t lru_lock;
	struct zone_lru {
		struct list_head list;
		unsigned long nr_saved_scan;
	} lru[NR_LRU_LISTS];
	struct zone_reclaim_stat reclaim_stat;
	unsigned long pages_scanned;
	unsigned long flags;
	atomic_long_t vm_stat[NR_VM_ZONE_STAT_ITEMS];
	int prev_priority;
	unsigned int inactive_ratio;
	wait_queue_head_t *wait_table;
	unsigned long wait_table_hash_nr_entries;
	unsigned long wait_table_bits;
	struct pglist_data *zone_pgdat;
	unsigned long zone_start_pfn;
	unsigned long spanned_pages;
	unsigned long present_pages;
	const char *name;
};
- lock — a spinlock protecting the structure against concurrent access. Note that it protects only the structure itself, not all the pages that reside in the zone.
- watermark — holds the minimum, low, and high watermarks for this zone. The kernel uses these watermarks as benchmarks for suitable per-zone memory consumption, varying its aggressiveness as the watermarks vary relative to free memory.
- name — the zone's name; mm/page_alloc.c assigns the zones the names DMA, Normal, and HighMem.
Allocating Pages
All the page allocation functions are declared in <linux/gfp.h>. The core function is
struct page * alloc_pages(gfp_t gfp_mask, unsigned int order)
This function allocates 1 << order contiguous physical pages and returns a pointer to the first page. To convert a given page to its logical address, use
void * page_address(struct page *page)
You can use __get_free_pages() to allocate pages and obtain the logical address directly:
unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order)
If you need only a single page, two convenience wrappers exist:
struct page * alloc_page(gfp_t gfp_mask)
unsigned long __get_free_page(gfp_t gfp_mask)
If you need a page filled with zeros, use
unsigned long get_zeroed_page(unsigned int gfp_mask)
Page allocation interfaces
Freeing Pages
The following family of functions frees allocated pages:
void __free_pages(struct page *page, unsigned int order)
void free_pages(unsigned long addr, unsigned int order)
void free_page(unsigned long addr)
Be careful to free only pages you yourself allocated; passing the wrong page, address, or order can result in corruption.
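For example, a paired allocation and free might look like this (a sketch; pages_demo is a hypothetical name):

```c
#include <linux/gfp.h>

static int pages_demo(void)
{
	/* allocate 1 << 3 = 8 physically contiguous pages */
	unsigned long addr = __get_free_pages(GFP_KERNEL, 3);

	if (!addr)
		return -ENOMEM;

	/* ... use the eight pages starting at this logical address ... */

	free_pages(addr, 3);	/* the order must match the allocation */
	return 0;
}
```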
The page functions above work well when you need page-sized chunks of memory, but for byte-sized allocations kmalloc() is a better fit.
kmalloc()
For byte-sized allocations, kmalloc() is more suitable. Its prototype is as follows:
#include <linux/slab.h>
void *kmalloc(size_t size, gfp_t flags)
Memory allocated with kmalloc() is physically contiguous. Usage example:
struct dog *p;
p = kmalloc(sizeof(struct dog), GFP_KERNEL);
if (!p)
/* handle error ... */
gfp_mask Flags
Allocation flags are represented by the gfp_t type, defined in <linux/types.h>. The gfp flags fall into three groups: action modifiers, zone modifiers, and type flags.
Action modifiers specify how the kernel is supposed to allocate the requested memory. Zone modifiers specify from where to allocate memory. Type flags specify a combination of action and zone modifiers as needed by a certain type of memory allocation. GFP_KERNEL, for example, is a type flag, used by code running in process context inside the kernel.
Action Modifiers
Action modifiers can be combined, for example:
ptr = kmalloc(size, __GFP_WAIT | __GFP_IO | __GFP_FS);
Zone Modifiers
Zone modifiers specify from which memory zone the allocation should originate.
On 32-bit systems with high memory, the following restriction applies:
You cannot specify __GFP_HIGHMEM to either __get_free_pages() or kmalloc(). Because these both return a logical address, and not a page structure, it is possible that these functions would allocate memory not currently mapped in the kernel’s virtual address space and, thus, does not have a logical address. Only alloc_pages() can allocate high memory.
Type Flags
The type flags specify the required action and zone modifiers to fulfill a particular type of transaction.
The type flags are as follows:
- GFP_KERNEL — the most commonly used flag. It can be used only in process context, because the allocation may sleep (for example, when memory is tight the kernel may need to swap pages out or flush dirty pages).
- GFP_ATOMIC — must not sleep, so none of the reclaim operations above are possible and the allocation is more likely to fail. It is the flag to use in interrupt handlers, softirqs, and tasklets.
- GFP_DMA — indicates that the allocation must come from ZONE_DMA.
- GFP_NOIO and GFP_NOFS — both may block during allocation, but GFP_NOIO will not initiate any disk I/O to satisfy the request, while GFP_NOFS may initiate disk I/O but will not initiate filesystem I/O.
The allocation could result in more filesystem operations, which would then beget other allocations and, thus, more filesystem operations! This could continue indefinitely. Code such as this that invokes the allocator must ensure that the allocator also does not execute it, or else the allocation can create a deadlock.
Which flag should you use, and when?
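The usual rule of thumb can be sketched as follows (buf and BUF_SIZE are hypothetical names):

```c
#include <linux/slab.h>

/* Process context, sleeping allowed: GFP_KERNEL is the default choice. */
buf = kmalloc(BUF_SIZE, GFP_KERNEL);

/* Interrupt handler, softirq, tasklet, or spinlock held: must not sleep. */
buf = kmalloc(BUF_SIZE, GFP_ATOMIC);

/* The buffer must be DMA-able and the code can sleep. */
buf = kmalloc(BUF_SIZE, GFP_DMA | GFP_KERNEL);
```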
Memory allocated with kmalloc() is freed with kfree():
void kfree(const void *ptr)
Do not call this function on memory not previously allocated with kmalloc(), or on memory that has already been freed.
vmalloc()
The vmalloc() function works similarly to kmalloc(), except that it allocates memory that is only virtually contiguous; the underlying physical pages need not be contiguous. For performance reasons, most kernel code uses kmalloc(): vmalloc() must set up page-table entries for its mappings, which can also cause TLB thrashing. vmalloc() is appropriate for large allocations, for example when a kernel module is loaded.
It is defined as follows:
//mm/vmalloc.c
#include <linux/vmalloc.h>
void *vmalloc(unsigned long size)
vmalloc() may sleep, so it cannot be used from interrupt context. The corresponding free function is vfree():
void vfree(const void *addr)
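A typical use is a large buffer that need not be physically contiguous (a sketch; the 16 MB size and function name are arbitrary):

```c
#include <linux/vmalloc.h>

static int big_buffer_demo(void)
{
	char *buf;

	buf = vmalloc(16 * 1024 * 1024);	/* 16 MB, virtually contiguous */
	if (!buf)
		return -ENOMEM;

	/* ... use buf ... */

	vfree(buf);
	return 0;
}
```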
The Slab Layer
The kernel frequently needs to allocate memory for fixed-size data structures.
The slab layer acts as a generic data structure-caching layer.
The concept of a slab allocator first appeared in SunOS. The slab layer is built on the following principles:
- Frequently used data structures tend to be allocated and freed often, so cache them
- Frequent allocation and deallocation can result in memory fragmentation
- To prevent this, the cached free lists are arranged contiguously. Because freed data structures return to the free list, there is no resulting fragmentation
- The free list provides improved performance during frequent allocation and deallocation because a freed object can be immediately returned to the next allocation.
- If the allocator is aware of concepts such as object size, page size, and total cache size, it can make more intelligent decisions
- If part of the cache is made per-processor, allocations and frees can be performed without an SMP lock.
- If the allocator is NUMA-aware, it can fulfill allocations from the same memory node as the requestor.
- Stored objects can be colored to prevent multiple objects from mapping to the same cache lines.
The slab layer in Linux is designed around exactly these principles.
Slab Implementation
The slab layer divides objects into groups called caches, each of which stores a different type of object. There is one cache per object type: for example, one cache holds process descriptors (a free list of task_struct structures), while another holds inode objects (struct inode). Each cache is then divided into slabs. A slab is composed of one or more physically contiguous pages, typically just one, and each cache may consist of multiple slabs. Each slab contains some number of objects, which are the data structures being cached.
Each slab is in one of three states: full, partial, or empty. A full slab has no free objects (all its objects are allocated); an empty slab has no allocated objects (all its objects are free); a partial slab has both allocated and free objects. When the kernel needs a new object, it allocates first from a partial slab, then from an empty slab; if neither exists, it creates a new slab. The figure below shows the relationship between caches, slabs, and objects.
In the kernel, each cache is represented by a kmem_cache structure. This structure holds three lists — slabs_full, slabs_partial, and slabs_empty — stored inside a kmem_list3 member. A slab is represented by struct slab:
struct slab {
	struct list_head list;		/* full, partial, or empty list */
	unsigned long colouroff;	/* offset for the slab coloring */
	void *s_mem;			/* first object in the slab */
	unsigned int inuse;		/* allocated objects in the slab */
	kmem_bufctl_t free;		/* first free object, if any */
};
Slab descriptors are allocated either outside the slab in a general cache or inside the slab itself, at the beginning. The descriptor is stored inside the slab if the total size of the slab is sufficiently small, or if internal slack space is sufficient to hold the descriptor.
The slab allocator creates new slabs via __get_free_pages():
static void *kmem_getpages(struct kmem_cache *cachep, gfp_t flags, int nodeid)
{
	struct page *page;
	void *addr;
	int i;

	flags |= cachep->gfpflags;
	if (likely(nodeid == -1)) {
		addr = (void *)__get_free_pages(flags, cachep->gfporder);
		if (!addr)
			return NULL;
		page = virt_to_page(addr);
	} else {
		/* NUMA: allocating from the requesting node helps performance */
		page = alloc_pages_node(nodeid, flags, cachep->gfporder);
		if (!page)
			return NULL;
		addr = page_address(page);
	}
	i = (1 << cachep->gfporder);
	if (cachep->flags & SLAB_RECLAIM_ACCOUNT)
		atomic_add(i, &slab_reclaim_pages);
	add_page_state(nr_slab, i);
	while (i--) {
		SetPageSlab(page);
		page++;
	}
	return addr;
}
The first parameter to this function points to the specific cache that needs more pages. The second parameter holds the flags passed on to __get_free_pages(), and cachep->gfporder gives the allocation order. Memory is freed by kmem_freepages(). The point of the slab layer, however, is to refrain from allocating and freeing pages: it invokes the page allocation function only when no partial or empty slab exists in the given cache, and it calls the freeing function only when available memory grows low and the system is attempting to free memory, or when a cache is explicitly destroyed.
All these details are hidden behind the slab allocator's interface functions, which makes it quite friendly to its users.
Slab Allocator Interface
A new cache is created with the following function:
struct kmem_cache * kmem_cache_create(const char *name,
size_t size,
size_t align,
unsigned long flags,
void (*ctor)(void *));
- The first parameter is a string storing the name of the cache
- The second parameter is the size of each element in the cache.
- The third parameter is the offset of the first object within a slab; it is usually 0.
- The flags parameter specifies optional settings controlling the cache's behavior; it can be 0.
One or more of the following flags can also be used:
- SLAB_HWCACHE_ALIGN—This flag instructs the slab layer to align each object within a slab to a cache line.
- SLAB_POISON—This flag causes the slab layer to fill the slab with a known value (a5a5a5a5). This is called poisoning and is useful for catching access to uninitialized memory.
- SLAB_RED_ZONE—This flag causes the slab layer to insert “red zones” around the allocated memory to help detect buffer overruns.
- SLAB_PANIC—This flag causes the slab layer to panic if the allocation fails. This flag is useful when the allocation must not fail, as in, say, allocating the VMA structure cache.
- SLAB_CACHE_DMA—This flag instructs the slab layer to allocate each slab in DMAable memory.
The final parameter, ctor, is a constructor for the cache, invoked whenever new pages are added to the cache; in practice it is normally set to NULL.
On success, kmem_cache_create() returns a pointer to the newly created cache; otherwise it returns NULL. The function must not be called from interrupt context. To destroy a cache, call
int kmem_cache_destroy(struct kmem_cache *cachep)
This is generally done when a module unloads. The function likewise cannot be used in interrupt context, and the caller must ensure two preconditions before calling it:
- All slabs in the cache are empty. Indeed, if an object in one of the slabs were still allocated and in use, how could the cache be destroyed?
- No one accesses the cache during (and obviously after) a call to kmem_cache_destroy().The caller must ensure this synchronization.
On success the function returns 0; otherwise it returns nonzero.
Allocating from the Cache
After a cache is created, objects can be allocated from it:
void * kmem_cache_alloc(struct kmem_cache *cachep, gfp_t flags)
This function returns a pointer to an object from the given cache. If no slab in the cache has a free object, the slab layer calls kmem_getpages() to obtain new pages; flags is passed through to __get_free_pages() and is typically GFP_KERNEL or GFP_ATOMIC.
To free an object and return it to its slab, use
void kmem_cache_free(struct kmem_cache *cachep, void *objp)
Example of Using the Slab Allocator
Consider the allocation of process descriptors when a process is created, from kernel/fork.c:
struct kmem_cache *task_struct_cachep;
task_struct_cachep = kmem_cache_create("task_struct",
                                       sizeof(struct task_struct),
                                       ARCH_MIN_TASKALIGN,
                                       SLAB_PANIC | SLAB_NOTRACK,
                                       NULL);
This creates a cache named task_struct. ARCH_MIN_TASKALIGN is usually defined as L1_CACHE_BYTES, and no constructor is given. The return value is not checked because SLAB_PANIC is specified: if the allocation fails, the kernel simply panics.
When a new process is created, do_fork() eventually calls dup_task_struct(), which does:
struct task_struct *tsk;
tsk = kmem_cache_alloc(task_struct_cachep, GFP_KERNEL);
if (!tsk)
return NULL;
When a process exits, free_task_struct() returns the task_struct to the task_struct_cachep cache:
kmem_cache_free(task_struct_cachep, tsk)
Because process descriptors are an essential part of the system, task_struct_cachep is never destroyed; but if it were, the call would look like this:
int err;
err = kmem_cache_destroy(task_struct_cachep);
if (err)
/* error destroying cache */
The slab layer hides the underlying details of alignment, coloring, allocation, and freeing from its users.
Static Allocation on the Stack
Unlike a process's user-space stack, the kernel stack cannot grow dynamically; it is small and fixed in size. The exact size is architecture-dependent: usually 8KB on 32-bit machines and 16KB on 64-bit machines.
Single-Page Kernel Stacks
Why use a single page for the process kernel stack?
First, it results in a page with less memory consumption per process. Second and most important is that as uptime increases, it becomes increasingly hard to find two physically contiguous unallocated pages.
Historically, interrupt handlers shared the kernel stack of the interrupted process. With single-page kernel stacks, however, interrupt handlers no longer use the process's kernel stack; instead, each processor has its own one-page interrupt stack. In any kernel function, stack usage should be kept to a minimum. There is no hard rule, but local variables should total at most a few hundred bytes; allocating large amounts of memory on the stack risks stack overflow, so large buffers should be allocated dynamically.
High Memory Mappings
By definition, pages in high memory might not be permanently mapped into the kernel’s address space.Thus, pages obtained via alloc_pages() with the __GFP_HIGHMEM flag might not have a logical address.
Permanent Mappings
To map a given page into the kernel's address space, use
#include <linux/highmem.h>
void *kmap(struct page *page)
This function works on either high or low memory. If the page structure belongs to a page in low memory, the page’s virtual address is simply returned. If the page resides in high memory, a permanent mapping is created and the address is returned.The function may sleep, so kmap() works only in process context.
To unmap the page, use
void kunmap(struct page *page)
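A minimal sketch of the map/use/unmap pattern (touch_page is a hypothetical name):

```c
#include <linux/highmem.h>
#include <linux/string.h>

/* Map a (possibly high-memory) page, zero it, then unmap it. */
static void touch_page(struct page *page)
{
	char *vaddr = kmap(page);	/* may sleep: process context only */

	memset(vaddr, 0, PAGE_SIZE);
	kunmap(page);
}
```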
Temporary Mappings
For times when a mapping must be created but the current context cannot sleep, the kernel provides temporary mappings (which are also called atomic mappings).
A temporary mapping is created with
void *kmap_atomic(struct page *page, enum km_type type)
The type parameter is one of the following:
# include <asm-generic/kmap_types.h>
enum km_type {
	KM_BOUNCE_READ,
	KM_SKB_SUNRPC_DATA,
	KM_SKB_DATA_SOFTIRQ,
	KM_USER0,
	KM_USER1,
	KM_BIO_SRC_IRQ,
	KM_BIO_DST_IRQ,
	KM_PTE0,
	KM_PTE1,
	KM_PTE2,
	KM_IRQ0,
	KM_IRQ1,
	KM_SOFTIRQ0,
	KM_SOFTIRQ1,
	KM_SYNC_ICACHE,
	KM_SYNC_DCACHE,
	KM_UML_USERCOPY,
	KM_IRQ_PTE,
	KM_NMI,
	KM_NMI_PTE,
	KM_TYPE_NR
};
This function does not sleep, so it can be used in interrupt context; it also disables kernel preemption. The mapping is undone with
void kunmap_atomic(void *kvaddr, enum km_type type)
which re-enables preemption.
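The same pattern with this 2.6-era temporary-mapping API looks roughly like this (touch_page_atomic is a hypothetical name; later kernels drop the km_type argument):

```c
#include <linux/highmem.h>
#include <linux/string.h>

/* Safe in interrupt context: kmap_atomic() disables preemption. */
static void touch_page_atomic(struct page *page)
{
	char *vaddr = kmap_atomic(page, KM_USER0);

	memset(vaddr, 0, PAGE_SIZE);
	kunmap_atomic(vaddr, KM_USER0);	/* re-enables preemption */
}
```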
Per-CPU Allocation
Modern SMP-capable operating systems use per-CPU data—data that is unique to a given processor—extensively.
Typically, per-CPU data is stored in an array, one entry per processor on the system. This mechanism has existed since the 2.4 kernel and is used like this:
unsigned long my_percpu[NR_CPUS];
// Then you access it as
int cpu;
cpu = get_cpu(); /* get current processor and disable kernel preemption */
my_percpu[cpu]++; /* ... or whatever */
printk("my_percpu on cpu=%d is %lu\n", cpu, my_percpu[cpu]);
put_cpu(); /* enable kernel preemption */
If the data is unique to each CPU, no locking is required. Kernel preemption is then the only remaining concern for per-CPU data, mainly in the following two scenarios:
- If your code is preempted and reschedules on another processor, the cpu variable is no longer valid because it points to the wrong processor. (In general, code cannot sleep after obtaining the current processor.)
- If another task preempts your code, it can concurrently access my_percpu on the same processor, which is a race condition.
Both problems are solved by get_cpu(), which disables kernel preemption; the matching put_cpu() re-enables it. Note that smp_processor_id(), which also returns the current CPU id, does not disable kernel preemption.
The New percpu Interface
The 2.6 kernel introduced a new interface, known as percpu, for creating and manipulating per-CPU data; the approach described above remains valid. The motivation for percpu:
This new interface, however, grew out of the needs for a simpler and more powerful method for manipulating per-CPU data on large symmetrical multiprocessing computers.
Defining a per-CPU variable at compile time:
DEFINE_PER_CPU(type, name);
This creates one instance of a variable of the given type and name for every processor.
If you need a declaration of the variable elsewhere, to avoid compile warnings, the following macro is your friend:
DECLARE_PER_CPU(type, name);
You manipulate such a per-CPU variable with get_cpu_var() and put_cpu_var(), which also disable and re-enable kernel preemption:
A call to get_cpu_var() returns an lvalue for the given variable on the current processor. It also disables preemption, which put_cpu_var() correspondingly enables.
To fetch another processor's copy of a per-CPU variable, use
per_cpu(name, cpu)++; /* increment name on the given processor */
Beware that per_cpu() neither disables preemption nor provides any locking; if another CPU may be manipulating the variable, you need a lock. Also, compile-time per-CPU variables are unsuitable for modules, because the linker places them in a special section named .data.percpu. If you need to create per-CPU data dynamically, use the methods below.
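Putting the compile-time interface together (hit_count and record_hit are hypothetical names):

```c
#include <linux/percpu.h>

DEFINE_PER_CPU(unsigned long, hit_count);	/* one copy per CPU */

static void record_hit(void)
{
	/* get_cpu_var() disables preemption and yields this CPU's copy */
	get_cpu_var(hit_count)++;
	put_cpu_var(hit_count);			/* re-enable preemption */
}
```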
Per-CPU Variables at Runtime
The kernel provides a set of interfaces, similar in spirit to kmalloc(), for dynamically allocating per-CPU data:
#include <linux/percpu.h>
void *alloc_percpu(type); /* a macro */
void *__alloc_percpu(size_t size, size_t align);
void free_percpu(const void *);
#define alloc_percpu(type) \
(typeof(type) __percpu *)__alloc_percpu(sizeof(type), \
__alignof__(type))
alloc_percpu() dynamically allocates one instance of the given type for every processor on the system; it is in fact a wrapper around __alloc_percpu(). The alloc_percpu() macro aligns the allocation on a byte boundary equal to the natural alignment of the given type:
The alignof construct is a gcc feature that returns the required (or recommended, in the case of weird architectures with no alignment requirements) alignment in bytes for a given type or lvalue.
The corresponding free function is free_percpu().
alloc_percpu() and __alloc_percpu() return a pointer used to indirectly reference the dynamically created per-CPU data. The kernel provides two macros to make the dereference easy:
get_cpu_ptr(ptr); /* return a pointer to this processor's copy of ptr; disables preemption */
put_cpu_ptr(ptr); /* done; re-enable kernel preemption */
Usage example:
void *percpu_ptr;
unsigned long *foo;
percpu_ptr = alloc_percpu(unsigned long);
if (!percpu_ptr)
/* error allocating memory .. */
foo = get_cpu_ptr(percpu_ptr);
/* manipulate foo .. */
put_cpu_ptr(percpu_ptr);
Why Use Per-CPU Data?
The first is the reduction in locking requirements. Depending on the semantics by which processors access the per-CPU data, you might not need any locking at all. Second, per-CPU data greatly reduces cache invalidation. The percpu interface cache-aligns all data to ensure that accessing one processor's data does not bring in another processor's data on the same cache line. Consequently, the use of per-CPU data often removes (or at least minimizes) the need for locking. Per-CPU data can safely be used from either interrupt or process context. Note, however, that you cannot sleep in the middle of accessing per-CPU data (or else you might end up on a different processor).