blog Linux 内核视角下的进程

进程调度通过在进程之间共享 CPU 时间，达到并行执行的错觉。

下面是 task_struct 结构中和调度相关的几个成员。

// tags/v4.18 - include/linux/sched.h
struct task_struct {
...
	int				prio
	int				static_prio;
	int				normal_prio;
	unsigned int			rt_priority;

	const struct sched_class	*sched_class;
	struct sched_entity		se;
	struct sched_rt_entity		rt;
...

	unsigned int			policy;
...
}

prio、static_prio、normal_prio 和 rt_priority 表示进程优先级；
sched_class 指向进程所属的调度器类；
se 和 rt 分别是普通进程和实时进程的可调度实体，内嵌在 task_struct 结构里面，调度器作用于一个可调度实体，而不是直接操作进程；
policy 保存了对该进程应用的调度策略。

Linux 将进程分为普通进程和实时进程两大类，根据分类的不同，采取不同的调度策略和调度器类。调度策略是决定调度程序在何时让什么进程执行的算法，而调度器类可以理解为实现了调度策略的执行逻辑。

Jun 18 '20 13:06 junnplus

进程优先级

Linux 采用了两种不同的优先级范围。第一种是用 nice 值，它的范围是从 -20 到 +19，默认值为 0；越大的 nice 值意味着更低的优先级。第二种范围是实时优先级，其值是可配置的，默认情况下它的变化范围是从 0 到 99。与 nice 值意义相反，越高的实时优先级数值意味着进程优先级越高。任何实时进程的优先级都高于普通进程，也就是说实时优先级和 nice 优先级处于互不相交的两个范畴。 ---《Linux 内核设计与实现》

task_struct 中具体表示优先级的4个值：

static_prio 表示进程的静态优先级，在 Linux 内核中，静态优先级的取值范围 [100, 139] 对应 nice 值 [-20, +19]，它可以用 nice 和 sched_setscheduler 系统调用修改，否则在进程运行期间会一直保持恒定。
normal_priority 表示基于进程的静态优先级和调度策略计算出的普通优先级。因此，即使普通进程和实时进程具有相同的静态优先级，其普通优先级也是不同的。
prio 表示进程的动态优先值。
rt_priority 表示实时进程的优先级，范围是 [0, 99]，值越大，优先级越高。

各个优先级范围定义在 include/linux/sched/prio.h。

// tags/v4.18 - include/linux/sched/prio.h
#define MAX_NICE	19
#define MIN_NICE	-20
#define NICE_WIDTH	(MAX_NICE - MIN_NICE + 1)

/*
 * Priority of a process goes from 0..MAX_PRIO-1, valid RT
 * priority is 0..MAX_RT_PRIO-1, and SCHED_NORMAL/SCHED_BATCH
 * tasks are in the range MAX_RT_PRIO..MAX_PRIO-1. Priority
 * values are inverted: lower p->prio value means higher priority.
 *
 * The MAX_USER_RT_PRIO value allows the actual maximum
 * RT priority to be separate from the value exported to
 * user-space.  This allows kernel threads to set their
 * priority to a value higher than any user task. Note:
 * MAX_RT_PRIO must not be smaller than MAX_USER_RT_PRIO.
 */

#define MAX_USER_RT_PRIO	100
#define MAX_RT_PRIO		MAX_USER_RT_PRIO

#define MAX_PRIO		(MAX_RT_PRIO + NICE_WIDTH)
#define DEFAULT_PRIO		(MAX_RT_PRIO + NICE_WIDTH / 2)

Linux 内核内部使用 0~139 的数值范围表示内部优先级，MAX_RT_PRIO 指定实时进程的最大优先级，而 MAX_PRIO 则是普通进程的最大优先值数值。内部表示的优先级数值中，数值越小，优先级越大。

0                                                 99     NICE_WIDTH   139
+---------------------------------------------------+-------------------+
|                                                   |                   |
+---------------------------------------------------+---------+---------+
                                                MAX_RT_PRIO   ^       MAX_PRIO
                                                   100        |        140
                                                         DEFAULT_PRIO
                                                             120

计算普通优先级：

// tags/v4.18 - kernel/sched/core.c
/*
 * Calculate the expected normal priority: i.e. priority
 * without taking RT-inheritance into account. Might be
 * boosted by interactivity modifiers. Changes upon fork,
 * setprio syscalls, and whenever the interactivity
 * estimator recalculates.
 */
static inline int normal_prio(struct task_struct *p)
{
	int prio;

	if (task_has_dl_policy(p))
		prio = MAX_DL_PRIO-1;
	else if (task_has_rt_policy(p))
		prio = MAX_RT_PRIO-1 - p->rt_priority;
	else
		prio = __normal_prio(p);
	return prio;
}

/*
 * __normal_prio - return the priority that is based on the static prio
 */
static inline int __normal_prio(struct task_struct *p)
{
	return p->static_prio;
}

普通优先级需要根据普通进程和实时进程进行不同的计算。

普通进程的普通优先级和静态优先级一致，通过 __normal_prio 函数实现。
实时进程中还区分是否是 SCHED_DEADLINE 策略的实时进程，非 SCHED_DEADLINE 策略的实时进程需要根据其 rt_priority 设置。由于更高的 rt_priority 值表示更高的实时优先级，内核内部优先级的表示刚好相反，越低的值表示的优先级越高。SCHED_DEADLINE 策略的进程普通优先值等于 MAX_DL_PRIO-1，SCHED_DEADLINE 策略的进程优先级最高，所以 MAX_DL_PRIO 被定义为 0。

// tags/v4.18 - include/linux/sched/deadline.h
/*
 * SCHED_DEADLINE tasks has negative priorities, reflecting
 * the fact that any of them has higher prio than RT and
 * NORMAL/BATCH tasks.
 */

#define MAX_DL_PRIO		0

Jun 18 '20 16:06 junnplus

调度策略

我们来看下 Linux 中提供了以下几种调度策略：

// tags/v4.18 - include/uapi/linux/sched.h
/*
 * Scheduling policies
 */
#define SCHED_NORMAL		0
#define SCHED_FIFO		1
#define SCHED_RR		2
#define SCHED_BATCH		3
/* SCHED_ISO: reserved but not implemented yet */
#define SCHED_IDLE		5
#define SCHED_DEADLINE		6

SCHED_NORMAL、SCHED_BATCH 和 SCHED_IDLE 都用于普通进程的调度策略，Linux 在 2.6.23 内核版本之后使用完全公平调度算法（CFS）来处理普通进程的调度。SCHED_BATCH 调度策略主要用于非交互、CPU使用密集的批处理进程，CFS 对于这类进程不会采用抢占式的调度，因此不会干扰交互式进程。SCHED_IDLE 优先级最低的调度策略，只有在 CPU 空闲的时候才会被调度 idle 进程。
SCHED_FIFO 和 SCHED_RR 用于实时进程的调度策略，SCHED_FIFO 基于先进先出机制，SCHED_RR 则是基于时间片的轮询方式调度，采取 SCHED_FIFO 策略的进程会一直执行到更高优先级的实时进程来抢占。
SCHED_DEADLINE 是 3.14 内核版本新支持的实时进程调度策略，基于 EDS（Earliest deadline first）算法调度实时进程，针对突发型计算，且适用于延迟和完成时间高度敏感的任务。

Jun 18 '20 16:06 junnplus

调度器类

Linux 中以模块的方式提供调度器类（scheduler class），这样不同类型的进程可以有针对的选择调度算法/策略。

对于各个调度器类，都必须是实现 sched_class 的一个实例。

// tags/v4.18 - kernel/sched/sched.h
struct sched_class {
	const struct sched_class *next;

	void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags);
	void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags);
	void (*yield_task)   (struct rq *rq);
	bool (*yield_to_task)(struct rq *rq, struct task_struct *p, bool preempt);

	void (*check_preempt_curr)(struct rq *rq, struct task_struct *p, int flags);

	/*
	 * It is the responsibility of the pick_next_task() method that will
	 * return the next task to call put_prev_task() on the @prev task or
	 * something equivalent.
	 *
	 * May return RETRY_TASK when it finds a higher prio class has runnable
	 * tasks.
	 */
	struct task_struct * (*pick_next_task)(struct rq *rq,
					       struct task_struct *prev,
					       struct rq_flags *rf);
	void (*put_prev_task)(struct rq *rq, struct task_struct *p);
...

	void (*set_curr_task)(struct rq *rq);
	void (*task_tick)(struct rq *rq, struct task_struct *p, int queued);
	void (*task_fork)(struct task_struct *p);
	void (*task_dead)(struct task_struct *p);
...
}

下面是各个调度器类提供的操作：

enqueue_task 向就绪队列添加一个进程，在进程从睡眠状态变为可运行状态时调用。
dequeue_task 将一个进程从就绪队列移出。
yield_task 进程放弃处理器时调用。
check_preempt_curr 用一个新唤醒的进程来抢占当前进程。
pick_next_task 选择下一个将要运行的进程。
put_prev_task 用另一个进程替换当前运行的进程前调用，不等价于 dequeue_task，而是提供一个时机，用于时间记账。
set_curr_task 进程的调度策略发生变化时调用。
task_tick 周期性调度器激活时调用。
task_fork 进程 fork 时通知调度器。
task_dead 进程死亡时通知调度器。

Linux 中实现了多种调度器类，调度器类是有优先级的，内核中所有存在的调度类都按照它们的优先级放在一个链表上。结构体中的 next 指针用来指向下一较低优先级的调度器类。

// tags/v4.18 - kernel/sched/sched.h
#ifdef CONFIG_SMP
#define sched_class_highest (&stop_sched_class)
#else
#define for_each_class(class) \
   for (class = sched_class_highest; class; class = class->next)

extern const struct sched_class stop_sched_class;
extern const struct sched_class dl_sched_class;
extern const struct sched_class rt_sched_class;
extern const struct sched_class fair_sched_class;
extern const struct sched_class idle_sched_class;

// tags/v4.18 - kernel/sched/stop_task.c
/*
 * Simple, special scheduling class for the per-CPU stop tasks:
 */
const struct sched_class stop_sched_class = {
	.next			= &dl_sched_class,
...
}

// tags/v4.18 - kernel/sched/deadline.c
const struct sched_class dl_sched_class = {
	.next			= &rt_sched_class,
...
}

// tags/v4.18 - kernel/sched/rt.c
const struct sched_class rt_sched_class = {
	.next			= &fair_sched_class,
...
}

// tags/v4.18 - kernel/sched/fair.c
const struct sched_class fair_sched_class = {
	.next			= &idle_sched_class,
...
}

// tags/v4.18 - kernel/sched/idle.c
/*
 * Simple, special scheduling class for the per-CPU idle tasks:
 */
const struct sched_class idle_sched_class = {
	/* .next is NULL */
...
}

stop_sched_class 调度器类用在多处理器系统上，这边暂时不涉及，前面 sched_class 也省略了针对多处理器系统的扩展。调度器类在编译期建立的，优先级也是确定的，实时进程的优先级高于普通进程，deadline 的实时进程优先级高于普通实时进程。

stop_sched_class     dl_sched_class      rt_sched_class     fair_sched_class    idle_sched_class
 +------------+  +-->+------------+  +-->+------------+  +-->+------------+  +-->+------------+
 |    next    +--+   |    next    +--+   |    next    +--+   |    next    +--+   |    next    +--> NULL
 |            |      |            |      |            |      |            |      |            |
 +------------+      +------------+      +------------+      +------------+      +------------+

调度器类对应的调度策略：

dl_sched_class：SCHED_DEADLINE
rt_sched_class：SCHED_FIFO、SCHED_RR
fair_sched_class：SCHED_NORMAL、SCHED_BATCH
idle_sched_class：SCHED_IDLE

Jun 18 '20 16:06 junnplus

调度实体

调度器作用于可调度实体，不限于调度进程。普通进程的一个实体由 sched_entity 结构来表示。

// tags/v4.18 - include/linux/sched.h
struct sched_entity {
	/* For load-balancing: */
	struct load_weight		load;
	unsigned long			runnable_weight;
	struct rb_node			run_node;
	struct list_head		group_node;
	unsigned int			on_rq;

	u64				exec_start;
	u64				sum_exec_runtime;
	u64				vruntime;
	u64				prev_sum_exec_runtime;
...
}

load 指定了负荷权重，决定了各个实体占队列总负荷的比例；
run_node 是红黑树结点，使得实体可以在红黑树上排序，可见普通进程的运行队列是一棵红黑树；
on_rq 表示该实体当前是否在运行队列上接受调度；

负荷权重

// tags/v4.18 - include/linux/sched.h
struct load_weight {
	unsigned long			weight;
	u32				inv_weight;
};

实时进程的实体由 sched_rt_entity 表示：

// tags/v4.18 - include/linux/sched.h
struct sched_rt_entity {
	struct list_head		run_list;
	unsigned long			timeout;
	unsigned long			watchdog_stamp;
	unsigned int			time_slice;
	unsigned short			on_rq;
	unsigned short			on_list;

	struct sched_rt_entity		*back;
...

实时进程的调度实体中 run_list 是一个链表节点，可见实时进程的运行队列是由双向链表组成。

Jun 19 '20 08:06 junnplus

运行队列

sched_class 调度器类提供大多数操作是基于 rq（运行队列 runqueue）实现的，每个 CPU 都有自己的一个主运行队列，数据结构定义在 kernel/sched/sched.h：

// tags/v4.18 - kernel/sched/sched.h
/*
 * This is the main, per-CPU runqueue data structure.
 *
 * Locking rule: those places that want to lock multiple runqueues
 * (such as the load balancing or the thread migration code), lock
 * acquire operations must be ordered by ascending &runqueue.
 */
struct rq {
	/* runqueue lock: */
	raw_spinlock_t		lock;

	/*
	 * nr_running and cpu_load should be in the same cacheline because
	 * remote CPUs use both these fields when doing load calculation.
	 */
	unsigned int		nr_running;
...
	#define CPU_LOAD_IDX_MAX 5
	unsigned long		cpu_load[CPU_LOAD_IDX_MAX];

...
	/* capture load from *all* tasks on this CPU: */
	struct load_weight	load;
	unsigned long		nr_load_updates;
	u64			nr_switches;

	struct cfs_rq		cfs;
	struct rt_rq		rt;
	struct dl_rq		dl;
...
	u64			clock;
	u64			clock_task;
}

lock 锁，用来同步当前 CPU 上的调度操作；
nr_running 表示运行队列上可运行进程的数目；
cpu_load 用于跟踪此前的负荷状态；
load 提供了运行队列当前负荷的度量，是当前 CPU 上所有可运行进程的 load 之和；
nr_load_updates 记录 load 更新次数；
nr_switches 记录上下文切换次数；
cfs、rt 和 dl 是嵌入的子运行队列，分别用于完全公平调度器、实时调度器和 deadline 实时调度器；
clock 表示运行队列自身的时钟；
clock_task 表示运行队列中实际执行 task 的时间。

系统的所有运行队列都在 runqueues 数组中，该数组的每个元素分别对应系统中的一个 CPU，表示每个 CPU 上的运行队列。

// tags/v4.18 - kernel/sched/sched.h
DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);

#define cpu_rq(cpu)		(&per_cpu(runqueues, (cpu)))
#define this_rq()		this_cpu_ptr(&runqueues)
#define task_rq(p)		cpu_rq(task_cpu(p))
#define cpu_curr(cpu)		(cpu_rq(cpu)->curr)
#define raw_rq()		raw_cpu_ptr(&runqueues)

内核提供了像 cpu_rq 这样的宏来操作运行队列。

cfs_rq 是针对完全公平调度器的运行队列：

// tags/v4.18 - kernel/sched/sched.h
/* CFS-related fields in a runqueue */
struct cfs_rq {
	struct load_weight	load;
	unsigned long		runnable_weight;
	unsigned int		nr_running;
	unsigned int		h_nr_running;

	u64			exec_clock;
	u64			min_vruntime;

	struct rb_root_cached	tasks_timeline;

	/*
	 * 'curr' points to currently running entity on this cfs_rq.
	 * It is set to NULL otherwise (i.e when none are currently running).
	 */
	struct sched_entity	*curr;
	struct sched_entity	*next;
	struct sched_entity	*last;
	struct sched_entity	*skip;
...

load 表示 cfs 运行队列的负荷；
nr_running 表示 cfs 运行队列上可运行进程的数目；
min_vruntime 记录了队列上所有进程的最小虚拟运行时间；
tasks_timeline 表示一棵按时间排序的红黑树，管理运行队列上的所有调度实体；
curr、next、last 和 skip 分别指向当前运行的调度实体、下一个可运行的调度实体、运行队列中最后的调度实体和跳过运行的调度实体。

Jun 19 '20 10:06 junnplus

到目前为止，有了调度器类，调度器类操作的运行队列以及可调度实体，那开始调度吧。 Linux 提供两种激活调度的调度器，主调度器和周期性调度器。

周期性调度器

周期性调度器在 scheduler_tick 中实现，内核会按照频率 HZ 自动调用该函数。

// tags/v4.18 - kernel/sched/core.c
/*
 * This function gets called by the timer code, with HZ frequency.
 * We call it with interrupts disabled.
 */
void scheduler_tick(void)
{
	int cpu = smp_processor_id();
	struct rq *rq = cpu_rq(cpu);
	struct task_struct *curr = rq->curr;
...
	update_rq_clock(rq);
	curr->sched_class->task_tick(rq, curr, 0);
	cpu_load_update_active(rq);
...
}

周期性调度函数完成两个任务：

管理内核中与整个系统和各个进程调度相关的统计量；
激活负责当前进程调度类的周期性调度方法。

update_rq_clock 处理运行队列的时钟的更新，本质上就是增加运行队列的当前时钟的时间戳。

// tags/v4.18 - kernel/sched/core.c
void update_rq_clock(struct rq *rq)
{
	s64 delta;
...

	delta = sched_clock_cpu(cpu_of(rq)) - rq->clock;
	if (delta < 0)
		return;
	rq->clock += delta;
	update_rq_clock_task(rq, delta);
}

/*
 * RQ-clock updating methods:
 */

static void update_rq_clock_task(struct rq *rq, s64 delta)
{
...
#ifdef CONFIG_IRQ_TIME_ACCOUNTING
	irq_delta = irq_time_read(cpu_of(rq)) - rq->prev_irq_time;

	if (irq_delta > delta)
		irq_delta = delta;

	rq->prev_irq_time += irq_delta;
	delta -= irq_delta;
#endif

	rq->clock_task += delta;
...
}

运行队列有两个时钟，clock_task 是为了更准确的记录进程执行时间。update_rq_clock_task 用来更新 clock_task 时钟，可见进程发生中断的时间不会记录在 clock_task 时钟。

cpu_load_update_active 负责更新运行队列的 cpu_load[] 数组。本质上相当于将数组中先前存储的负荷值向后移动一个位置，把当前运行队列的负荷记入数组的第一个位置。

激活当前进程调度类的周期性调度方法 task_tick 的实现取决于底层的调度器类，通常会检查是否需要被抢占，通过设置TIF_NEED_RESCHED 标志来通知内核在必要的时间由主调度器完成真正的调度工作, 此种做法称之为延迟调度策略。

TIF_NEED_RESCHED

内核在 thread_info 结构的 flag 中设置了一个标识来标志进程是否需要重新调度。

// tags/v4.18 - include/linux/thread_info.h
static inline int test_ti_thread_flag(struct thread_info *ti, int flag)
{
	return test_bit(flag, (unsigned long *)&ti->flags);
}
...
#define test_thread_flag(flag) \
	test_ti_thread_flag(current_thread_info(), flag)

#define tif_need_resched() test_thread_flag(TIF_NEED_RESCHED)

如果设置 TIF_NEED_RESCHED 标志，内核将会在接下来的适当时机调用主调度器来完成调度和抢占的工作。

主调度器

在内核中，如果需要将 CPU 分配给与当前活动进程不同的另一个进程，都会直接调用主调度器函数 schedule。

// tags/v4.18 - kernel/sched/core.c
asmlinkage __visible void __sched schedule(void)
{
	struct task_struct *tsk = current;

	sched_submit_work(tsk);
	do {
		preempt_disable();
		__schedule(false);
		sched_preempt_enable_no_resched();
	} while (need_resched());
}

schedule 函数完成如下工作:

禁用内核抢占，调用 __schedule 完成内核调度；
开启内核抢占，检查当前进程是否设置了重调度标志 TLF_NEDD_RESCHED，开始新一轮调度。

// tags/v4.18 - kernel/sched/core.c
/*
 * __schedule() is the main scheduler function.
 *
 * The main means of driving the scheduler and thus entering this function are:
 *
 *   1. Explicit blocking: mutex, semaphore, waitqueue, etc.
 *
 *   2. TIF_NEED_RESCHED flag is checked on interrupt and userspace return
 *      paths. For example, see arch/x86/entry_64.S.
 *
 *      To drive preemption between tasks, the scheduler sets the flag in timer
 *      interrupt handler scheduler_tick().
 *
 *   3. Wakeups don't really cause entry into schedule(). They add a
 *      task to the run-queue and that's it.
 *
 *      Now, if the new task added to the run-queue preempts the current
 *      task, then the wakeup sets TIF_NEED_RESCHED and schedule() gets
 *      called on the nearest possible occasion:
 *
 *       - If the kernel is preemptible (CONFIG_PREEMPT=y):
 *
 *         - in syscall or exception context, at the next outmost
 *           preempt_enable(). (this might be as soon as the wake_up()'s
 *           spin_unlock()!)
 *
 *         - in IRQ context, return from interrupt-handler to
 *           preemptible context
 *
 *       - If the kernel is not preemptible (CONFIG_PREEMPT is not set)
 *         then at the next:
 *
 *          - cond_resched() call
 *          - explicit schedule() call
 *          - return from syscall or exception to user-space
 *          - return from interrupt-handler to user-space
 *
 * WARNING: must be called with preemption disabled!
 */
static void __sched notrace __schedule(bool preempt)
{
	struct task_struct *prev, *next;
	unsigned long *switch_count;
	struct rq_flags rf;
	struct rq *rq;
	int cpu;

	cpu = smp_processor_id();
	rq = cpu_rq(cpu);
	prev = rq->curr;
...
	update_rq_clock(rq);

	switch_count = &prev->nivcsw;
	if (!preempt && prev->state) {
		if (unlikely(signal_pending_state(prev->state, prev))) {
			prev->state = TASK_RUNNING;
		} else {
			deactivate_task(rq, prev, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK);
			prev->on_rq = 0;

			if (prev->in_iowait) {
				atomic_inc(&rq->nr_iowait);
				delayacct_blkio_start();
			}

			/*
			 * If a worker went to sleep, notify and ask workqueue
			 * whether it wants to wake up a task to maintain
			 * concurrency.
			 */
			if (prev->flags & PF_WQ_WORKER) {
				struct task_struct *to_wakeup;

				to_wakeup = wq_worker_sleeping(prev);
				if (to_wakeup)
					try_to_wake_up_local(to_wakeup, &rf);
			}
		}
		switch_count = &prev->nvcsw;
	}

	next = pick_next_task(rq, prev, &rf);
	clear_tsk_need_resched(prev);
	clear_preempt_need_resched();

	if (likely(prev != next)) {
		rq->nr_switches++;
		rq->curr = next;
		/*
		 * The membarrier system call requires each architecture
		 * to have a full memory barrier after updating
		 * rq->curr, before returning to user-space.
		 *
		 * Here are the schemes providing that barrier on the
		 * various architectures:
		 * - mm ? switch_mm() : mmdrop() for x86, s390, sparc, PowerPC.
		 *   switch_mm() rely on membarrier_arch_switch_mm() on PowerPC.
		 * - finish_lock_switch() for weakly-ordered
		 *   architectures where spin_unlock is a full barrier,
		 * - switch_to() for arm64 (weakly-ordered, spin_unlock
		 *   is a RELEASE barrier),
		 */
		++*switch_count;

		trace_sched_switch(preempt, prev, next);

		/* Also unlocks the rq: */
		rq = context_switch(rq, prev, next, &rf);
...

Jun 19 '20 13:06 junnplus

blog
blog copied to clipboard

Linux 内核视角下的进程 - 进程调度

进程优先级

调度策略

调度器类

调度实体

负荷权重

运行队列

周期性调度器

TIF_NEED_RESCHED

主调度器

blog blog copied to clipboard

Linux 内核视角下的进程 - 进程调度

进程优先级

调度策略

调度器类

调度实体

负荷权重

运行队列

周期性调度器

TIF_NEED_RESCHED

主调度器

blog
blog copied to clipboard