Chapter 11 Timers and Time Management
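Time matters a great deal to the kernel. Many kernel functions are time-driven, such as balancing the scheduler runqueues and refreshing the screen.
Kernel Notion of Time
The system timer fires at a preprogrammed frequency called the tick rate. The kernel keeps track of two kinds of time: wall time and system uptime.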
Wall time—the actual time of day—is important to user-space applications. The system uptime—the relative time since the system booted—is useful to both kernel-space and user-space.
The timer interrupt fires periodically, and its handler does the following:
- Updating the system uptime
- Updating the time of day
- On an SMP system, ensuring that the scheduler runqueues are balanced and, if not, balancing them
- Running any dynamic timers that have expired
- Updating resource usage and processor time statistics
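The Tick Rate: HZ
The system timer runs at a preprogrammed frequency, HZ, from boot onward. HZ can differ between architectures; with HZ=100 on x86, for example, the timer interrupt fires 100 times per second. Raising the tick rate produces more interrupts, which brings several benefits: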
- The timer interrupt has a higher resolution and, consequently, all timed events have a higher resolution.
- The accuracy of timed events improves.
- System calls such as poll() and select() that optionally employ a timeout value execute with improved precision.
- Measurements, such as resource usage or the system uptime, are recorded with a finer resolution.
- Process preemption occurs more accurately.
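Everything has two sides, and a larger HZ also has drawbacks:
- A higher tick rate means more frequent timer interrupts, which means higher overhead
- The processor spends more time in the interrupt handler, which causes cache thrashing
- Higher power consumption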
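The trade-off above comes down to simple arithmetic: one tick lasts 1/HZ seconds, and the number of interrupts grows linearly with HZ. The following is a minimal user-space sketch of that arithmetic; the function names are illustrative and are not kernel APIs.

```c
#include <assert.h>

/* Length of one tick in microseconds for a given tick rate (HZ). */
static unsigned long tick_period_us(unsigned long hz)
{
    assert(hz > 0 && hz <= 1000000UL);
    return 1000000UL / hz;
}

/* Number of timer interrupts one CPU handles in the given number of seconds. */
static unsigned long long ticks_in_seconds(unsigned long hz, unsigned long seconds)
{
    return (unsigned long long)hz * seconds;
}
```

With HZ=100 a tick lasts 10 ms; with HZ=1000 it lasts 1 ms, at the price of ten times as many interrupts over the same interval.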
Jiffies
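The global variable jiffies holds the number of ticks that have occurred since the system booted. Every timer interrupt increments it by one, so the system uptime in seconds is jiffies/HZ. jiffies is declared as: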
extern unsigned long volatile jiffies;
To convert seconds to jiffies, compute seconds * HZ; to convert jiffies to seconds, compute jiffies / HZ. Some usage examples:
unsigned long time_stamp = jiffies; /* now */
unsigned long next_tick = jiffies + 1; /* one tick from now */
unsigned long later = jiffies + 5*HZ; /* five seconds from now */
unsigned long fraction = jiffies + HZ / 10; /* a tenth of a second from now */
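The conversions can be tried outside the kernel. The sketch below is a user-space model with hypothetical helper names; it takes the tick rate as a parameter instead of using the kernel's HZ macro. It also estimates how long a 32-bit tick counter runs before it wraps, which is the motivation for the wraparound handling discussed next.

```c
#include <assert.h>
#include <stdint.h>

/* Seconds to ticks for a given tick rate. */
static uint64_t secs_to_ticks(uint64_t secs, unsigned hz)
{
    return secs * hz;
}

/* Ticks to whole seconds; integer division drops the partial second. */
static uint64_t ticks_to_secs(uint64_t ticks, unsigned hz)
{
    assert(hz > 0);
    return ticks / hz;
}

/* Whole days until a 32-bit tick counter that starts at zero wraps. */
static unsigned wrap_days_32bit(unsigned hz)
{
    assert(hz > 0);
    return (unsigned)((UINT64_C(1) << 32) / hz / 86400);
}
```

At 100 ticks per second a 32-bit counter lasts roughly 497 days; at 1000 ticks per second it lasts under 50 days.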
Jiffies Wraparound
Consider the following code:
unsigned long timeout = jiffies + HZ/2; /* timeout in 0.5s */
/* do some work ... */
/* then see whether we took too long */
if (timeout > jiffies) {
    /* we did not time out, good ... */
} else {
    /* we timed out, error ... */
}
If timeout is compared with jiffies directly and jiffies wraps around after timeout has been computed, jiffies restarts from zero. It is then smaller than timeout even though it logically represents a later time, so the first if condition wrongly concludes that no timeout occurred.
Thankfully, the kernel provides four macros for comparing tick counts that correctly handle wraparound in the tick count. They are in <linux/jiffies.h>.
#define time_after(unknown, known) ((long)(known) - (long)(unknown) < 0)
#define time_before(unknown, known) ((long)(unknown) - (long)(known) < 0)
#define time_after_eq(unknown, known) ((long)(unknown) - (long)(known) >= 0)
#define time_before_eq(unknown, known) ((long)(known) - (long)(unknown) >= 0)
The time_after(unknown, known) macro returns true if time unknown is after time known; otherwise, it returns false. The time_before(unknown, known) macro returns true if time unknown is before time known; otherwise, it returns false.
The earlier code rewritten with these macros:
unsigned long timeout = jiffies + HZ/2; /* timeout in 0.5s */
/* ... */
if (time_before(jiffies, timeout)) {
    /* we did not time out, good ... */
} else {
    /* we timed out, error ... */
}
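To see why the signed-difference trick works, the following user-space sketch models a 32-bit tick counter. The names tick_after(), tick_before(), and naive_not_timed_out() are illustrative stand-ins, not the kernel macros. The sketch subtracts in unsigned arithmetic and then reinterprets the result as signed, which assumes the usual two's-complement conversion.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if tick count a is after tick count b, even across a wrap. */
static bool tick_after(uint32_t a, uint32_t b)
{
    return (int32_t)(uint32_t)(b - a) < 0;
}

/* True if tick count a is before tick count b. */
static bool tick_before(uint32_t a, uint32_t b)
{
    return tick_after(b, a);
}

/* The naive comparison from the first listing. */
static bool naive_not_timed_out(uint32_t now, uint32_t timeout)
{
    return timeout > now;
}

static void wraparound_demo(void)
{
    uint32_t timeout = UINT32_C(0xFFFFFF00) + 0xFA; /* 0xFFFFFFFA */
    uint32_t now = 5;                               /* counter has wrapped */

    assert(naive_not_timed_out(now, timeout)); /* wrong: reports no timeout */
    assert(!tick_before(now, timeout));        /* right: the timeout passed */
}
```

The comparison is only meaningful while the two tick counts are less than half the counter range apart, which holds for any realistic timeout.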
RTC
On boot, the kernel reads the RTC and uses it to initialize the wall time, which is stored in the xtime variable.
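The Timer Interrupt Handler
The timer interrupt is split into two parts: an architecture-dependent part and an architecture-independent part. On most architectures, the architecture-dependent handler does the following: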
- Obtain the xtime_lock lock, which protects access to jiffies_64 and the wall time value, xtime.
- Acknowledge or reset the system timer as required.
- Periodically save the updated wall time to the real time clock
- Call the architecture-independent timer routine, tick_periodic()
Most of the remaining work is done by tick_periodic():
- Increment the jiffies_64 count by one.
- Update resource usages, such as consumed system and user time, for the currently running process.
- Run any dynamic timers that have expired
- Execute scheduler_tick()
- Update the wall time, which is stored in xtime
- Calculate the infamous load average.
A simplified version of tick_periodic():
static void tick_periodic(int cpu)
{
    if (tick_do_timer_cpu == cpu) {
        write_seqlock(&xtime_lock);
        /* Keep track of the next tick event */
        tick_next_period = ktime_add(tick_next_period, tick_period);
        do_timer(1);
        write_sequnlock(&xtime_lock);
    }
    update_process_times(user_mode(get_irq_regs()));
    profile_tick(CPU_PROFILING);
}
Most of the work is done by do_timer() and update_process_times(). The former performs the actual increment of jiffies_64:
void do_timer(unsigned long ticks)
{
    jiffies_64 += ticks;
    update_wall_time();
    calc_global_load();
}
update_wall_time() updates the wall time, and calc_global_load() updates the system load average statistics.
void update_process_times(int user_tick)
{
    struct task_struct *p = current;
    int cpu = smp_processor_id();
    /* Note: this timer irq context must be accounted for as well. */
    account_process_tick(p, user_tick);
    run_local_timers();
    rcu_check_callbacks(cpu, user_tick);
    printk_tick();
    scheduler_tick();
    run_posix_cpu_timers(p);
}
Next, account_process_tick() updates the process's time accounting:
void account_process_tick(struct task_struct *p, int user_tick)
{
    cputime_t one_jiffy_scaled = cputime_to_scaled(cputime_one_jiffy);
    struct rq *rq = this_rq();
    if (user_tick)
        account_user_time(p, cputime_one_jiffy, one_jiffy_scaled);
    else if ((p != rq->idle) || (irq_count() != HARDIRQ_OFFSET))
        account_system_time(p, HARDIRQ_OFFSET, cputime_one_jiffy,
                            one_jiffy_scaled);
    else
        account_idle_time(cputime_one_jiffy);
}
You might realize that this approach implies that the kernel credits a process for running the entire previous tick in whatever mode the processor was in when the timer interrupt occurred. In reality, the process might have entered and exited kernel mode many times during the last tick. In fact, the process might not even have been the only process running in the last tick! This granular process accounting is classic Unix, and without much more complex accounting, this is the best the kernel can provide. It is also another reason for a higher frequency tick rate.
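The sampling nature of this accounting is easy to model. The sketch below is a user-space illustration with hypothetical types and names (cpu_mode, tick_stats, account_tick(), account_trace()). It charges one whole tick to whichever mode was observed at the moment of the timer interrupt, which mirrors the three-way branch in account_process_tick().

```c
#include <assert.h>
#include <stddef.h>

enum cpu_mode { MODE_USER, MODE_SYSTEM, MODE_IDLE };

struct tick_stats {
    unsigned long user;
    unsigned long system;
    unsigned long idle;
};

/* Charge one whole tick to the mode that was interrupted. */
static void account_tick(struct tick_stats *s, enum cpu_mode mode_at_tick)
{
    switch (mode_at_tick) {
    case MODE_USER:
        s->user++;
        break;
    case MODE_SYSTEM:
        s->system++;
        break;
    case MODE_IDLE:
        s->idle++;
        break;
    }
}

/* Replay a trace holding one sampled mode per tick. */
static struct tick_stats account_trace(const enum cpu_mode *trace, size_t n)
{
    struct tick_stats s = { 0, 0, 0 };

    assert(trace != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        account_tick(&s, trace[i]);
    return s;
}
```

A task that spent most of a tick in user mode but happened to be inside a system call when the interrupt fired has the entire tick charged as system time. A shorter tick reduces the size of that error.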
run_local_timers() marks a softirq to run any expired timers. scheduler_tick() decrements the timeslice of the currently running process and sets need_resched if needed. On SMP systems, it also balances the per-CPU runqueues.
The Time of Day
The current time of day (the wall time) is defined in kernel/time/timekeeping.c:
struct timespec xtime;
The timespec data structure is defined in <linux/time.h> as:
struct timespec {
    __kernel_time_t tv_sec; /* seconds */
    long tv_nsec; /* nanoseconds */
};
The xtime.tv_sec value stores the number of seconds that have elapsed since January 1, 1970 (UTC). The xtime.tv_nsec value stores the number of nanoseconds that have elapsed in the last second.
Reading or writing xtime requires xtime_lock, which is a seqlock rather than an ordinary reader-writer lock. A writer updates xtime like this:
write_seqlock(&xtime_lock);
/* update xtime ... */
write_sequnlock(&xtime_lock);
Reading xtime requires the use of the read_seqbegin() and read_seqretry() functions:
unsigned long seq;
do {
    unsigned long lost;
    seq = read_seqbegin(&xtime_lock);
    usec = timer->get_offset();
    lost = jiffies - wall_jiffies;
    if (lost)
        usec += lost * (1000000 / HZ);
    sec = xtime.tv_sec;
    usec += (xtime.tv_nsec / 1000);
} while (read_seqretry(&xtime_lock, seq));
This loop repeats until the reader is assured that it read the data without an intervening write. If the timer interrupt occurred and updated xtime during the loop, the sequence number obtained at the start of the loop no longer matches the current one, so the loop repeats.
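The retry loop is easier to follow with the sequence counter made explicit. The following is a minimal user-space sketch of the idea using C11 atomics. It is not the kernel's seqlock implementation, and the type and function names (seq_time, seq_time_write(), seq_read_begin(), seq_read_retry()) are hypothetical. It assumes a single writer, or writers that are already serialized by some other lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct seq_time {
    atomic_uint seq;  /* even: stable; odd: a write is in progress */
    atomic_long sec;
    atomic_long nsec;
};

static void seq_time_init(struct seq_time *t)
{
    atomic_init(&t->seq, 0);
    atomic_init(&t->sec, 0);
    atomic_init(&t->nsec, 0);
}

/* Writer side: bump the counter before and after the update. */
static void seq_time_write(struct seq_time *t, long sec, long nsec)
{
    assert((atomic_load(&t->seq) & 1u) == 0); /* no other write in progress */
    atomic_fetch_add(&t->seq, 1);             /* now odd */
    atomic_store(&t->sec, sec);
    atomic_store(&t->nsec, nsec);
    atomic_fetch_add(&t->seq, 1);             /* even again */
}

/* Reader side: wait for an even counter and remember it. */
static unsigned seq_read_begin(struct seq_time *t)
{
    unsigned seq;

    while ((seq = atomic_load(&t->seq)) & 1u)
        ; /* a writer is active */
    return seq;
}

/* True if a write happened since seq_read_begin() returned start. */
static bool seq_read_retry(struct seq_time *t, unsigned start)
{
    return atomic_load(&t->seq) != start;
}

static void seq_time_read(struct seq_time *t, long *sec, long *nsec)
{
    unsigned seq;

    do {
        seq = seq_read_begin(t);
        *sec = atomic_load(&t->sec);
        *nsec = atomic_load(&t->nsec);
    } while (seq_read_retry(t, seq));
}
```

Readers never block the writer. They only pay for a retry when a write overlaps the read, which suits data such as the wall time that is read often and written once per tick.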
The user-space interface for obtaining the wall time is gettimeofday(), which is implemented in the kernel as sys_gettimeofday():
asmlinkage long sys_gettimeofday(struct timeval *tv, struct timezone *tz)
{
    if (likely(tv)) {
        struct timeval ktv;
        do_gettimeofday(&ktv); /* architecture-dependent function */
        if (copy_to_user(tv, &ktv, sizeof(ktv)))
            return -EFAULT;
    }
    if (unlikely(tz)) {
        if (copy_to_user(tz, &sys_tz, sizeof(sys_tz)))
            return -EFAULT;
    }
    return 0;
}
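From user space, the same call looks like this. The sketch uses the POSIX gettimeofday() function; the helper name wall_time_usec() is made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>

/* Wall time in microseconds since the epoch, or -1 on failure. */
static long long wall_time_usec(void)
{
    struct timeval tv;

    if (gettimeofday(&tv, NULL) != 0)
        return -1;
    assert(tv.tv_usec >= 0 && tv.tv_usec < 1000000);
    return (long long)tv.tv_sec * 1000000LL + tv.tv_usec;
}
```

Passing NULL for the timezone argument skips the second copy_to_user() branch in the kernel function above.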
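The kernel also provides the time() system call, but gettimeofday() has largely superseded it. The C library provides further wall-time functions, such as ftime() and ctime(). settimeofday() sets the wall time, but it requires the CAP_SYS_TIME capability.
Apart from updating xtime, the kernel uses the current wall time far less often than user space does. One exception is the VFS, where inodes store various timestamps (accessed, modified, and so on).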
Timers
Timers, sometimes called dynamic timers or kernel timers, are essential for managing the flow of time in kernel code.
Using Timers
A timer is represented by struct timer_list:
struct timer_list {
    struct list_head entry; /* entry in linked list of timers */
    unsigned long expires; /* expiration value, in jiffies */
    void (*function)(unsigned long); /* the timer handler function */
    unsigned long data; /* lone argument to the handler */
    struct tvec_t_base_s *base; /* internal timer field, do not touch */
};
The steps for using a timer:
#include <linux/timer.h>
/* 1. Define a timer */
struct timer_list my_timer;
/* 2. Initialize the timer */
init_timer(&my_timer);
/* 3. Set the expiration, the handler argument, and the handler */
my_timer.expires = jiffies + delay; /* timer expires in delay ticks */
my_timer.data = 0; /* zero is passed to the timer handler */
my_timer.function = my_function; /* function to run when timer expires */
/* 4. Activate the timer */
add_timer(&my_timer);
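The data argument lets several timers share one handler function; the handler tells them apart by the value of data.
To change the expiration of a timer that is already active: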
mod_timer(&my_timer, jiffies + new_delay); /* new expiration */
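mod_timer() also works on a timer that has been initialized but not yet activated; in that case it activates the timer. It returns 0 if the timer was inactive and 1 if it was active.
To deactivate a timer before it expires, use:
del_timer(&my_timer);
This function works on both active and inactive timers. It returns 1 if the timer was active and 0 if it was inactive. A timer that has expired is deactivated automatically, so there is no need to call del_timer() on it.
del_timer_sync(&my_timer) deactivates the timer and also waits for a handler that may be running on another processor (SMP) to finish. This function cannot be used in interrupt context.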
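The return-value conventions of these calls can be exercised with a small user-space model. Everything below is a simulation under stated assumptions: the sim_ names are hypothetical, the pending timers live in a fixed array rather than the kernel's data structures, and nothing here is concurrent, so there is no counterpart to del_timer_sync().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SIM_MAX_TIMERS 8

struct sim_timer {
    unsigned long expires;           /* expiration, in simulated ticks */
    void (*function)(unsigned long); /* handler */
    unsigned long data;              /* argument passed to the handler */
    bool pending;
};

static struct sim_timer *sim_queue[SIM_MAX_TIMERS];

/* Deactivate a timer. Returns 1 if it was pending, 0 if it was not. */
static int sim_del_timer(struct sim_timer *t)
{
    if (!t->pending)
        return 0;
    for (size_t i = 0; i < SIM_MAX_TIMERS; i++)
        if (sim_queue[i] == t)
            sim_queue[i] = NULL;
    t->pending = false;
    return 1;
}

/* (Re)arm a timer. Returns 1 if it was pending, 0 if it was inactive.
 * A full queue silently drops the timer in this sketch. */
static int sim_mod_timer(struct sim_timer *t, unsigned long expires)
{
    int was_pending = sim_del_timer(t);

    t->expires = expires;
    for (size_t i = 0; i < SIM_MAX_TIMERS; i++) {
        if (sim_queue[i] == NULL) {
            sim_queue[i] = t;
            t->pending = true;
            break;
        }
    }
    return was_pending;
}

static void sim_add_timer(struct sim_timer *t)
{
    (void)sim_mod_timer(t, t->expires);
}

/* Run every pending timer that expires at or before now. Returns the count. */
static int sim_run_timers(unsigned long now)
{
    int fired = 0;

    for (size_t i = 0; i < SIM_MAX_TIMERS; i++) {
        struct sim_timer *t = sim_queue[i];

        if (t && (long)(now - t->expires) >= 0) {
            assert(t->function != NULL);
            sim_queue[i] = NULL;
            t->pending = false; /* deactivated before the handler runs */
            t->function(t->data);
            fired++;
        }
    }
    return fired;
}

/* Example handler: remember the data value of the last timer that fired. */
static unsigned long sim_last_data;

static void sim_record(unsigned long data)
{
    sim_last_data = data;
}
```

Because an expired timer is deactivated before its handler runs, a later sim_del_timer() on it returns 0, which matches the behavior described for del_timer().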
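Timer Race Conditions
Because timers run asynchronously with respect to the currently executing code, race conditions are possible in some scenarios, so shared data needs careful protection.
When deleting a timer, prefer del_timer_sync() over del_timer(), because you cannot know whether the timer's handler is already running on another CPU.
Timer Implementation
Timers are implemented in the kernel as a bottom half, specifically a softirq. When the timer interrupt arrives, run_local_timers() is called: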
void run_local_timers(void)
{
    hrtimer_run_queues();
    raise_softirq(TIMER_SOFTIRQ); /* raise the timer softirq */
    softlockup_tick();
}
TIMER_SOFTIRQ is handled by run_timer_softirq(), which runs all expired timers on the current processor.
Delaying Execution
Busy Looping
The simplest solution to implement (although rarely the optimal solution) is busy waiting or busy looping.
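unsigned long timeout = jiffies + 10; /* ten ticks */
while (time_before(jiffies, timeout))
    ;
This approach is crude and wastes a lot of CPU time. A better approach allows the scheduler to run other work while the code waits:
unsigned long delay = jiffies + 5 * HZ; /* five seconds */
while (time_before(jiffies, delay))
    cond_resched();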
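The call to cond_resched() schedules a new process, but only if need_resched is set. In other words, the task gives up the CPU only if a more important task needs to run. This works only in process context. Note also that jiffies is declared volatile, so the while loop reloads it from memory on every iteration instead of reusing a value cached in a register.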
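A loose user-space analogue of this polite busy loop polls a monotonic clock and yields the CPU between polls. It uses the POSIX calls clock_gettime() and sched_yield(); the helper names are invented for the example. Unlike cond_resched(), sched_yield() gives up the CPU on every call, so this only illustrates the shape of the loop.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <sched.h>
#include <stdint.h>
#include <time.h>

/* Monotonic clock reading in milliseconds. */
static int64_t now_ms(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
}

/* Poll until delay_ms has elapsed, yielding the CPU on each iteration.
 * Returns the number of milliseconds actually waited. */
static int64_t polite_wait_ms(int64_t delay_ms)
{
    assert(delay_ms >= 0);

    int64_t start = now_ms();
    int64_t deadline = start + delay_ms;

    while (now_ms() < deadline)
        sched_yield();
    return now_ms() - start;
}
```

The wait is never shorter than requested, but it can be longer when other tasks hold the CPU, which is the same caveat that applies to the kernel loop.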
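Small Delays
In some cases, such as hardware initialization, code needs to wait for a very short time. A jiffies-based delay has a resolution of one tick, so with HZ=100 the shortest possible delay is 10 ms, and sometimes a shorter delay is needed.
The kernel provides delay functions for this purpose, defined in <linux/delay.h> and <asm/delay.h>: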
void udelay(unsigned long usecs)
void ndelay(unsigned long nsecs)
void mdelay(unsigned long msecs)
The udelay() function should be called only for small delays because larger delays on fast machines might result in overflow. As a rule, do not use udelay() for delays more than one millisecond in duration. For longer durations, mdelay() works fine.
schedule_timeout()
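A better way to delay execution is schedule_timeout(). This call puts your task to sleep until at least the specified time has elapsed. There is no guarantee that the sleep lasts exactly the specified time. Usage is simple:
/* set task's state to interruptible sleep */
set_current_state(TASK_INTERRUPTIBLE);
/* take a nap and wake up in "s" seconds */
schedule_timeout(s * HZ);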
The task must be in either the TASK_INTERRUPTIBLE or the TASK_UNINTERRUPTIBLE state before schedule_timeout() is called, or else the task will not go to sleep.
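User space has a similar primitive in the POSIX nanosleep() call: the caller sleeps instead of spinning, and if a signal interrupts the sleep, the call reports how much time was left. The sketch below wraps it in a hypothetical helper, sleep_remaining_ms(), to show that shape; it is an analogy, not a description of how schedule_timeout() works internally.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <time.h>

/* Sleep for ms milliseconds. Returns 0 after a full sleep, the number of
 * milliseconds left if a signal interrupted the sleep, or -1 on error. */
static long sleep_remaining_ms(long ms)
{
    assert(ms >= 0);

    struct timespec req = { .tv_sec = ms / 1000,
                            .tv_nsec = (ms % 1000) * 1000000L };
    struct timespec rem = { 0, 0 };

    if (nanosleep(&req, &rem) == 0)
        return 0;
    if (errno != EINTR)
        return -1;
    return (long)rem.tv_sec * 1000 + rem.tv_nsec / 1000000L;
}
```

An interruptible sleep that returns the time left lets the caller decide whether to go back to sleep or handle the interruption first.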
schedule_timeout() Implementation
schedule_timeout() is built on kernel timers, and its implementation is straightforward:
/**
 * schedule_timeout - sleep until timeout
 * @timeout: timeout value in jiffies
 *
 * Make the current task sleep until @timeout jiffies have
 * elapsed. The routine will return immediately unless
 * the current task state has been set (see set_current_state()).
 *
 * You can set the task state as follows -
 *
 * %TASK_UNINTERRUPTIBLE - at least @timeout jiffies are guaranteed to
 * pass before the routine returns unless the current task is explicitly
 * woken up, (e.g. by wake_up_process())".
 *
 * %TASK_INTERRUPTIBLE - the routine may return early if a signal is
 * delivered to the current task or the current task is explicitly woken
 * up.
 *
 * The current task state is guaranteed to be TASK_RUNNING when this
 * routine returns.
 *
 * Specifying a @timeout value of %MAX_SCHEDULE_TIMEOUT will schedule
 * the CPU away without a bound on the timeout. In this case the return
 * value will be %MAX_SCHEDULE_TIMEOUT.
 *
 * Returns 0 when the timer has expired otherwise the remaining time in
 * jiffies will be returned. In all cases the return value is guaranteed
 * to be non-negative.
 */
signed long __sched schedule_timeout(signed long timeout)
{
    struct process_timer timer;
    unsigned long expire;
    switch (timeout)
    {
    case MAX_SCHEDULE_TIMEOUT:
        /*
         * These two special cases are useful to be comfortable
         * in the caller. Nothing more. We could take
         * MAX_SCHEDULE_TIMEOUT from one of the negative value
         * but I'd like to return a valid offset (>=0) to allow
         * the caller to do everything it want with the retval.
         */
        schedule();
        goto out;
    default:
        /*
         * Another bit of PARANOID. Note that the retval will be
         * 0 since no piece of kernel is supposed to do a check
         * for a negative retval of schedule_timeout() (since it
         * should never happens anyway). You just have the printk()
         * that will tell you if something is gone wrong and where.
         */
        if (timeout < 0) {
            printk(KERN_ERR "schedule_timeout: wrong timeout "
                   "value %lx\n", timeout);
            dump_stack();
            current->state = TASK_RUNNING;
            goto out;
        }
    }
    expire = timeout + jiffies;
    timer.task = current;
    timer_setup_on_stack(&timer.timer, process_timeout, 0);
    __mod_timer(&timer.timer, expire, 0);
    schedule();
    del_singleshot_timer_sync(&timer.timer);
    /* Remove the timer from the object tracker */
    destroy_timer_on_stack(&timer.timer);
    timeout = expire - jiffies;
out:
    return timeout < 0 ? 0 : timeout;
}
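schedule_timeout() sets process_timeout() as the timer handler, activates the timer, and then calls schedule() to give up the CPU. When the timer expires, process_timeout() runs. Note that in this listing the handler receives the struct timer_list pointer and recovers the enclosing process_timer with from_timer(), unlike the older unsigned long data argument shown earlier: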
static void process_timeout(struct timer_list *t)
{
    struct process_timer *timeout = from_timer(timeout, t, timer);
    wake_up_process(timeout->task);
}
This function puts the task in the TASK_RUNNING state and places it back on the runqueue.
When the task is scheduled again, it resumes execution inside schedule_timeout(), right after the call to schedule().
In case the task was awakened prematurely (if a signal was received), the timer is destroyed. The function then returns the time remaining until the timeout, in jiffies, or 0 if the timeout has fully elapsed.
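The return value follows directly from the last lines of the listing: expire minus the current tick count, clamped at zero. The sketch below models just that calculation in user space. The function name timeout_left() and its parameters are hypothetical, and the unsigned-to-signed conversion assumes two's complement.

```c
#include <assert.h>

/* Model of the value schedule_timeout() returns when a task asked for
 * `timeout` ticks at tick `start` and resumed at tick `woke`. */
static long timeout_left(long timeout, unsigned long start, unsigned long woke)
{
    assert(timeout >= 0); /* the kernel logs an error for negative values */

    unsigned long expire = start + (unsigned long)timeout;
    long left = (long)(expire - woke);

    return left < 0 ? 0 : left;
}
```

A task that asked for 50 ticks and was woken by a signal after 20 gets 30 back, and can pass that value to another call to finish the wait. Because the subtraction is done on unsigned values, the result stays correct when the tick counter wraps between start and woke.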
The code in the switch() statement handles special cases and is not part of the function's typical use.
The MAX_SCHEDULE_TIMEOUT check enables a task to sleep indefinitely. If you do this, you must have another method of waking your task up!