Chapter 14 The Block I/O Layer
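Block devices are hardware devices distinguished by random (that is, not necessarily sequential) access to fixed-size chunks of data. The fixed-size chunks are called blocks. The other basic type of device is the character device, which is accessed as a sequential stream of data; keyboards and serial ports are examples.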
Anatomy of a Block Device
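The smallest addressable unit on a block device is the sector. Sector sizes are powers of two, and 512 bytes is the most common. The sector is a physical property of the device and its fundamental unit: the device cannot address or operate on anything smaller than a sector, although it can operate on several sectors at once. The smallest unit that software addresses is the block. The block is an abstraction of the filesystem, which can be accessed only in multiples of a block. Although the physical device is addressable at the sector level, the kernel performs all disk operations in terms of blocks. A block must be a multiple of the sector size, and the kernel also requires that a block be no larger than one page. Common block sizes are 512 bytes, 1 KB, and 4 KB. Sectors matter to the kernel because all device I/O must be done in units of sectors.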
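The size constraints above can be sketched as a small check. This is an illustration only: valid_block_size() and sectors_per_block() are hypothetical helpers, not kernel functions, and the power-of-two test is an added assumption based on the common sizes listed above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, not a kernel function: checks the constraints
 * described above for a candidate block size (all sizes in bytes). */
static bool valid_block_size(size_t block, size_t sector, size_t page)
{
    assert(sector > 0 && page > 0);

    if (block == 0 || block % sector != 0)  /* whole number of sectors */
        return false;
    if (block > page)                       /* no larger than one page */
        return false;
    /* Assumption: block sizes are powers of two, like 512, 1 KB, 4 KB. */
    return (block & (block - 1)) == 0;
}

/* Number of sectors covered by one block. */
static size_t sectors_per_block(size_t block, size_t sector)
{
    assert(sector > 0 && block % sector == 0);
    return block / sector;
}
```

With 512-byte sectors and 4 KB pages, a 4 KB block passes the check and spans 8 sectors, while an 8 KB block is rejected because it exceeds the page size.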
Buffers and Buffer Heads
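When a block is stored in memory (say, after a read or while waiting to be written), it is stored in a buffer. Each buffer corresponds to exactly one block and serves as the in-memory object that represents that disk block. A single page can hold one or more blocks. Because the kernel needs some control information to accompany the data (such as which block device the buffer comes from and which block on that device it corresponds to), each buffer has a descriptor called a buffer head, represented by struct buffer_head. The buffer head contains all the information the kernel needs to manipulate the buffer.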
#include <linux/buffer_head.h>
struct buffer_head {
    unsigned long b_state;              /* buffer state flags */
    struct buffer_head *b_this_page;    /* list of page's buffers */
    struct page *b_page;                /* associated page */
    sector_t b_blocknr;                 /* starting block number */
    size_t b_size;                      /* size of mapping */
    char *b_data;                       /* pointer to data within the page */
    struct block_device *b_bdev;        /* associated block device */
    bh_end_io_t *b_end_io;              /* I/O completion */
    void *b_private;                    /* reserved for b_end_io */
    struct list_head b_assoc_buffers;   /* associated mappings */
    struct address_space *b_assoc_map;  /* associated address space */
    atomic_t b_count;                   /* use count */
};
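The b_state field holds the state of the buffer as a set of flags (for example BH_Uptodate, BH_Dirty, and BH_Lock), which are defined in <linux/buffer_head.h>.
The b_count field is the buffer's usage count. It is incremented and decremented through the following two functions.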
static inline void get_bh(struct buffer_head *bh)
{
    atomic_inc(&bh->b_count);
}
static inline void put_bh(struct buffer_head *bh)
{
    atomic_dec(&bh->b_count);
}
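Before manipulating a buffer head, increment its usage count with get_bh() so that the buffer head is not freed while it is in use; when finished, drop the reference with put_bh(). The physical block on disk to which a given buffer corresponds is logical block number b_blocknr on the block device described by b_bdev. b_page is the physical page in memory that holds the buffer, b_data is a pointer to the data itself, and b_size is its size in bytes. In other words, the block occupies memory from address b_data up to address (b_data + b_size). The purpose of a buffer head is to describe this mapping between an on-disk block and an in-memory buffer.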
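To make the b_data and b_size relationship concrete, here is a minimal user-space sketch. struct demo_bh, map_page(), and bh_contains() are hypothetical names invented for this example. The struct carries only three of the fields discussed above, and numbering the blocks in a page consecutively is an assumption made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified user-space stand-in for struct buffer_head. It keeps only
 * the fields discussed above and is not the kernel definition. */
struct demo_bh {
    unsigned long long b_blocknr; /* logical block number on the device */
    size_t b_size;                /* block size in bytes */
    char *b_data;                 /* start of the block's data in the page */
};

/* Describe the blocks held in one page: buffer i starts i * block_size
 * bytes into the page and (by assumption) maps block first_block + i.
 * Returns the number of buffer heads filled in. */
static size_t map_page(char *page, size_t page_size, size_t block_size,
                       unsigned long long first_block,
                       struct demo_bh *bhs, size_t max_bhs)
{
    assert(block_size > 0 && page_size % block_size == 0);

    size_t n = page_size / block_size;
    if (n > max_bhs)
        n = max_bhs;
    for (size_t i = 0; i < n; i++) {
        bhs[i].b_blocknr = first_block + i;
        bhs[i].b_size = block_size;
        bhs[i].b_data = page + i * block_size;
    }
    return n;
}

/* True if addr lies in [b_data, b_data + b_size). */
static int bh_contains(const struct demo_bh *bh, const char *addr)
{
    return addr >= bh->b_data && addr < bh->b_data + bh->b_size;
}
```

For a 4 KB page and 1 KB blocks, map_page() fills in four buffer heads, and the third one covers the bytes at page offsets 2048 through 3071.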
The bio Structure
The basic container for block I/O within the kernel is the bio structure, which is defined in <linux/bio.h>.
This structure represents block I/O operations that are in flight (active) as a list of segments. A segment is a chunk of a buffer that is contiguous in memory. Thus, individual buffers need not be contiguous in memory. By allowing the buffers to be described in chunks, the bio structure lets the kernel perform block I/O on even a single buffer from multiple locations in memory. Vector I/O such as this is called scatter-gather I/O.
struct bio {
    sector_t bi_sector;                /* associated sector on disk */
    struct bio *bi_next;               /* list of requests */
    struct block_device *bi_bdev;      /* associated block device */
    unsigned long bi_flags;            /* status and command flags */
    unsigned long bi_rw;               /* read or write? */
    unsigned short bi_vcnt;            /* number of bio_vecs off */
    unsigned short bi_idx;             /* current index in bi_io_vec */
    unsigned short bi_phys_segments;   /* number of segments */
    unsigned int bi_size;              /* I/O count */
    unsigned int bi_seg_front_size;    /* size of first segment */
    unsigned int bi_seg_back_size;     /* size of last segment */
    unsigned int bi_max_vecs;          /* maximum bio_vecs possible */
    unsigned int bi_comp_cpu;          /* completion CPU */
    atomic_t bi_cnt;                   /* usage counter */
    struct bio_vec *bi_io_vec;         /* bio_vec list */
    bio_end_io_t *bi_end_io;           /* I/O completion method */
    void *bi_private;                  /* owner-private method */
    bio_destructor_t *bi_destructor;   /* destructor method */
    struct bio_vec bi_inline_vecs[0];  /* inline bio vectors */
};
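The primary purpose of the bio structure is to represent an in-flight block I/O operation. The most important fields are bi_io_vec, bi_vcnt, and bi_idx; the figure below shows how they relate.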
I/O vectors
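The bi_io_vec field points to an array of bio_vec structures. These structures are used as lists of individual segments in this specific block I/O operation. Each bio_vec is treated as a vector of the form <page, offset, len>, and the full array of these vectors describes the entire buffer.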
struct bio_vec {
    /* pointer to the physical page on which this buffer resides */
    struct page *bv_page;
    /* the length in bytes of this buffer */
    unsigned int bv_len;
    /* the byte offset within the page where the buffer resides */
    unsigned int bv_offset;
};
In each given block I/O operation, there are bi_vcnt vectors in the bio_vec array starting with bi_io_vec. As the block I/O operation is carried out, the bi_idx field is used to point to the current index into the array.
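The way bi_idx moves through the bi_vcnt entries of the array can be sketched in user space. The demo_ types and functions below are hypothetical stand-ins that keep only the fields just described; bv_page is a plain pointer here rather than a struct page.

```c
#include <assert.h>
#include <stddef.h>

/* User-space stand-ins for bio_vec and bio with only the relevant fields. */
struct demo_bio_vec {
    void *bv_page;          /* plain buffer standing in for a page */
    unsigned int bv_len;    /* length of the segment in bytes */
    unsigned int bv_offset; /* offset of the segment within the page */
};

struct demo_bio {
    unsigned short bi_vcnt;         /* number of entries in bi_io_vec */
    unsigned short bi_idx;          /* index of the current entry */
    struct demo_bio_vec *bi_io_vec; /* the segment array */
};

/* Bytes still to be processed: the segments from bi_idx up to bi_vcnt. */
static unsigned int demo_bio_remaining(const struct demo_bio *bio)
{
    assert(bio->bi_idx <= bio->bi_vcnt);

    unsigned int total = 0;
    for (unsigned short i = bio->bi_idx; i < bio->bi_vcnt; i++)
        total += bio->bi_io_vec[i].bv_len;
    return total;
}

/* Mark the current segment as done by advancing bi_idx.
 * Returns 1 if a segment was consumed, 0 if none were left. */
static int demo_bio_advance(struct demo_bio *bio)
{
    if (bio->bi_idx >= bio->bi_vcnt)
        return 0;
    bio->bi_idx++;
    return 1;
}
```

The segments may live in different pages or at different offsets of the same page, which is what makes this a scatter-gather description of one I/O operation.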
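In summary, each block I/O request is represented by a bio structure. Each request comprises one or more blocks, which are stored in an array of bio_vec structures. The bio structure maintains a usage count in its bi_cnt field. When this field reaches zero, the structure is destroyed and its memory is freed. The usage count is managed with these two functions: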
void bio_get(struct bio *bio);  /* increment the usage count */
void bio_put(struct bio *bio);  /* decrement it; the bio is destroyed at zero */
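The get/put pattern behind bi_cnt can be sketched in user space with C11 atomics. This is not the kernel implementation: struct demo_obj, demo_alloc(), demo_get(), and demo_put() are hypothetical names, and starting the count at 1 for the creator is an assumption of the sketch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

struct demo_obj {
    atomic_int cnt;   /* plays the role of bi_cnt */
};

/* Allocate an object; the creator holds the first reference. */
static struct demo_obj *demo_alloc(void)
{
    struct demo_obj *o = malloc(sizeof *o);
    if (o)
        atomic_init(&o->cnt, 1);
    return o;
}

/* Take an additional reference. */
static void demo_get(struct demo_obj *o)
{
    atomic_fetch_add(&o->cnt, 1);
}

/* Drop one reference. Frees the object and returns 1 if this was the
 * last reference, otherwise returns 0. */
static int demo_put(struct demo_obj *o)
{
    int old = atomic_fetch_sub(&o->cnt, 1);

    assert(old > 0);
    if (old == 1) {
        free(o);
        return 1;
    }
    return 0;
}
```

Every demo_get() must be paired with a demo_put(); the object goes away only when the last holder lets go.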
buffer_head vs. bio
The bio structure represents an I/O operation, which may include one or more pages in memory.
// TODO
Request Queues
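Block devices maintain request queues to store their pending block I/O requests. A request queue is represented by struct request_queue, defined in <linux/blkdev.h>. It contains a doubly linked list of requests and associated control information. Higher-level kernel code, such as filesystems, adds requests to the queue. As long as the request queue is nonempty, the block device driver associated with the queue grabs the request at the head of the queue and submits it to its block device.
Each item in the queue is a single request, represented by struct request. Each request can be composed of more than one bio, because a single request can operate on multiple consecutive disk blocks.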
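I/O Schedulers
Simply sending requests to the block device in the order in which the kernel issues them results in poor performance. Instead, the kernel merges and sorts I/O operations before issuing them to the disk, which greatly improves performance. The subsystem that does this work is called the I/O scheduler.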
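The Job of an I/O Scheduler
An I/O scheduler works by managing a block device's request queue. It decides the order of the requests in the queue and when each request is dispatched to the block device. Its goal is to reduce seeks and thereby improve system throughput. To that end, the entire request queue is kept sorted by sector.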
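The Linus Elevator
The Linus Elevator performs both merging and sorting. When a request is added to the queue, the following cases are checked in order: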
- If a request to an adjacent on-disk sector is in the queue, the existing request and the new request merge into a single request.
- If a request in the queue is sufficiently old, the new request is inserted at the tail of the queue to prevent starvation of the other, older, requests.
- If a suitable location sector-wise is in the queue, the new request is inserted there. This keeps the queue sorted by physical location on disk.
- Finally, if no such suitable insertion point exists, the request is inserted at the tail of the queue.
The elevator algorithm is implemented in block/elevator.c.
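The four cases of the Linus Elevator can be sketched as a single insertion routine. This is a user-space illustration only, not the code in block/elevator.c. The demo_ names, the array-backed queue, and the MAX_AGE threshold are all assumptions made for the example.

```c
#include <assert.h>
#include <string.h>

#define QUEUE_MAX 64
#define MAX_AGE   100   /* hypothetical age threshold, in arbitrary ticks */

struct demo_req {
    unsigned long sector;    /* first sector of the request */
    unsigned long nr;        /* number of sectors */
    unsigned long queued_at; /* tick at which the request was queued */
};

struct demo_queue {
    struct demo_req reqs[QUEUE_MAX];
    int len;
};

enum demo_action { DEMO_MERGED, DEMO_TAIL_AGED, DEMO_SORTED, DEMO_TAIL };

static void insert_at(struct demo_queue *q, int pos, struct demo_req r)
{
    assert(q->len < QUEUE_MAX && pos >= 0 && pos <= q->len);
    memmove(&q->reqs[pos + 1], &q->reqs[pos],
            (size_t)(q->len - pos) * sizeof q->reqs[0]);
    q->reqs[pos] = r;
    q->len++;
}

static enum demo_action demo_add(struct demo_queue *q, struct demo_req r,
                                 unsigned long now)
{
    int i;

    r.queued_at = now;

    /* 1. Merge with a queued request for adjacent sectors. */
    for (i = 0; i < q->len; i++) {
        struct demo_req *e = &q->reqs[i];
        if (e->sector + e->nr == r.sector) {   /* new request follows e */
            e->nr += r.nr;
            return DEMO_MERGED;
        }
        if (r.sector + r.nr == e->sector) {    /* new request precedes e */
            e->sector = r.sector;
            e->nr += r.nr;
            return DEMO_MERGED;
        }
    }

    /* 2. A sufficiently old request is waiting: go to the tail. */
    for (i = 0; i < q->len; i++) {
        if (now - q->reqs[i].queued_at > MAX_AGE) {
            insert_at(q, q->len, r);
            return DEMO_TAIL_AGED;
        }
    }

    /* 3. Insert before the first request that starts at a higher sector. */
    for (i = 0; i < q->len; i++) {
        if (q->reqs[i].sector > r.sector) {
            insert_at(q, i, r);
            return DEMO_SORTED;
        }
    }

    /* 4. No suitable position: append at the tail. */
    insert_at(q, q->len, r);
    return DEMO_TAIL;
}
```

Note that the age check in step 2 only stops new requests from jumping ahead of old ones. It does nothing to service the old request by any deadline, which is the weakness the later schedulers address.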
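The Deadline I/O Scheduler
The Deadline I/O scheduler seeks to prevent the starvation caused by the Linus Elevator. In the interest of minimizing seeks, heavy disk I/O to one area of the disk can indefinitely starve requests to another part of the disk. // TODO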
The Anticipatory I/O Scheduler
Although the Deadline I/O scheduler does a good job of minimizing read latency, it does so at the expense of global throughput. The Anticipatory I/O scheduler aims to keep providing excellent read latency while also providing excellent global throughput.
First, the Anticipatory I/O scheduler starts with the Deadline I/O scheduler as its base. Therefore, it is not entirely different. The Anticipatory I/O scheduler implements three queues (plus the dispatch queue) and expirations for each request, just like the Deadline I/O scheduler. The major change is the addition of an anticipation heuristic.
The CFQ I/O Scheduler
The Complete Fair Queuing (CFQ) I/O scheduler was designed for specialized workloads, but in practice it performs well across many workloads.
The CFQ I/O scheduler assigns incoming I/O requests to specific queues based on the process originating the I/O request. The difference with the CFQ I/O scheduler is that there is one queue for each process submitting I/O. The CFQ I/O scheduler then services the queues round robin, plucking a configurable number of requests (by default, four) from each queue before continuing on to the next.
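The round-robin service order can be sketched as follows. This user-space illustration covers only the per-process queues and the quantum of four described above. struct proc_queue, pq_push(), and cfq_dispatch_all() are hypothetical names, and requests are reduced to integer IDs.

```c
#include <assert.h>

#define QCAP    16   /* capacity of each per-process queue in this sketch */
#define QUANTUM 4    /* requests taken from a queue per visit */

struct proc_queue {
    int reqs[QCAP];  /* request IDs in arrival order */
    int head, tail;  /* pending entries are reqs[head..tail-1] */
};

static void pq_push(struct proc_queue *q, int id)
{
    assert(q->tail < QCAP);
    q->reqs[q->tail++] = id;
}

/* Visit the per-process queues round robin, taking up to QUANTUM
 * requests from each before moving on to the next. Writes the dispatch
 * order to out and returns the number of requests dispatched. */
static int cfq_dispatch_all(struct proc_queue *qs, int nq,
                            int *out, int max_out)
{
    int n = 0, progress = 1;

    while (progress) {
        progress = 0;
        for (int p = 0; p < nq; p++) {
            for (int k = 0; k < QUANTUM; k++) {
                if (qs[p].head == qs[p].tail || n == max_out)
                    break;
                out[n++] = qs[p].reqs[qs[p].head++];
                progress = 1;
            }
        }
    }
    return n;
}
```

A process with many queued requests cannot monopolize the device: after its four requests are taken, every other process with pending I/O gets a turn before it is visited again.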
The Noop I/O Scheduler
The fourth and final I/O scheduler is the Noop I/O scheduler, so named because it is basically a no-op: it does not do much.
The Noop I/O scheduler does not perform sorting or any other form of seek-prevention whatsoever. The Noop I/O scheduler does perform merging, however, as its lone chore. The Noop I/O scheduler's lack of hard work is with reason. It is intended for block devices that are truly random-access, such as flash memory cards.
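A merge-only queue of this kind might look like the sketch below. It is a simplification under stated assumptions: the noop_ names are hypothetical, the queue is a fixed array, and every queued request is checked for adjacency, which may differ from what the real scheduler does.

```c
#include <assert.h>

#define NOOP_MAX 64

struct noop_req {
    unsigned long sector; /* first sector of the request */
    unsigned long nr;     /* number of sectors */
};

struct noop_queue {
    struct noop_req reqs[NOOP_MAX];
    int len;
};

/* Merge r into an adjacent queued request if there is one; otherwise
 * append it in arrival order. No sorting is done.
 * Returns 1 if merged, 0 if appended. */
static int noop_add(struct noop_queue *q, struct noop_req r)
{
    for (int i = 0; i < q->len; i++) {
        struct noop_req *e = &q->reqs[i];
        if (e->sector + e->nr == r.sector) {   /* r follows e on disk */
            e->nr += r.nr;
            return 1;
        }
        if (r.sector + r.nr == e->sector) {    /* r precedes e on disk */
            e->sector = r.sector;
            e->nr += r.nr;
            return 1;
        }
    }
    assert(q->len < NOOP_MAX);
    q->reqs[q->len++] = r;
    return 0;
}
```

Because nothing is reordered, requests leave the queue in near-FIFO order, which is acceptable on a device with no seek penalty.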
I/O Scheduler Selection
The I/O scheduler can be selected with the boot parameter elevator=foo, where foo names the scheduler to use.