Chapter 14 The Block I/O Layer
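A block is an abstraction of the filesystem: filesystems can be accessed only in multiples of a block. Although the physical device is addressable at the sector level, the kernel performs all disk operations in terms of blocks. A block must be a multiple of the sector size, and the kernel also requires that a block be no larger than one page. Common block sizes are 512 bytes, 1 KB, and 4 KB. Sectors matter to the kernel because all device I/O must be done in units of sectors.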
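The size rules above can be sketched as a small userspace check. This is only an illustration, not kernel code: the names `DEMO_SECTOR_SIZE`, `DEMO_PAGE_SIZE`, `block_size_is_valid`, `sectors_per_block`, and `block_to_sector` are hypothetical, and the 512-byte sector and 4 KB page are assumed values. The power-of-two check is an extra kernel requirement that the paragraph above does not mention.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_SECTOR_SIZE 512u   /* assumed sector size in bytes */
#define DEMO_PAGE_SIZE   4096u  /* assumed page size in bytes */

/* Returns true if block_size satisfies the block-size rules. */
bool block_size_is_valid(uint32_t block_size)
{
    /* At least one sector, at most one page. */
    if (block_size < DEMO_SECTOR_SIZE || block_size > DEMO_PAGE_SIZE)
        return false;
    /* Must be a whole number of sectors. */
    if (block_size % DEMO_SECTOR_SIZE != 0)
        return false;
    /* Must be a power of two (exactly one bit set). */
    return (block_size & (block_size - 1)) == 0;
}

/* Number of sectors that make up one block. */
uint32_t sectors_per_block(uint32_t block_size)
{
    assert(block_size_is_valid(block_size));
    return block_size / DEMO_SECTOR_SIZE;
}

/* First sector of a block: the unit in which the device is addressed. */
uint64_t block_to_sector(uint64_t block_nr, uint32_t block_size)
{
    return block_nr * sectors_per_block(block_size);
}
```

With 1 KB blocks, for instance, `block_to_sector(10, 1024)` maps block 10 to the sector where its two-sector span begins.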
Buffers and Buffer Heads
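Each buffer corresponds to exactly one block: the buffer is the in-memory object that represents a disk block. A single page can hold one or more blocks. Because the kernel needs some control information to go with the data (such as which block device the buffer came from and which specific block it holds), each buffer has a descriptor called a buffer head, represented by struct buffer_head: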
#include <linux/buffer_head.h>
struct buffer_head {
    unsigned long b_state;              /* buffer state flags */
    struct buffer_head *b_this_page;    /* list of page's buffers */
    struct page *b_page;                /* associated page */
    sector_t b_blocknr;                 /* starting block number */
    size_t b_size;                      /* size of mapping */
    char *b_data;                       /* pointer to data within the page */
    struct block_device *b_bdev;        /* associated block device */
    bh_end_io_t *b_end_io;              /* I/O completion */
    void *b_private;                    /* reserved for b_end_io */
    struct list_head b_assoc_buffers;   /* associated mappings */
    struct address_space *b_assoc_map;  /* associated address space */
    atomic_t b_count;                   /* use count */
};
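The b_count field is the buffer's usage count. It is incremented and decremented by two inline functions:
static inline void get_bh(struct buffer_head *bh)
static inline void put_bh(struct buffer_head *bh)
Before manipulating a buffer head, increment its reference count with get_bh() so that the buffer head is not freed while it is in use. When you are finished, call put_bh() to drop the reference. The physical block on disk that a given buffer corresponds to is logical block number b_blocknr on the block device described by b_bdev. In memory, b_data points to the start of the block and b_size gives its size. In other words, the block lives in memory from address b_data to address (b_data + b_size). The purpose of the buffer head is to describe this mapping between the on-disk block and the in-memory buffer.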
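To make the b_data / b_size mapping and the get/put discipline concrete, here is a minimal userspace sketch. It is not the kernel's implementation: `struct demo_buffer_head`, `demo_get_bh`, `demo_put_bh`, and `demo_map_buffer` are hypothetical stand-ins, the count is a plain `int` rather than an `atomic_t`, and the sketch assumes that consecutive buffers in a page hold consecutive block numbers, which need not be true in general.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_PAGE_SIZE 4096u  /* assumed page size in bytes */

/* Simplified, hypothetical stand-in for struct buffer_head. */
struct demo_buffer_head {
    unsigned long blocknr;  /* logical block number on the device */
    size_t size;            /* block size in bytes */
    char *data;             /* start of the block inside the page */
    int count;              /* usage count (the kernel uses atomic_t) */
};

/* Take a reference before touching the buffer head. */
void demo_get_bh(struct demo_buffer_head *bh)
{
    bh->count++;
}

/* Drop a reference previously taken with demo_get_bh(). */
void demo_put_bh(struct demo_buffer_head *bh)
{
    assert(bh->count > 0);
    bh->count--;
}

/*
 * Describe the index-th block-sized buffer of a page.
 * first_block is the block number mapped at the start of the page
 * (assumption: blocks in one page are consecutive on disk).
 */
void demo_map_buffer(struct demo_buffer_head *bh, char *page,
                     unsigned long first_block, size_t block_size,
                     unsigned int index)
{
    assert(block_size > 0 && (index + 1) * block_size <= DEMO_PAGE_SIZE);
    bh->blocknr = first_block + index;
    bh->size = block_size;
    bh->data = page + index * block_size;  /* block spans data .. data + size */
    bh->count = 0;
}
```

With 1 KB blocks, the third buffer of a page (index 2) starts 2048 bytes into the page and ends at 3072, which is the [b_data, b_data + b_size) range described above.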
The basic container for block I/O within the kernel is the bio structure, which is defined in <linux/bio.h>.
This structure represents block I/O operations that are in flight (active) as a list of segments. A segment is a chunk of a buffer that is contiguous in memory. Thus, individual buffers need not be contiguous in memory. By allowing the buffers to be described in chunks, the bio structure provides the capability for the kernel to perform block I/O operations of even a single buffer from multiple locations in memory. Vector I/O such as this is called scatter-gather I/O.
struct bio {
    sector_t bi_sector;                 /* associated sector on disk */
    struct bio *bi_next;                /* list of requests */
    struct block_device *bi_bdev;       /* associated block device */
    unsigned long bi_flags;             /* status and command flags */
    unsigned long bi_rw;                /* read or write? */
    unsigned short bi_vcnt;             /* number of bio_vecs off */
    unsigned short bi_idx;              /* current index in bi_io_vec */
    unsigned short bi_phys_segments;    /* number of segments */
    unsigned int bi_size;               /* I/O count */
    unsigned int bi_seg_front_size;     /* size of first segment */
    unsigned int bi_seg_back_size;      /* size of last segment */
    unsigned int bi_max_vecs;           /* maximum bio_vecs possible */
    unsigned int bi_comp_cpu;           /* completion CPU */
    atomic_t bi_cnt;                    /* usage counter */
    struct bio_vec *bi_io_vec;          /* bio_vec list */
    bio_end_io_t *bi_end_io;            /* I/O completion method */
    void *bi_private;                   /* owner-private method */
    bio_destructor_t *bi_destructor;    /* destructor method */
    struct bio_vec bi_inline_vecs[0];   /* inline bio vectors */
};
The primary purpose of the bio structure is to represent an in-flight block I/O operation. Its most important fields are bi_io_vec, bi_vcnt, and bi_idx.
I/O vectors
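The bi_io_vec field points to an array of bio_vec structures. These structures are used as lists of individual segments in this specific block I/O operation. Each bio_vec is treated as a vector of the form <page, offset, len>. The full array of these vectors makes up the entire buffer.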
struct bio_vec {
    /* pointer to the physical page on which this buffer resides */
    struct page *bv_page;
    /* the length in bytes of this buffer */
    unsigned int bv_len;
    /* the byte offset within the page where the buffer resides */
    unsigned int bv_offset;
};
In each given block I/O operation, there are bi_vcnt vectors in the bio_vec array starting with bi_io_vec. As the block I/O operation is carried out, the bi_idx field is used to point to the current index into the array.
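The way bi_io_vec, bi_vcnt, and bi_idx cooperate can be sketched in userspace. The types and the function below (`struct demo_vec`, `struct demo_bio`, `demo_bio_gather`) are hypothetical analogues invented for illustration: a "page" here is just a `const char *`, and the copy into one contiguous buffer stands in for whatever the real I/O path does with each segment.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical userspace analogue of struct bio_vec: <page, offset, len>. */
struct demo_vec {
    const char *page;     /* like bv_page, but plain memory */
    unsigned int len;     /* like bv_len */
    unsigned int offset;  /* like bv_offset */
};

/* Hypothetical analogue of the three key struct bio fields. */
struct demo_bio {
    struct demo_vec *io_vec;  /* like bi_io_vec */
    unsigned short vcnt;      /* like bi_vcnt */
    unsigned short idx;       /* like bi_idx */
};

/*
 * Gather the remaining segments, starting at idx, into one contiguous
 * buffer. Returns the number of bytes copied. Stops before any segment
 * that would not fit, so idx only advances past fully copied segments.
 */
size_t demo_bio_gather(struct demo_bio *bio, char *dst, size_t dst_len)
{
    size_t copied = 0;

    while (bio->idx < bio->vcnt) {
        const struct demo_vec *v = &bio->io_vec[bio->idx];

        if (v->len > dst_len - copied)
            break;  /* segment would overflow dst */
        memcpy(dst + copied, v->page + v->offset, v->len);
        copied += v->len;
        bio->idx++;
    }
    return copied;
}
```

Two segments taken from unrelated memory locations end up as one logical buffer, which is the scatter-gather idea: the data is described in chunks and never has to be contiguous in memory.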
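In summary, each block I/O request is represented by a bio structure. Each request comprises one or more blocks, which are stored in an array of bio_vec structures. The bio structure keeps its usage count in bi_cnt: bio_get() increments it, and bio_put() decrements it and destroys the bio when the count reaches zero.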
void bio_get(struct bio *bio)
void bio_put(struct bio *bio)
buffer_head VS bio
Request Queues
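Block devices maintain request queues to store their pending block I/O requests. The request queue is represented by struct request_queue. Each item in the queue is a single request, represented by struct request.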
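Simply sending requests to the block device in the order in which the kernel issues them results in poor performance. Instead, the kernel merges and sorts I/O operations before dispatching them to the disk, which greatly improves performance. The subsystem that does this work is called the I/O scheduler.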
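The Job of an I/O Scheduler
An I/O scheduler works by managing a block device's request queue. It decides the order of requests in the queue and when each request is dispatched to the block device. Its goal is to reduce seeks and thereby improve system throughput. The entire request queue is kept sorted by sector.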
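The Linus Elevator
The Linus Elevator performs both merging and sorting. When a request is added to the queue, four operations are possible. In order, they are: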
- If a request to an adjacent on-disk sector is in the queue, the existing request and the new request merge into a single request.
- If a request in the queue is sufficiently old, the new request is inserted at the tail of the queue to prevent starvation of the other, older, requests.
- If a suitable location sector-wise is in the queue, the new request is inserted there. This keeps the queue sorted by physical location on disk.
- Finally, if no such suitable insertion point exists, the request is inserted at the tail of the queue.
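The four steps above can be sketched as an insertion routine over a small array-backed queue. This is a simplified model under stated assumptions, not the kernel's elevator code: `struct demo_request`, `struct demo_queue`, `demo_elevator_add`, the fixed capacity, and the `DEMO_MAX_AGE` threshold are all hypothetical, and time is a plain tick counter supplied by the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_QUEUE_MAX 32
#define DEMO_MAX_AGE   100  /* assumed starvation threshold, in ticks */

struct demo_request {
    unsigned long sector;  /* first sector of the request */
    unsigned long nr;      /* number of sectors */
    unsigned long issued;  /* tick at which the request was queued */
};

struct demo_queue {
    struct demo_request req[DEMO_QUEUE_MAX];
    size_t len;
};

/* Open a slot at pos and store r there. Caller guarantees free space. */
static void demo_insert_at(struct demo_queue *q, size_t pos,
                           struct demo_request r)
{
    memmove(&q->req[pos + 1], &q->req[pos], (q->len - pos) * sizeof(r));
    q->req[pos] = r;
    q->len++;
}

/* Add r to the queue. Returns 0 on success, -1 if the queue is full. */
int demo_elevator_add(struct demo_queue *q, struct demo_request r,
                      unsigned long now)
{
    size_t i;

    /* 1. Merge with a request for adjacent sectors, if one exists. */
    for (i = 0; i < q->len; i++) {
        struct demo_request *e = &q->req[i];

        if (e->sector + e->nr == r.sector) {  /* r follows e: back merge */
            e->nr += r.nr;
            return 0;
        }
        if (r.sector + r.nr == e->sector) {   /* r precedes e: front merge */
            e->sector = r.sector;
            e->nr += r.nr;
            return 0;
        }
    }
    if (q->len == DEMO_QUEUE_MAX)
        return -1;

    /* 2. A sufficiently old request exists: append at the tail. */
    for (i = 0; i < q->len; i++) {
        if (now - q->req[i].issued > DEMO_MAX_AGE) {
            demo_insert_at(q, q->len, r);
            return 0;
        }
    }

    /* 3. Insert in sector order; 4. if no slot is found, i == len (tail). */
    for (i = 0; i < q->len; i++) {
        if (q->req[i].sector > r.sector)
            break;
    }
    demo_insert_at(q, i, r);
    return 0;
}
```

Note that step 2 only stops new requests from being sorted in ahead of old ones. It does nothing to service the old request sooner, which is the weakness that motivates the schedulers described next.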
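The Deadline I/O Scheduler
The Deadline I/O scheduler tries to prevent the starvation caused by the Linus Elevator. In the interest of minimizing seeks, heavy disk I/O to one area of the disk can indefinitely starve requests to another part of the disk.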
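The Anticipatory I/O Scheduler
Although the Deadline I/O scheduler does a good job of minimizing read latency, it does so at the expense of global throughput. The Anticipatory I/O scheduler aims to keep providing excellent read latency while also providing excellent global throughput.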
First, the Anticipatory I/O scheduler starts with the Deadline I/O scheduler as its base. Therefore, it is not entirely different. The Anticipatory I/O scheduler implements three queues (plus the dispatch queue) and expirations for each request, just like the Deadline I/O scheduler. The major change is the addition of an anticipation heuristic.
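The Complete Fair Queuing I/O Scheduler
The Complete Fair Queuing (CFQ) I/O scheduler is an I/O scheduler designed for specialized workloads, but in practice it performs well across multiple workloads.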
The CFQ I/O scheduler assigns incoming I/O requests to specific queues based on the process originating the I/O request. The difference with the CFQ I/O scheduler is that there is one queue for each process submitting I/O. The CFQ I/O scheduler then services the queues round robin, plucking a configurable number of requests (by default, four) from each queue before continuing on to the next.
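The round-robin servicing described above can be sketched as follows. This models only the dispatch order and is not the kernel's CFQ code: `struct demo_proc_queue`, `demo_cfq_dispatch`, the fixed number of processes, and the use of integer request IDs are assumptions made for illustration. `DEMO_QUANTUM` mirrors the default of four requests per queue per visit.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_NPROC   3  /* number of per-process queues in this sketch */
#define DEMO_QUANTUM 4  /* requests taken from a queue per visit */

/* Per-process FIFO of pending request IDs, modeled as a cursor. */
struct demo_proc_queue {
    const int *reqs;  /* request IDs in submission order */
    size_t len;       /* total number of requests */
    size_t next;      /* index of the next request to dispatch */
};

/*
 * Dispatch pending requests round robin, up to DEMO_QUANTUM per queue
 * per pass. Writes request IDs to out in dispatch order and returns how
 * many were written (at most out_len).
 */
size_t demo_cfq_dispatch(struct demo_proc_queue q[DEMO_NPROC],
                         int *out, size_t out_len)
{
    size_t n = 0;
    int progress = 1;

    while (progress) {
        progress = 0;
        for (size_t p = 0; p < DEMO_NPROC; p++) {
            for (int k = 0; k < DEMO_QUANTUM; k++) {
                if (q[p].next == q[p].len || n == out_len)
                    break;  /* queue drained or output full */
                out[n++] = q[p].reqs[q[p].next++];
                progress = 1;
            }
        }
    }
    return n;
}
```

A process with many queued requests gets only four serviced before the scheduler moves on, so a process with a single request is not stuck behind it. That per-process fairness is the point of the design.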
The Noop I/O Scheduler
The fourth and final I/O scheduler is the Noop I/O scheduler, so named because it is basically a noop: it does not do much.
The Noop I/O scheduler does not perform sorting or any other form of seek-prevention whatsoever. The Noop I/O scheduler does perform merging, however, as its lone chore. The Noop I/O scheduler's lack of hard work is with reason. It is intended for block devices that are truly random-access, such as flash memory cards.