Further writeback optimisation possible?
Currently dm-wb is able to write back 1.5-2 MB/s for totally random 4k I/O by ordering the segments.
bcache is able to write back at a much higher speed by merging and ordering I/Os sequentially. I think this is possible because it keeps a pool of unwritten segments. Currently I work with a permanent pool of 10GB of data: it always keeps 10GB of dirty data on the cache device and then selects mergeable and sequential data to write back.
Is something like this possible?
I have no plan for further optimization because writing back data in a FIFO manner fits log-structured caching, as I explained before.
And I wonder whether such an extravagant optimization pays off. Does it track all the unwritten data? If so, how large is the memory footprint? And how much computation time does it take to find sequential runs when the cache device is very big?
I want you to know that HDDs are smart enough devices. I measured the effect of the sorting and it was quite positive even though the input is random. This means HDDs perform well if the data is sorted in ascending order, even when it isn't sequential. In other words, what I don't trust is the I/O scheduler, not the HDDs.
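As a toy illustration of what "sorted in ascending order" means here (this is not dm-writeboost code; submit_write() is a made-up stand-in):

```c
/*
 * Illustrative only: sort a batch of pending writeback sectors in ascending
 * order before submission. submit_write() is a hypothetical stand-in.
 */
#include <stdint.h>
#include <stdlib.h>

typedef uint64_t sector_t;

extern void submit_write(sector_t sector); /* hypothetical I/O submission */

static int cmp_sector(const void *a, const void *b)
{
        sector_t x = *(const sector_t *)a, y = *(const sector_t *)b;
        return (x > y) - (x < y);
}

/*
 * Even if the sectors themselves are random, issuing them in ascending order
 * lets the disk head sweep in one direction instead of seeking back and forth.
 */
static void writeback_sorted(sector_t *sectors, size_t n)
{
        qsort(sectors, n, sizeof(sector_t), cmp_sector);
        for (size_t i = 0; i < n; i++)
                submit_write(sectors[i]);
}
```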
Mhm, makes sense - it seems there is no other way than FIFO. bcache currently works fine, it's just not maintained by Kent anymore since he has started focusing on bcachefs, so we have to keep a lot of extra patches ourselves. I don't know enough about the internals; it keeps some hash tree to calculate and measure this.
That's too bad.
In my opinion, bcache's codebase is too complicated and too huge. Only Kent can grasp everything in his software. (dm-cache is in a similar state.)
dm-writeboost, on the other hand, keeps the codebase as small as possible; it's only 5k lines. It has also been forked by a researcher for his research work (https://bitbucket.org/yongseokoh/dm-src). I want other developers to join dm-writeboost so I can reduce my tasks.
Yes, the bcache code is very complex. That's why I like dm-writeboost; it looks lightweight. Another question: does dm-writeboost not track the written blocks? What happens if I write block A and then read block A while block A is still in the RAM buffer or on the caching device? How does dm-writeboost know it has to answer/read the data not from the backing device?
You are welcome.
> Does dm-writeboost not track the written blocks? What happens if I write block A and then read block A while block A is still in the RAM buffer or on the caching device? How does dm-writeboost know it has to answer/read the data not from the backing device?

dm-writeboost manages a structure called metablock that corresponds to a 4KB cache block. A metablock manages the "dirtiness" of the cache block:
```c
struct dirtiness {
        bool is_dirty;
        u8 data_bits;
};

struct metablock {
        sector_t sector; /* The original aligned address */
        u32 idx; /* Const. Index in the metablock array */
        struct hlist_node ht_list; /* Linked to the hash table */
        struct dirtiness dirtiness;
};
```
The member dirtiness has is_dirty and data_bits. is_dirty indicates whether the cache block is still dirty, which is almost the same as whether it needs to be written back (e.g. if the cache block is the result of read caching, the flag is false). data_bits has 8 bits to record, for each of the 8 sectors in a 4KB cache block, whether cached data exists.
For a deeper understanding, please read the process_bio function.
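As a simplified illustration of the mechanism described above (this is a sketch with hypothetical helpers such as ht_lookup(), not the actual process_bio code), a read for block A could be resolved roughly like this:

```c
/*
 * Simplified, hypothetical sketch of the read-hit decision. It is NOT the
 * actual process_bio code; ht_lookup() and the simplified structs are
 * stand-ins for dm-writeboost's internals.
 */
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t sector_t;

struct dirtiness {
        bool is_dirty;
        uint8_t data_bits;      /* one bit per 512B sector in the 4KB block */
};

struct metablock {
        sector_t sector;        /* original aligned address */
        struct dirtiness dirtiness;
};

/* stand-in for the hash table lookup keyed by the aligned sector */
extern struct metablock *ht_lookup(sector_t aligned_sector);

/* Decide whether a 4KB-aligned read can be served from the cache. */
static bool read_can_be_served_from_cache(sector_t aligned_sector)
{
        struct metablock *mb = ht_lookup(aligned_sector);

        if (!mb)
                return false;   /* no cache entry: go to the backing device */

        /* all 8 sectors cached: serve from the RAM buffer / caching device */
        if (mb->dirtiness.data_bits == 0xff)
                return true;

        /* partially cached block: the real code handles this case specially */
        return false;
}
```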
I made some tests with fio and compared with EnhanceIO. What I found is: if the random write range is small, for example a 4G file with a 30G cache device, EnhanceIO does a good job of merging writeback requests; the average flush throughput is 4-10 MB/s, while dw-boostwrite is at 3-5 MB/s. If the write range is very large, like 48G, neither is very fast.
I've just made a utility to measure writeback/flush time, so I'll come back soon with actual data.
@bash99 It's quite dependent on the IO amount and its distribution. Please give me the details of your benchmark.
By the way, it's not dw-boostwrite but dm-writeboost you must be using.
@bash99 Also, please share how you set up your dm-writeboost'd device. It's important to know the max_batched_writeback in particular.
I don't know what kind of optimization EnhanceIO does, but generally speaking, this could be happening because dm-writeboost's writeback is restricted to go from the older segments to the newer ones. Think about what happens if a newer segment is written back before the older ones and the caching device suddenly breaks.
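To make the ordering constraint concrete, here is a minimal sketch (with hypothetical helper names, not the actual dm-writeboost code) of why the writeback loop goes strictly from older to newer segments:

```c
/*
 * Minimal sketch of the oldest-first (FIFO) writeback policy, assuming
 * segments carry a monotonically increasing id. segment_is_dirty-like flags
 * and write_back_segment() are hypothetical stand-ins, not dm-writeboost's
 * real functions.
 */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct segment {
        uint64_t id;       /* monotonically increasing: larger id == newer data */
        bool has_dirty;    /* still contains blocks not yet on the backing device */
};

extern bool write_back_segment(struct segment *seg); /* true on success */

/*
 * Write back strictly in ascending id order. If the caching device dies
 * halfway, the backing device holds a consistent prefix of the history:
 * everything older than some point is on disk, everything newer is lost,
 * but nothing newer was persisted before something older.
 */
static void writeback_fifo(struct segment *segs, size_t nr)
{
        for (size_t i = 0; i < nr; i++) {       /* segs[] assumed sorted by id */
                if (!segs[i].has_dirty)
                        continue;
                if (!write_back_segment(&segs[i]))
                        break;  /* stop; never skip ahead to newer segments */
        }
}
```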
Basically it's an official testbox: a 500G Seagate Barracuda 7200.14 as /dev/sda and a 240G Samsung 843T as /dev/sdb, with an i5-4590 and 8G of memory.
The cache size is 31.2GB and the backing device size is 150GB.
```
[root@LizardFS186 ~]# parted /dev/sda print
Number  Start   End    Size    Type     File system  Flags
 2      53.7GB  204GB  150GB   primary  xfs
 3      204GB   354GB  150GB   primary
[root@LizardFS186 ~]# parted /dev/sdb print
Number  Start   End    Size    File system  Name     Flags
 4      125GB   156GB  31.2GB               primary
 5      156GB   187GB  31.2GB               primary
```
I use /dev/sda3 and /dev/sdb5 for the writeboost setup:
```
[root@LizardFS186 ~]# cat /etc/writeboosttab
webhd /dev/sda3 /dev/sdb5 writeback_threshold=85,read_cache_threshold=32,update_sb_record_interval=60
```
and /dev/sda2 and /dev/sdb4 for the EnhanceIO setup:
```
eio_cli create -d /dev/sda2 -s /dev/sdb4 -p lru -m wb -c enchanceio_test
```
I use this function to wait for the backing device's flush to finish:
```bash
# Block until the given device's utilization (field 14 of `iostat -x` output
# in this sysstat version, i.e. %util) drops to or below the target value.
wait_for_io()
{
    dev=$1
    awk -v target="$2" '
    $14 ~ /^[0-9.]+$/ {
        if ($14 <= target) { exit(0); }
    }' < <(iostat -xy $dev 3)
}
```
I use the script below to kick off EnhanceIO cleanup shortly after fio starts:
```bash
sleep 3
sysctl -w dev.enhanceio.enchanceio_test.do_clean=1
```
and the commands below to make sure the dirty blocks have really been cleaned:
```bash
grep -i dirty /proc/enhanceio/enchanceio_test/stats
dmsetup status webhd | wb_status | grep dirty
```
The main test commands are below, one for writeboost and one for EnhanceIO; the total IOPS is limited with rate_iops.
```bash
./waitio.sh sda 1; time fio --direct=1 --filename=/dev/mapper/webhd --name fio_randw --refill_buffers --ioengine=libaio --rw=randwrite --bs=4k --size=4G --nrfiles=1 --thread --numjobs=16 --iodepth=32 --time_based --runtime=5 --group_reporting --norandommap --rate_iops=500; time ./waitio.sh sda 1; iostat -xy sda 1 1; date; time ./waitio.sh sda 1;
./waitio.sh sda 1; ./sleepkick.sh ; time fio --direct=1 --filename=/dev/sda2 --name fio_randw --refill_buffers --ioengine=libaio --rw=randwrite --bs=4k --size=4G --nrfiles=1 --thread --numjobs=16 --iodepth=32 --time_based --runtime=5 --group_reporting --norandommap --rate_iops=500; time ./waitio.sh sda 1; iostat -xy sda 1 1; date; time ./waitio.sh sda 1;
```
I've changed the runtime from 5 to 10 seconds and the size from 48G to 4G. (Changing /sys/block/sda/queue/scheduler between cfq/noop/deadline makes no difference.) And I got results like this:
| Test | fio time (s) | flush time (s) | total time (s) | MB written | total IOs | IOPS (incl. flush) |
|---|---|---|---|---|---|---|
| eio-48g | 5 | 150 | 155 | 160 | 40000 | 258.06 |
| dwb-48g | 5 | 182 | 187 | 160 | 40000 | 213.90 |
| eio-48g | 10 | 288 | 298 | 320 | 80000 | 268.46 |
| dwb-48g | 10 | 375 | 385 | 320 | 80000 | 207.79 |
| dwb-4g | 5 | 141 | 146 | 160 | 40000 | 273.97 |
| eio-4g | 5 | 105 | 110 | 160 | 40000 | 363.64 |
All the util shell scripts are in the attached scripts.zip.
@akiradeveloper I'm not sure EnhanceIO is totally safe in write-back mode if the cache device crashes, but what they say in the README is:
> The write-back engine in EnhanceIO has been designed from scratch.
> Several optimizations have been done. IO completion guarantees have
> been improved. We have defined limits to let a user control the amount
> of dirty data in a cache. Clean-up of dirty data is stopped by default
> under a high load; this can be overridden if required. A user can
> control the extent to which a single cache set can be filled with dirty
> data. A background thread cleans up dirty data at regular intervals.
> Clean-up is also done at regular intervals by identifying cache sets
> which have been written least recently.
Btw, I've got some funny results with ZFS and its ZIL cache, but I need more tests. I've shared my tests in Google Docs: SSD Cache WriteBack Test
@bash99 How do you write back the dirty caches with wb? And how do you know when it has completed?
@bash99 As the baseline you need the result with the HDD only. I think dmwb doesn't give any performance gain with this workload. Usually, a client application doesn't write in such a completely sparse way, so it's meaningless to optimize for such a workload.
Is it possible to limit the IO range with "--size=4G"?
I have done a similar experiment before (this test isn't runnable now because it's not maintained):
test("writeback sorting effect") {
val amount = 128 <> 1
slowDevice(Sector.G(2)) { backing =>
fastDevice(Sector.M(129)) { caching =>
Seq(4, 32, 128, 256).foreach { batchSize =>
Writeboost.sweepCaches(caching)
Writeboost.Table(backing, caching, Map("nr_max_batched_writeback" -> batchSize)).create { s =>
XFS.format(s)
XFS.Mount(s) { mp =>
reportTime(s"batch size = ${batchSize}") {
Shell.at(mp)(s"fio --name=test --rw=randwrite --ioengine=libaio --direct=1 --size=${amount}m --ba=4k --bs=4k --iodepth=32")
Shell("sync")
Kernel.dropCaches
s.dropTransient()
s.dropCaches()
}
}
}
}
}
}
}
The result was (at the time I did this):
```
Elapsed 61.329199777: writeboost batch_size(4)
Elapsed 36.761916445: writeboost batch_size(32)
Elapsed 27.058421746: writeboost batch_size(128)
Elapsed 85.989786731: backing ONLY
```
This means it takes 85 seconds with the HDD only but only 36 seconds with dmwb, which clearly shows that dmwb speeds up writeback. The dirty data amount was 64MB.
I think max_batched_writeback is 32 in your case, so scaling that 64MB/36s result linearly, it should take 90 seconds or so (160/64 × 36 ≈ 90) to finish 160MB.
@akiradeveloper Yes, I'm not sure it's worth doing more optimization for this case, and dmwb is faster than the raw HDD, about 3 times faster. As EnhanceIO has known bugs when you fsck the backing device while there is dirty data on the cache device, they may be doing some unsafe optimization. In my understanding, dmwb should only ever leave an old but clean file system on the backing device.
@bash99
> As EnhanceIO has known bugs when you fsck the backing device while there is dirty data on the cache device, they may be doing some unsafe optimization.

That must be so, but it's just a trade-off. Since dmwb is log-structured, the writeback thread can easily tell whether some data is older or newer than other data from the segment id. The reason I decided to write back from the older ones first is that this property is very important when it comes to production use.
FYI, there is a wiki page explaining this:
https://github.com/akiradeveloper/dm-writeboost/wiki/Log-structured-caching-explained
> In my understanding, dmwb should only ever leave an old but clean file system on the backing device.
It's not guaranteed but very likely so.
Dm-writeboost (dmwb) is great in its current version, and I think there might be room for improvement in performance. I did some log-structured garbage collection (GC) work on SSDs while working as an IBM researcher, and some of that experience may apply to the FIFO writeback procedure of dmwb. I would like to offer my humble ideas in the following.
I totally agree with Akira that the log-structured nature of dmwb should be strictly kept when writing back data to the backend. The suggested process could be as follows: 1) read back all dirty blocks from the $max_batched_writeback oldest segments; 2) filter out blocks that are already obsolete (i.e. have been re-written); 3) sort and merge them by LBA address; 4) find neighboring dirty blocks in other segments, read them out too (marking them clean once the writeback succeeds), and then issue larger sequential IO requests to the backend. A rough sketch of this idea follows below.
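In rough C, the proposal could look something like the sketch below. Every type and helper in it (dirty_block, collect_oldest_dirty, is_obsolete, pull_in_neighbors, issue_merged_write) is invented for illustration and is not part of dm-writeboost:

```c
/* Hypothetical sketch of the proposed 4-step batch writeback, not dmwb code. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

struct dirty_block {
        uint64_t lba;      /* location on the backing device (in sectors) */
        uint64_t seg_id;   /* segment holding the cached copy */
        void    *data;     /* 4KB payload read back from the caching device */
};

/* 1) read back all dirty blocks from the N oldest segments */
extern size_t collect_oldest_dirty(struct dirty_block *out, size_t max_segments);
/* 2) true if the block was re-written later (a newer copy exists) */
extern bool is_obsolete(const struct dirty_block *b);
/* 4) add dirty blocks from other segments that are LBA-adjacent; keeps order */
extern size_t pull_in_neighbors(struct dirty_block *blocks, size_t n);
/* submit one contiguous run as a single larger write to the backend */
extern void issue_merged_write(struct dirty_block *run, size_t n);

static int by_lba(const void *a, const void *b)
{
        uint64_t x = ((const struct dirty_block *)a)->lba;
        uint64_t y = ((const struct dirty_block *)b)->lba;
        return (x > y) - (x < y);
}

static void writeback_batch(size_t max_batched_writeback)
{
        static struct dirty_block blocks[4096];
        size_t n = collect_oldest_dirty(blocks, max_batched_writeback); /* step 1 */

        size_t m = 0;                                                   /* step 2 */
        for (size_t i = 0; i < n; i++)
                if (!is_obsolete(&blocks[i]))
                        blocks[m++] = blocks[i];

        qsort(blocks, m, sizeof(blocks[0]), by_lba);                    /* step 3 */
        m = pull_in_neighbors(blocks, m);                               /* step 4 */

        /* merge LBA-adjacent 4KB blocks (8 sectors apart) into larger runs */
        for (size_t i = 0, j; i < m; i = j) {
                for (j = i + 1; j < m && blocks[j].lba == blocks[j - 1].lba + 8; j++)
                        ;
                issue_merged_write(&blocks[i], j - i);
        }
}
```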
Of course this may take some effort to implement, but it has the potential to make dmwb the best-performing and most robust write cache ever, given its log-structured nature. If we implement the above, it will make sense to always keep a pre-defined number of segments on the dmwb device for the sake of sequential writeback.
Any comments?