Further writeback optimisation possible?
Currently dm-wb is able to write back 1.5-2 MB/s for totally random 4k I/O by ordering the segments.
bcache is able to write back at a much higher speed by merging and ordering I/Os sequentially. I think this is possible because it keeps a pool of unwritten segments. Currently I work with a permanent pool of 10GB of data: it always keeps 10GB of dirty data on the cache device and then selects mergeable and sequential data to write back.
Is something like this possible?
I have no plan for further optimization because writing back data in a FIFO manner fits log-structured caching, as I explained before.
And I wonder whether such an extravagant optimization pays off. Does it track all the unwritten data? If so, how large is the memory footprint? And how much computation time does it take to find sequential runs when the cache device is very big?
I want you to know that HDDs are smart enough devices. I measured the effect of the sorting and it was quite positive even though the input is random. This means HDDs perform well if the data is sorted in ascending order, even when it isn't sequential. In other words, what I don't trust is the I/O scheduler, not the HDDs.
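As a toy illustration of what "sorted in ascending order" means here (this is not dm-writeboost code; submit_write() is a made-up stand-in):

```c
/*
 * Illustrative only: sort a batch of pending writeback sectors in ascending
 * order before submission. submit_write() is a hypothetical stand-in.
 */
#include <stdint.h>
#include <stdlib.h>

typedef uint64_t sector_t;

extern void submit_write(sector_t sector); /* hypothetical I/O submission */

static int cmp_sector(const void *a, const void *b)
{
        sector_t x = *(const sector_t *)a, y = *(const sector_t *)b;
        return (x > y) - (x < y);
}

/*
 * Even if the sectors themselves are random, issuing them in ascending order
 * lets the disk head sweep in one direction instead of seeking back and forth.
 */
static void writeback_sorted(sector_t *sectors, size_t n)
{
        qsort(sectors, n, sizeof(sector_t), cmp_sector);
        for (size_t i = 0; i < n; i++)
                submit_write(sectors[i]);
}
```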
Mhm, makes sense - it seems there is no other way than FIFO. bcache currently works fine, it's just not maintained by Kent anymore since he has started focusing on bcachefs, so we have to keep a lot of extra patches ourselves. I don't know enough about the internals; it keeps some hash tree to calculate and measure this.
That's too bad.
In my opinion, bcache's codebase is too complicated and too huge. Only Kent can grasp everything in his software. (dm-cache is in a similar state.)
dm-writeboost, on the other hand, keeps the codebase as small as possible; it's only 5k lines. It has also been forked by a researcher for his research work (https://bitbucket.org/yongseokoh/dm-src). I want other developers to join dm-writeboost so I can reduce my tasks.
Yes, the bcache code is very complex. That's why I like dm-writeboost; it looks lightweight. Another question: does dm-writeboost not track the written blocks? What happens if I write block A and then read block A while block A is still in the RAM buffer or on the caching device? How does dm-writeboost know it has to answer/read the data not from the backing device?
You are welcome.
> Does dm-writeboost not track the written blocks? What happens if I write block A and then read block A while block A is still in the RAM buffer or on the caching device? How does dm-writeboost know it has to answer/read the data not from the backing device?

dm-writeboost manages a structure called metablock that corresponds to a 4KB cache block. A metablock manages the "dirtiness" of the cache block:
```c
struct dirtiness {
        bool is_dirty;
        u8 data_bits;
};

struct metablock {
        sector_t sector; /* The original aligned address */
        u32 idx; /* Const. Index in the metablock array */
        struct hlist_node ht_list; /* Linked to the hash table */
        struct dirtiness dirtiness;
};
```
The member dirtiness has is_dirty and data_bits. is_dirty indicates whether the cache block is still dirty, which is almost the same as whether it needs to be written back (e.g. if the cache block is the result of read caching, the flag is false). data_bits has 8 bits to record, for each of the 8 sectors in a 4KB cache block, whether cached data exists.
For a deeper understanding, please read the process_bio function.
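As a simplified illustration of the mechanism described above (this is a sketch with hypothetical helpers such as ht_lookup(), not the actual process_bio code), a read for block A could be resolved roughly like this:

```c
/*
 * Simplified, hypothetical sketch of the read-hit decision. It is NOT the
 * actual process_bio code; ht_lookup() and the simplified structs are
 * stand-ins for dm-writeboost's internals.
 */
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t sector_t;

struct dirtiness {
        bool is_dirty;
        uint8_t data_bits;      /* one bit per 512B sector in the 4KB block */
};

struct metablock {
        sector_t sector;        /* original aligned address */
        struct dirtiness dirtiness;
};

/* stand-in for the hash table lookup keyed by the aligned sector */
extern struct metablock *ht_lookup(sector_t aligned_sector);

/* Decide whether a 4KB-aligned read can be served from the cache. */
static bool read_can_be_served_from_cache(sector_t aligned_sector)
{
        struct metablock *mb = ht_lookup(aligned_sector);

        if (!mb)
                return false;   /* no cache entry: go to the backing device */

        /* all 8 sectors cached: serve from the RAM buffer / caching device */
        if (mb->dirtiness.data_bits == 0xff)
                return true;

        /* partially cached block: the real code handles this case specially */
        return false;
}
```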
I made some tests with fio and compared with EnhanceIO. What I found is: if the random write range is small, for example a 4G file with a 30G cache device, EnhanceIO does a good job of merging writeback requests; the average flush throughput is 4-10 MB/s, while dw-boostwrite is at 3-5 MB/s. If the write range is very large, like 48G, neither is very fast.
I've just made a utility to measure writeback/flush time, so I'll come back soon with actual data.
@bash99 It's quite dependent on the IO amount and its distribution. Please give me the details of your benchmark.
By the way, it's not dw-boostwrite but dm-writeboost you must be using.
@bash99 Also, please share how you set up your dm-writeboost'd device. It's important to know the max_batched_writeback in particular.
I don't know what kind of optimization EnhanceIO does, but generally speaking, this could be happening because dm-writeboost's writeback is restricted to go from the older segments to the newer ones. Think about what happens if a newer segment is written back before the older ones and the caching device suddenly breaks.
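To make the ordering constraint concrete, here is a minimal sketch (with hypothetical helper names, not the actual dm-writeboost code) of why the writeback loop goes strictly from older to newer segments:

```c
/*
 * Minimal sketch of the oldest-first (FIFO) writeback policy, assuming
 * segments carry a monotonically increasing id. segment_is_dirty-like flags
 * and write_back_segment() are hypothetical stand-ins, not dm-writeboost's
 * real functions.
 */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct segment {
        uint64_t id;       /* monotonically increasing: larger id == newer data */
        bool has_dirty;    /* still contains blocks not yet on the backing device */
};

extern bool write_back_segment(struct segment *seg); /* true on success */

/*
 * Write back strictly in ascending id order. If the caching device dies
 * halfway, the backing device holds a consistent prefix of the history:
 * everything older than some point is on disk, everything newer is lost,
 * but nothing newer was persisted before something older.
 */
static void writeback_fifo(struct segment *segs, size_t nr)
{
        for (size_t i = 0; i < nr; i++) {       /* segs[] assumed sorted by id */
                if (!segs[i].has_dirty)
                        continue;
                if (!write_back_segment(&segs[i]))
                        break;  /* stop; never skip ahead to newer segments */
        }
}
```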
Basically it's an official testbox: a 500G Seagate Barracuda 7200.14 as /dev/sda and a 240G Samsung 843T as /dev/sdb, with an i5-4590 and 8G of memory.
The cache size is 31.2GB and the backing device size is 150GB.
```
[root@LizardFS186 ~]# parted /dev/sda print
Number  Start   End    Size    Type     File system  Flags
 2      53.7GB  204GB  150GB   primary  xfs
 3      204GB   354GB  150GB   primary
[root@LizardFS186 ~]# parted /dev/sdb print
Number  Start   End    Size    File system  Name     Flags
 4      125GB   156GB  31.2GB               primary
 5      156GB   187GB  31.2GB               primary
```
I use /dev/sda3 and /dev/sdb5 for the writeboost setup:
```
[root@LizardFS186 ~]# cat /etc/writeboosttab
webhd /dev/sda3 /dev/sdb5 writeback_threshold=85,read_cache_threshold=32,update_sb_record_interval=60
```
and /dev/sda2 and /dev/sdb4 for the EnhanceIO setup:
```
eio_cli create -d /dev/sda2 -s /dev/sdb4 -p lru -m wb -c enchanceio_test
```
I use this function to wait for the backing device's flush to finish:
```bash
# Block until the given device's utilization (field 14 of `iostat -x` output
# in this sysstat version, i.e. %util) drops to or below the target value.
wait_for_io()
{
    dev=$1
    awk -v target="$2" '
    $14 ~ /^[0-9.]+$/ {
        if ($14 <= target) { exit(0); }
    }' < <(iostat -xy $dev 3)
}
```
I use the script below to kick off EnhanceIO cleanup shortly after fio starts:
```bash
sleep 3
sysctl -w dev.enhanceio.enchanceio_test.do_clean=1
```
and the commands below to make sure the dirty blocks have really been cleaned:
```bash
grep -i dirty /proc/enhanceio/enchanceio_test/stats
dmsetup status webhd | wb_status | grep dirty
```
The main test commands are below, one for writeboost and one for EnhanceIO; the total IOPS is limited with rate_iops.
```bash
./waitio.sh sda 1; time fio --direct=1 --filename=/dev/mapper/webhd --name fio_randw --refill_buffers --ioengine=libaio --rw=randwrite --bs=4k --size=4G --nrfiles=1 --thread --numjobs=16 --iodepth=32 --time_based --runtime=5 --group_reporting --norandommap --rate_iops=500; time ./waitio.sh sda 1; iostat -xy sda 1 1; date; time ./waitio.sh sda 1;
./waitio.sh sda 1; ./sleepkick.sh ; time fio --direct=1 --filename=/dev/sda2 --name fio_randw --refill_buffers --ioengine=libaio --rw=randwrite --bs=4k --size=4G --nrfiles=1 --thread --numjobs=16 --iodepth=32 --time_based --runtime=5 --group_reporting --norandommap --rate_iops=500; time ./waitio.sh sda 1; iostat -xy sda 1 1; date; time ./waitio.sh sda 1;
```
I've changed the runtime from 5 to 10 seconds and the size from 48G to 4G. (Changing /sys/block/sda/queue/scheduler between cfq/noop/deadline makes no difference.) And I got results like this:
| Test | fio time (s) | flush time (s) | total time (s) | MB written | total IOs | IOPS (incl. flush) |
|---|---|---|---|---|---|---|
| eio-48g | 5 | 150 | 155 | 160 | 40000 | 258.06 |
| dwb-48g | 5 | 182 | 187 | 160 | 40000 | 213.90 |
| eio-48g | 10 | 288 | 298 | 320 | 80000 | 268.46 |
| dwb-48g | 10 | 375 | 385 | 320 | 80000 | 207.79 |
| dwb-4g | 5 | 141 | 146 | 160 | 40000 | 273.97 |
| eio-4g | 5 | 105 | 110 | 160 | 40000 | 363.64 |
All the util shell scripts are in the attached scripts.zip.
@akiradeveloper I'm not sure EnhanceIO is totally safe in write-back mode if the cache device crashes, but what they say in the README is:
> The write-back engine in EnhanceIO has been designed from scratch.
> Several optimizations have been done. IO completion guarantees have
> been improved. We have defined limits to let a user control the amount
> of dirty data in a cache. Clean-up of dirty data is stopped by default
> under a high load; this can be overridden if required. A user can
> control the extent to which a single cache set can be filled with dirty
> data. A background thread cleans up dirty data at regular intervals.
> Clean-up is also done at regular intervals by identifying cache sets
> which have been written least recently.
Btw, I've got some funny results with ZFS and its ZIL cache, but I need more tests. I've shared my tests in Google Docs: SSD Cache WriteBack Test
@bash99 How do you write back the dirty caches with wb? And how do you know when it has completed?
@bash99 As the baseline you need the result with the HDD only. I think dmwb doesn't give any performance gain with this workload. Usually, a client application doesn't write in such a completely sparse way, so it's meaningless to optimize for such a workload.
Is it possible to limit the IO range with "--size=4G"?
I have done a similar experiment before (this test isn't runnable now because it's not maintained):
test("writeback sorting effect") {
val amount = 128 <> 1
slowDevice(Sector.G(2)) { backing =>
fastDevice(Sector.M(129)) { caching =>
Seq(4, 32, 128, 256).foreach { batchSize =>
Writeboost.sweepCaches(caching)
Writeboost.Table(backing, caching, Map("nr_max_batched_writeback" -> batchSize)).create { s =>
XFS.format(s)
XFS.Mount(s) { mp =>
reportTime(s"batch size = ${batchSize}") {
Shell.at(mp)(s"fio --name=test --rw=randwrite --ioengine=libaio --direct=1 --size=${amount}m --ba=4k --bs=4k --iodepth=32")
Shell("sync")
Kernel.dropCaches
s.dropTransient()
s.dropCaches()
}
}
}
}
}
}
}
The result was (at the time I did this):
```
Elapsed 61.329199777: writeboost batch_size(4)
Elapsed 36.761916445: writeboost batch_size(32)
Elapsed 27.058421746: writeboost batch_size(128)
Elapsed 85.989786731: backing ONLY
```
This means it takes 85 seconds with the HDD only but only 36 seconds with dmwb, which clearly shows that dmwb speeds up writeback. The dirty data amount was 64MB.
I think max_batched_writeback is 32 in your case, so scaling that 64MB/36s result linearly, it should take 90 seconds or so (160/64 × 36 ≈ 90) to finish 160MB.
@akiradeveloper Yes, I'm not sure it's worth doing more optimization for this case, and dmwb is faster than the raw HDD, about 3 times faster. As EnhanceIO has known bugs when you fsck the backing device while there is dirty data on the cache device, they may be doing some unsafe optimization. In my understanding, dmwb should only ever leave an old but clean file system on the backing device.
@bash99
> As EnhanceIO has known bugs when you fsck the backing device while there is dirty data on the cache device, they may be doing some unsafe optimization.

That must be so, but it's just a trade-off. Since dmwb is log-structured, the writeback thread can easily tell whether some data is older or newer than other data from the segment id. The reason I decided to write back from the older ones first is that this property is very important when it comes to production use.
FYI, there is a wiki page explaining this:
https://github.com/akiradeveloper/dm-writeboost/wiki/Log-structured-caching-explained
> In my understanding, dmwb should only ever leave an old but clean file system on the backing device.
It's not guaranteed but very likely so.
Dm-writeboost (dmwb) is great in its current version, and I think there might be room for improvement in performance. I did some log-structured garbage collection (GC) work on SSDs while working as an IBM researcher, and some of that experience may apply to the FIFO writeback procedure of dmwb. I would like to offer my humble ideas in the following.
I totally agree with Akira that the log-structured nature of dmwb should be strictly kept when writing back data to the backend. The suggested process could be as follows: 1) read back all dirty blocks from the $max_batched_writeback oldest segments; 2) filter out blocks that are already obsolete (i.e. have been re-written); 3) sort and merge them by LBA address; 4) find neighboring dirty blocks in other segments, read them out too (marking them clean once the writeback succeeds), and then issue larger sequential IO requests to the backend. A rough sketch of this idea follows below.
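In rough C, the proposal could look something like the sketch below. Every type and helper in it (dirty_block, collect_oldest_dirty, is_obsolete, pull_in_neighbors, issue_merged_write) is invented for illustration and is not part of dm-writeboost:

```c
/* Hypothetical sketch of the proposed 4-step batch writeback, not dmwb code. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

struct dirty_block {
        uint64_t lba;      /* location on the backing device (in sectors) */
        uint64_t seg_id;   /* segment holding the cached copy */
        void    *data;     /* 4KB payload read back from the caching device */
};

/* 1) read back all dirty blocks from the N oldest segments */
extern size_t collect_oldest_dirty(struct dirty_block *out, size_t max_segments);
/* 2) true if the block was re-written later (a newer copy exists) */
extern bool is_obsolete(const struct dirty_block *b);
/* 4) add dirty blocks from other segments that are LBA-adjacent; keeps order */
extern size_t pull_in_neighbors(struct dirty_block *blocks, size_t n);
/* submit one contiguous run as a single larger write to the backend */
extern void issue_merged_write(struct dirty_block *run, size_t n);

static int by_lba(const void *a, const void *b)
{
        uint64_t x = ((const struct dirty_block *)a)->lba;
        uint64_t y = ((const struct dirty_block *)b)->lba;
        return (x > y) - (x < y);
}

static void writeback_batch(size_t max_batched_writeback)
{
        static struct dirty_block blocks[4096];
        size_t n = collect_oldest_dirty(blocks, max_batched_writeback); /* step 1 */

        size_t m = 0;                                                   /* step 2 */
        for (size_t i = 0; i < n; i++)
                if (!is_obsolete(&blocks[i]))
                        blocks[m++] = blocks[i];

        qsort(blocks, m, sizeof(blocks[0]), by_lba);                    /* step 3 */
        m = pull_in_neighbors(blocks, m);                               /* step 4 */

        /* merge LBA-adjacent 4KB blocks (8 sectors apart) into larger runs */
        for (size_t i = 0, j; i < m; i = j) {
                for (j = i + 1; j < m && blocks[j].lba == blocks[j - 1].lba + 8; j++)
                        ;
                issue_merged_write(&blocks[i], j - i);
        }
}
```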
Of course this may take some effort to implement, but it has the potential to make dmwb the best-performing and most robust write cache ever, given its log-structured nature. If we implement the above, it will make sense to always keep a pre-defined number of segments on the dmwb device for the sake of sequential writeback.
Any comments?