
Add possibility to run a workload with 'rw=trim'

pwrobelse opened this issue 10 months ago • 12 comments

In order to check the impact of trim operations on read and write performance, diskplorer.py is extended with a new argument, '--sequential-trim', that lets users enable an additional workload which trims data written k seconds earlier (by default k=10).

The default time offset can be changed via the new parameter '--trim-offset-time-seconds', which denotes the offset between the write and trim workloads, expressed in seconds.

The bandwidth of the trim operation is identical to the bandwidth of the write operation used in the given iteration.

By default, the block size used by trim requests is calculated as 'trim_block_size = write_bw * (k/2)', where write_bw depends on the iteration.

However, the block size cannot be arbitrarily small. Therefore, the minimum block size for the trim operation can be configured via '--min-trim-block-size'; the default minimum is 32 MiB.

If the user wants the same 'trim_block_size' in every iteration, it can be set via '--force-trim-block-size'.

When the sequential trim workload is enabled, an additional write workload that prepares the data to be deleted is also run. Moreover, the original write and read workloads use the 'offset=' parameter to ensure that they do not touch trimmed data.
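
For illustration, a minimal sketch of the block-size selection described above (the helper name pick_trim_block_size is made up for this comment; it is not diskplorer's actual code):

    def pick_trim_block_size(write_bw, k=10, min_trim_block_size=32 << 20,
                             force_trim_block_size=None):
        # --force-trim-block-size: use the same block size in every iteration
        if force_trim_block_size is not None:
            return force_trim_block_size
        # default: trim_block_size = write_bw * (k / 2), clamped from below by
        # --min-trim-block-size (32 MiB by default)
        return max(int(write_bw * (k / 2)), min_trim_block_size)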

pwrobelse avatar Apr 18 '24 08:04 pwrobelse

Hi @avikivity, please find the PR that adds the rw=trim workload to diskplorer.

pwrobelse avatar Apr 24 '24 10:04 pwrobelse

Please include some results in latency-matrix-results (with links in README.md). Also include blktrace traces in the commitlog that show a fragment of a run.

avikivity avatar Apr 24 '24 11:04 avikivity

Please include some results in latency-matrix-results (with links in README.md). Also include blktrace traces in the commitlog that show a fragment of a run.

Note the .fio directories which contain the generated files.

avikivity avatar Apr 24 '24 12:04 avikivity

The changelog is as follows:

  • renamed --run-sequential-trim to --sequential-trim
  • introduced a new parameter, --min-trim-block-size, used to ensure that trim uses at least the given block size (defaults to 32 MiB)
  • added logic that selects the block size for the trim operation and calculates the time required to issue at least 4 discard commands (see the sketch after this list)
  • adjusted the logic that prepares the data for trim: it now writes at least trim_block_size of data, because as soon as the workload starts, a trim request with BS=trim_block_size may be issued
  • replaced the specific trim_data job name with {next(job_names)}: when multiple jobs are run as one group, the result file contains only the name of the first one, so the special name was not very useful
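
A rough sketch of that timing calculation, assuming one discard of trim_block_size bytes is issued every trim_block_size / write_bw seconds (the helper name min_runtime_for_discards is hypothetical, not diskplorer's actual code):

    def min_runtime_for_discards(trim_block_size, write_bw, n_discards=4):
        # With the trim workload paced at the write bandwidth, one discard of
        # trim_block_size bytes goes out roughly every trim_block_size / write_bw
        # seconds, so issuing at least n_discards of them needs about:
        return n_discards * trim_block_size / write_bw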

pwrobelse avatar May 02 '24 14:05 pwrobelse

Please supply an example run in latency-matrix-results/ (with references from README.md).

Perhaps even two: one with 32MB trims, and one with much smaller trims to demonstrate how bad it is.

avikivity avatar May 02 '24 14:05 avikivity

Please find some results from blktrace+blkparse. In case more data is needed, I can provide the whole file - the 60s trace is around 2GB.

blktrace was started during the prepare_data_for_trim job and ran in parallel with read+write+trim. As far as I understand, the offsets printed by blkparse assume 512B blocks.

The first thing is that every 5 seconds I can see a DS (Discard-Synchronous) request submitted by FIO. This makes sense, because the trim_block_size used was write_bw*(k/2) with k=10 and write_bw=120040634 (bytes/s), which resulted in a 572 MiB block size, greater than the default 32 MiB. Given the write bandwidth, one request had to be issued every 5 seconds.

259,0    1     9482     7.124364072  8699  Q  DS 0 + 1172272 [fio]
259,0    1     9483     7.124371614  8699  G  DS 0 + 1172272 [fio]
259,0    1     9484     7.124372665  8699  P   N [fio]
259,0    1     9485     7.124374263  8699 UT   N [fio] 1
259,0    1     9486     7.124375053  8699  I  DS 0 + 1172272 [fio]
259,0    1     9487     7.124393062   101  D  DS 0 + 1172272 [kworker/1:1H]
259,0    3     2432     7.125765420  8701  C  DS 0 + 1172272 [0]

[...]

259,0    0   987015    12.124114673  8699  Q  DS 1172272 + 1172272 [fio]

[...]

259,0    0   987017    12.124129310  8699  G  DS 1172272 + 1172272 [fio]

259,0    0   987020    12.124132412  8699  I  DS 1172272 + 1172272 [fio]

[...]

259,0    0   987024    12.124175848    77  D  DS 1172272 + 1172272 [kworker/0:1H]

259,0    1  1032289    12.125648947     0  C  DS 1172272 + 1172272 [0]

[...]

259,0    0   987081    12.125900229  8699  Q  DS 2344544 + 1172272 [fio]
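
As a rough cross-check of the 5-second cadence visible above (discards at ~7.12s and ~12.12s), plugging in the values from this run:

    write_bw = 120_040_634                  # bytes/s for this iteration
    k = 10                                  # trim offset in seconds
    trim_block_size = write_bw * k // 2     # 600_203_170 B, about 572 MiB (above the 32 MiB floor)
    interval = trim_block_size / write_bw   # = k/2 = 5.0 s between discards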

The second thing is that between discard requests, I can see WS (write synchronous) and R (read) requests.

259,0    3   959682    12.124393616  8700  Q   R 177701496 + 1 [fio]
259,0    3   959683    12.124394005  8700  G   R 177701496 + 1 [fio]
259,0    3   959684    12.124394325  8700  D   R 177701496 + 1 [fio]
259,0    3   959685    12.124396938  8700  Q   R 835896982 + 1 [fio]
259,0    3   959686    12.124397542  8700  G   R 835896982 + 1 [fio]
259,0    3   959687    12.124397806  8700  D   R 835896982 + 1 [fio]
259,0    0   987030    12.124536193  8698  Q  WS 3514880 + 256 [fio]
259,0    0   987031    12.124537399  8698  G  WS 3514880 + 256 [fio]
259,0    0   987032    12.124539745  8698  D  WS 3514880 + 256 [fio]
259,0    1  1032289    12.125648947     0  C  DS 1172272 + 1172272 [0]
259,0    2  1017012    12.125691005     0  C   R 1826996167 + 1 [0]
259,0    0   987033    12.125692812  8699  C   R 924043527 + 1 [0]
259,0    2  1017013    12.125694101     0  C   R 376702341 + 1 [0]
259,0    2  1017014    12.125694738     0  C   R 1609579396 + 1 [0]
259,0    2  1017015    12.125695357     0  C   R 439600678 + 1 [0]
259,0    2  1017016    12.125695962     0  C   R 498326998 + 1 [0]
259,0    0   987034    12.125697836  8699  C   R 888870918 + 1 [0]
259,0    0   987035    12.125698584  8699  C   R 1491220998 + 1 [0]
259,0    0   987036    12.125699417  8699  C   R 1800930502 + 1 [0]
259,0    0   987037    12.125700129  8699  C   R 136082322 + 1 [0]
259,0    0   987038    12.125700873  8699  C   R 961352789 + 1 [0]
259,0    3   959688    12.125711768     0  C   R 257796490 + 1 [0]
259,0    2  1017017    12.125714338  8701  C   R 309401678 + 1 [0]

pwrobelse avatar May 02 '24 14:05 pwrobelse

Interesting.

  1. The huge discard completes in a millisecond. Clearly it didn't really complete - it was just queued by the disk. If we keep piling them on, the disk will eventually choke on all the queued work. We've seen exactly the same with writes. So pacing is necessary. Not a surprise but good to see confirmation.
  2. The read+write are odd. They aren't part of the workload running in parallel?

avikivity avatar May 02 '24 15:05 avikivity

2. The read+write are odd. They aren't part of the workload running in parallel?

Hi @avikivity. Please find the answer below.

The read+write are part of the workload running in parallel. To compare against the blktrace output for read+write+trim, I also collected traces for a plain read+write run. The results look similar.

259,0    7        7     0.000375921 26975  Q  WS 2864640 + 256 [fio]
259,0    7        8     0.000377171 26975  G  WS 2864640 + 256 [fio]
259,0    7        9     0.000379278 26975  D  WS 2864640 + 256 [fio]
259,0    0       19     0.000408372 26981  Q   R 715711729 + 1 [fio]
259,0    0       20     0.000409008 26981  G   R 715711729 + 1 [fio]
259,0    0       21     0.000409402 26981  D   R 715711729 + 1 [fio]
259,0    0       22     0.000412489 26981  Q   R 47439307 + 1 [fio]
259,0    0       23     0.000412909 26981  G   R 47439307 + 1 [fio]
259,0    0       24     0.000413187 26981  D   R 47439307 + 1 [fio]
259,0    0       25     0.000415492 26981  Q   R 609329307 + 1 [fio]
259,0    0       26     0.000415789 26981  G   R 609329307 + 1 [fio]
259,0    0       27     0.000416047 26981  D   R 609329307 + 1 [fio]
259,0    5       40     0.000417876     0  C   R 376955347 + 1 [0]
259,0    6       37     0.000418529     0  C   R 2968317088 + 1 [0]
259,0    7       10     0.000419634     0  C   R 2449824514 + 1 [0]
259,0    5       41     0.000423759     0  C   R 672399320 + 1 [0]
259,0    6       38     0.000423788     0  C   R 615315517 + 1 [0]
259,0    5       42     0.000437485     0  C   R 1335922978 + 1 [0]
259,0    0       28     0.000440929     0  C   R 3697026643 + 1 [0]
259,0    6       39     0.000449629     0  C   R 1474999982 + 1 [0]
259,0    0       29     0.000449808     0  C   R 2064158733 + 1 [0]
259,0    0       30     0.000450330     0  C   R 84724269 + 1 [0]
259,0    6       40     0.000454348     0  C   R 1478299699 + 1 [0]
259,0    5       43     0.000455490 26982  C   R 644217706 + 1 [0]
259,0    0       31     0.000456481     0  C   R 2635200624 + 1 [0]
259,0    2        7     0.000459449 26977  Q   R 1228640516 + 1 [fio]
259,0    2        8     0.000460377 26977  G   R 1228640516 + 1 [fio]
259,0    6       41     0.000460957     0  C   R 38201407 + 1 [0]
259,0    2        9     0.000460965 26977  D   R 1228640516 + 1 [fio]
259,0    2       10     0.000465650 26977  Q   R 787274164 + 1 [fio]
259,0    2       11     0.000466107 26977  G   R 787274164 + 1 [fio]
259,0    2       12     0.000466432 26977  D   R 787274164 + 1 [fio]
259,0    6       42     0.000468552     0  C   R 664170101 + 1 [0]
259,0    5       44     0.000469230 26982  C   R 1186353899 + 1 [0]
259,0    6       43     0.000469243     0  C   R 555552967 + 1 [0]
259,0    2       13     0.000470033 26977  Q   R 2473290180 + 1 [fio]
259,0    2       14     0.000470441 26977  G   R 2473290180 + 1 [fio]
259,0    2       15     0.000470830 26977  D   R 2473290180 + 1 [fio]
259,0    5       45     0.000478178 26982  C   R 3479163011 + 1 [0]
259,0    6       44     0.000484038     0  C   R 2920209455 + 1 [0]
259,0    4       19     0.000485207 26976  Q   R 2002892698 + 1 [fio]
259,0    5       46     0.000492594 26982  C   R 983889880 + 1 [0]
259,0    4       20     0.000493617 26976  C  WS 2864640 + 256 [0]

I did not specify --read-buffer-size, so diskplorer used the value read from {dev_path}/queue/logical_block_size, which was 512. The same applies to --write-buffer-size: it used the default value, 128*1024. Thus, the block sizes from blktrace seem to match the defaults (blkparse reports sizes in 512B sectors), so reads show +1 and writes show +256.
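
The sector arithmetic behind the +1 and +256 sizes (blkparse reports request sizes in 512-byte sectors):

    sector = 512
    read_buffer_size = 512            # {dev_path}/queue/logical_block_size
    write_buffer_size = 128 * 1024    # diskplorer's default --write-buffer-size
    read_buffer_size // sector        # 1   -> "R ... + 1" in the trace
    write_buffer_size // sector       # 256 -> "WS ... + 256" in the trace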

What is strange is that WS (Write Synchronous) is used. sync is not specified in any of the jobs, and by default FIO uses sync=none.

sync=str    Whether, and what type, of synchronous I/O to use for writes. The allowed values are: none - Do not use synchronous IO, the default.

Reference: https://fio.readthedocs.io/en/latest/fio_doc.html#cmdoption-arg-sync

pwrobelse avatar May 13 '24 10:05 pwrobelse

Now I don't remember why I thought they were odd.

About the write-sync part - I have a vague memory that aio+dio sets that as a hint to the drive (or the scheduler) to give them higher priority. Buffered writes were already acknowledged, so they aren't latency sensitive, but dio writes are.

avikivity avatar May 13 '24 13:05 avikivity

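        /* direct I/O (dio) write path in the kernel: writes are tagged REQ_SYNC
         * (and REQ_IDLE), which blktrace reports as the WS flag */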
        dio->inode = inode;
        if (iov_iter_rw(iter) == WRITE) {
                dio->opf = REQ_OP_WRITE | REQ_SYNC | REQ_IDLE;
                if (iocb->ki_flags & IOCB_NOWAIT)
                        dio->opf |= REQ_NOWAIT;
        } else {
                dio->opf = REQ_OP_READ;
        }

avikivity avatar May 13 '24 13:05 avikivity

btw, use blkparse -t to get timing information.

avikivity avatar May 13 '24 13:05 avikivity

New changes since the last version of the PR:

  • added a --force-trim-block-size parameter to diskplorer.py to allow forcing a specific block size for trim (e.g. 32MB)
  • included the generated FIO files in latency-matrix-results/i3.2xlarge.trim32MB.fio
  • updated README.md and added the chart + JSON from the run with 32MB trim requests

pwrobelse avatar Jun 05 '24 10:06 pwrobelse