
Read performance - ok / optimal

Open · 2TAC opened this issue 4 years ago · 8 comments

I see there's another issue about read performance but that seems to be about a specific bug / environment so I'll create a new one.

Using M8 media in an HPE LTO8 autoloader, what is considered the best possible read performance currently? I'm seeing around 275-280 MB/s writes, which is very close to my raw tape speed, and 210 MB/s reads (+/- 10 MB/s). 210 MB/s is fine, but as always, faster would be better. I'm using the latest master in Debian 10 on a Xeon 2155 with 256 GB RAM and enterprise NVMe drives. The files on the tape are 50-500 MB, and I've tried using ltfs_ordered_copy and my own file copier with about the same result.

Is this the expected speed, or is there something I can do to improve it?

2TAC · Jan 31 '21

Hi,

Could you try the -o direct_io option of the ltfs command when you mount the LTFS volume? Something like:

ltfs -o devname=[drive_serial] -o tape_backend=sg -o sync_type=unmount -o direct_io /mnt

In my experience, the direct I/O option improves read performance in many cases because it skips data caching while writing a file.

Generally, the kernel's cache logic does not dispose of cached data sequentially, so the drive may receive LOCATE commands while reading the file back. The direct I/O option forces the file to be read sequentially, so read performance is improved in many cases.

piste-jp · Feb 01 '21

I tried it before while looking at the other issue about read performance; I seem to remember getting lower performance with that option enabled. I will rerun the tests this evening just to be sure.

2TAC · Feb 01 '21

I did some testing and now I am more confused. All speeds below are in base-1000 MB (my first post used MiB).

Writes: 303 MB/s (with direct_io both set and unset; fairly consistent regardless of the source or other settings)

Reads with the direct_io mount option:

- dd to /dev/null: 301 MB/s
- dd with iflag=direct, to ramdisk: 230 MB/s
- dd with iflag=direct and oflag=direct, to ramdisk: 188 MB/s
- dd with iflag=direct, to NVMe: 229 MB/s
- dd with iflag=direct and oflag=direct, to NVMe: 284 MB/s
- dd without flags, to NVMe: 115 MB/s
- ltfs_ordered_copy to NVMe: 109 MB/s
- my test tool using sendfile or C++17 copy_file (same speed), to NVMe: 112 MB/s

Reads without the direct_io option on the ltfs mount command:

- dd to /dev/null: 302 MB/s
- dd with iflag=direct, to ramdisk: 242 MB/s
- dd with iflag=direct and oflag=direct, to ramdisk: 191 MB/s
- dd with iflag=direct, to NVMe: 231 MB/s
- dd with iflag=direct and oflag=direct, to NVMe: 284 MB/s
- dd without flags, to NVMe: 224 MB/s
- ltfs_ordered_copy to NVMe: 223 MB/s
- my test tool using sendfile or C++17 copy_file (same speed), to NVMe: 227 MiB/s

I wonder if it's worth implementing a copy tool using multiple big buffers and multithreaded read/write. At least the test writing to /dev/null shows that it might be possible to improve the speeds. The ramdisk and NVMe should be more than fast enough to reach full speed if any waits in the reading code can be avoided. What do you think? And why does oflag=direct to dd make such a difference (and in the opposite direction) for transfers to ramdisk vs. NVMe?

2TAC · Feb 01 '21

So I borrowed some code from some of my other projects and made a copier with separate reader and writer threads. After some fine-tuning of block sizes and queues, it gets quite close to the 300 MB/s raw speed. It's sensitive to changes, though: just doing a sleep(1 ms) instead of a yield in the writer causes a noticeable drop in performance, even though the block queue stays well within capacity.

2TAC · Feb 01 '21

It's interesting.

In my experience with RHEL 7 and IBM's LTO full-height drives, a multi-threaded copy architecture has never improved performance, because FUSE and the drive already do read-ahead implicitly. But your results suggest that it is effective here. I want to understand where this difference comes from.

As you know, the current ltfs_ordered_copy just uses shutil.copy() or shutil.copy2(), so it doesn't handle any low-level copy logic at this time. It just focuses on reordering the target files by their position on tape (ltfs.partition and ltfs.startblock) to reduce unnecessary seeks.

piste-jp · Feb 08 '21
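For reference, a minimal Python sketch of that reordering idea (this is not the actual ltfs_ordered_copy code, and the exact names of the LTFS virtual extended attributes in the Linux user namespace are an assumption here):

```python
# Sketch: sort files by their recorded position on tape before copying them,
# so the drive can stream forward instead of seeking back and forth.
# The xattr names "user.ltfs.partition" / "user.ltfs.startblock" are an
# assumption about how the LTFS virtual extended attributes are exposed on
# a Linux FUSE mount.
import os
import shutil

def tape_position(path):
    try:
        part = os.getxattr(path, "user.ltfs.partition").decode().strip()
        block = int(os.getxattr(path, "user.ltfs.startblock").decode().strip())
        return (part, block)
    except OSError:
        # If the attributes are missing, copy the file after the ordered ones.
        return ("~", float("inf"))

def ordered_copy(files, dest_dir):
    for src in sorted(files, key=tape_position):
        shutil.copy2(src, os.path.join(dest_dir, os.path.basename(src)))
```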

My initial C++ code also just used a standard copy/sendfile and performed exactly like ltfs_ordered_copy. Now I use one input thread and one output thread, a block queue, and direct I/O, and I get 290 MiB/s instead of 210-220 MiB/s, so I am very happy with the result on this machine. I'm restoring and duplicating some PBs of data, so any speed improvement is important.

2TAC · Feb 10 '21

I think it would be good to implement the copy logic with a reader thread and a writer thread in the future.

piste-jp · Feb 11 '21
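For illustration, a minimal sketch of what such a reader/writer pipeline could look like in Python. The block size, queue depth, and EOF signalling below are illustrative choices, not a design for ltfs_ordered_copy:

```python
# Sketch of a two-thread copy: one thread reads large blocks from the source
# file on tape, the other writes them to the destination, decoupled by a
# bounded queue so the tape can keep streaming while the writer catches up.
# BLOCK_SIZE and QUEUE_DEPTH are illustrative values, not tuned recommendations.
import queue
import threading

BLOCK_SIZE = 4 * 1024 * 1024   # read in 4 MiB chunks
QUEUE_DEPTH = 8                # bound the queue so memory use stays predictable

def threaded_copy(src_path, dst_path):
    blocks = queue.Queue(maxsize=QUEUE_DEPTH)

    def reader():
        with open(src_path, "rb") as src:
            while True:
                block = src.read(BLOCK_SIZE)
                blocks.put(block)      # an empty bytes object signals EOF
                if not block:
                    break

    def writer():
        with open(dst_path, "wb") as dst:
            while True:
                block = blocks.get()
                if not block:
                    break
                dst.write(block)

    t_reader = threading.Thread(target=reader)
    t_writer = threading.Thread(target=writer)
    t_reader.start()
    t_writer.start()
    t_reader.join()
    t_writer.join()
```

Direct I/O, error handling, and multi-file ordering are left out to keep the sketch short.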

Wanted to chime in here that I did some tests using pysendfile, modified so I could specify the exact buffer size I wanted (512 KB), and it gave much better results than shutil.copy(). I would be interested to know whether sendfile can be made to work with separate reader/writer threads to improve performance even more. My speeds were about 240 MB/s before the sendfile code and about 257 MB/s after.

softloft38p-michael · May 05 '21
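For illustration, chunked copying with the standard library's os.sendfile looks roughly like this. The helper below is hypothetical (it is not the modified pysendfile code), and the chunk size simply mirrors the 512 KB buffer mentioned above:

```python
# Hypothetical helper: copy a file in fixed-size chunks with os.sendfile so
# the data is moved in-kernel rather than through a userspace buffer.
# Requires Linux (file-to-file sendfile) and Python 3.3+.
import os

CHUNK = 512 * 1024  # roughly the buffer size discussed above

def sendfile_copy(src_path, dst_path):
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        offset = 0
        while True:
            sent = os.sendfile(dst.fileno(), src.fileno(), offset, CHUNK)
            if sent == 0:   # EOF
                break
            offset += sent
```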