
CLI switches for parallel workers and hash verification before closing source and destination files

Open omac777 opened this issue 6 years ago • 5 comments

Your file copy is sound, but I didn't see any switches to adjust the number of threads used to copy files in parallel.

My change request is that you provide a CLI clap arg for the CPU cores to use, e.g. `--cpucorepercenttouse 100`. In this use case, if your system has 8 cores, it would use 8 workers to copy 8 files in parallel. Being able to set it lower, to allow the computer to do other things, would be useful.
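
To make the request concrete, here is a minimal sketch of the worker-count logic. Plain `std::env` parsing stands in for a real clap arg; the flag name is taken from this request, and the hard-coded core count of 8 is a placeholder for something like the `num_cpus` crate:

```rust
use std::env;

/// Map a --cpucorepercenttouse value onto a worker count.
/// Clamping to 100 and to at least one worker is an assumption, not ppcp behavior.
fn workers_from_percent(total_cores: usize, percent: usize) -> usize {
    std::cmp::max(1, total_cores * percent.min(100) / 100)
}

fn main() {
    // Crude stand-in for a clap arg: look for "--cpucorepercenttouse N" in argv.
    let percent: usize = env::args()
        .skip_while(|a| a.as_str() != "--cpucorepercenttouse")
        .nth(1)
        .and_then(|v| v.parse().ok())
        .unwrap_or(100);
    // A real implementation would query the core count (e.g. via the num_cpus crate).
    let total_cores = 8;
    println!("workers: {}", workers_from_percent(total_cores, percent));
}
```

With 8 cores, `--cpucorepercenttouse 50` would yield 4 workers, and 0 still yields 1 so the copy can proceed.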

`--verifybeforeclosefiles [hashcheckalgo]`: in this use case you hash-check the source against the hash of the destination after copying the file. Hashing while all the bytes are still in memory before the write would be cheating, since the whole point is to verify after the write, not before. Technically, after all the bytes are written but before you close the file, you could seek back to the beginning of the destination, read all the bytes from the destination device up to the byte count, and only then close the destination file. That could save precious time, since I believe closing a file is an expensive OS operation.
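
A minimal sketch of that write-then-verify-before-close flow, using std only (a toy FNV-1a checksum stands in for a real `hashcheckalgo` like SHA-256, and the whole-file-in-memory copy is a simplification). One caveat worth noting: an immediate read-back is often served from the OS page cache rather than the device, so a real implementation that wants to verify the media may need `O_DIRECT` or a cache drop:

```rust
use std::fs::{File, OpenOptions};
use std::io::{Read, Seek, SeekFrom, Write};

// Toy 64-bit FNV-1a checksum; a real tool would use a cryptographic hash.
fn fnv1a(bytes: &[u8]) -> u64 {
    bytes
        .iter()
        .fold(0xcbf29ce484222325u64, |h, b| (h ^ *b as u64).wrapping_mul(0x100000001b3))
}

fn copy_and_verify(src: &str, dst: &str) -> std::io::Result<bool> {
    let mut data = Vec::new();
    File::open(src)?.read_to_end(&mut data)?;
    let src_hash = fnv1a(&data);

    let mut out = OpenOptions::new()
        .read(true)
        .write(true)
        .create(true)
        .truncate(true)
        .open(dst)?;
    out.write_all(&data)?;
    out.flush()?;

    // Verify after the write but before close: seek back and re-read the destination.
    out.seek(SeekFrom::Start(0))?;
    let mut readback = Vec::new();
    out.read_to_end(&mut readback)?;
    Ok(fnv1a(&readback) == src_hash)
    // `out` is dropped (closed) here, only after verification.
}

fn main() -> std::io::Result<()> {
    std::fs::write("src.tmp", b"hello ppcp")?;
    println!("verified: {}", copy_and_verify("src.tmp", "dst.tmp")?);
    std::fs::remove_file("src.tmp")?;
    std::fs::remove_file("dst.tmp")
}
```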

omac777 avatar Apr 13 '19 22:04 omac777

First, I don't think you will get any speedup from parallelization; more likely the speed will degrade. On an HDD, for example, it means additional head repositioning instead of linear reading. Next, the topmost progress bar shows the progress of the currently copied file, and the line above it shows the current filename. What is your idea for multiple files? Show a progress bar and filename for each?

acidnik avatar Apr 14 '19 23:04 acidnik

On 4/14/19 7:13 PM, acidnik wrote:

> First, I don't think you will get any speedup from parallelization; more likely the speed will degrade. On an HDD, for example, it means additional head repositioning instead of linear reading. Next, the topmost progress bar shows the progress of the currently copied file, and the line above it shows the current filename. What is your idea for multiple files? Show a progress bar and filename for each?

I would like to address your misunderstanding about speed degradation from parallelizing file operations:

1) It depends on the underlying operating system and hardware. The OS can handle the parallelism and, at the lowest level, serialize it for the hard drive, hopefully without disk chatter, so I would contest the speed-degradation perspective. However, the hard drive's health will suffer and bad blocks/drive failure will arrive more quickly, so you will have to buy another hard drive sooner as a result of parallelizing file operations. That is expected, but it is a problem when the data is not redundant and lives on only one hard drive.

2) If you are running legacy hard drives in a more state-of-the-art configuration, such as a parallel file system spanning many hard drives (e.g. OrangeFS, WekaIO, LizardFS), then what you are saying is not necessarily true. As stated before, bad blocks/drive failure will arrive more quickly, but that is expected and not a problem, since the data is redundant and drives can be replaced.

3) If you are using a single state-of-the-art non-mechanical NVMe drive, then what you are saying is entirely false. NVMe drive circuitry can handle parallelism gracefully since there are no moving parts, and the more expensive enterprise-level NVMe drives in particular are designed with this in mind. No disk chatter, no head-positioning wear.

4) If you are using many NVMe drives with a state-of-the-art parallel file system on top of them, then you will get the optimal performance from parallelizing file operations without affecting the drives' life expectancy, since they do not suffer from disk chatter or head-positioning wear.

BOTTOM LINE: don't design it only for a single mechanical drive; design it for NVMe drives, lots of them :) I do appreciate your comment about sshfs for remote drive mounts, by the way, though I'm not sure it is the most optimal route for remote file systems. There are many tools out there, but the current tool of choice for remote file operations has been https://en.wikipedia.org/wiki/Rsync, which does offer concurrency control. It is written in C, but the code is mature (it hasn't changed much in a while). It could be interesting to see ppcp morph into an rsync-like tool written in Rust.

Now, back to your interface question about displaying progress for parallel file operations. You have already seen something that does parallelism in Rust: cargo build goes crazy with all the cores you have. It tells you which packages it is downloading in parallel, but on one line, flipping between them. The trick isn't displaying their progress in parallel; the trick is defining a unit of completion (e.g. a fixed buffer size of bytes) and sending a message to a receiving channel to show progress, as you already do. You have one worker; turn that into many workers all sending to the same receiving channel, and that should be enough to display progress for all of them.
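
The many-workers/one-channel shape described above could be sketched like this (std `mpsc` stands in for whatever channel ppcp actually uses, e.g. crossbeam; the `Progress` message type and the fixed 64 KiB chunk size are assumptions for illustration):

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical progress message; ppcp's real channel payload will differ.
enum Progress {
    Chunk { worker: usize, bytes: u64 },
    Done { worker: usize },
}

/// Spawn `workers` threads that each report `chunks_per_worker` fixed-size
/// chunks into one shared channel; return (workers finished, bytes reported).
fn run_workers(workers: usize, chunks_per_worker: usize) -> (usize, u64) {
    let (tx, rx) = mpsc::channel();
    let mut handles = Vec::new();
    for id in 0..workers {
        let tx = tx.clone();
        handles.push(thread::spawn(move || {
            for _ in 0..chunks_per_worker {
                // Each worker reports a chunk as it "copies" it.
                tx.send(Progress::Chunk { worker: id, bytes: 64 * 1024 }).unwrap();
            }
            tx.send(Progress::Done { worker: id }).unwrap();
        }));
    }
    drop(tx); // the receive loop ends once every sender is gone

    // Single receiver: this is where a TUI would redraw its progress bars.
    let (mut done, mut total) = (0usize, 0u64);
    for msg in rx {
        match msg {
            Progress::Chunk { bytes, .. } => total += bytes,
            Progress::Done { .. } => done += 1,
        }
    }
    for h in handles {
        h.join().unwrap();
    }
    (done, total)
}

fn main() {
    let (done, total) = run_workers(4, 3);
    println!("{} workers done, {} bytes reported", done, total);
}
```

The display side stays single-threaded: only the receiver touches the terminal, so the workers never contend over drawing.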

Suggestion for another switch: `--tui on`. This displays the TUI progress; if not given on the CLI, it defaults to on.

"--tui off" just send to stdout as legacy cp/rsync tools do.

I hope this helps to make your tool even better.

omac777 avatar Apr 15 '19 00:04 omac777

Thanks for the clarifications, that was quite an interesting read. But I'd really like to see some benchmarks showing that parallel copy (assuming source and destination are on the same device, as would be the most common case with ppcp) makes a noticeable positive difference.

`--tui off` will not be implemented. ppcp is designed to be an interactive tool; without the TUI it would be just a poor clone of existing tools.

acidnik avatar Apr 15 '19 09:04 acidnik

You are asking for a test case showing that a parallel copy would outperform a sequential copy. You are also asking whether your tool would be a poor clone of existing non-TUI tools if a non-TUI option existed. I will address both.

Try creating:

1) 5 million 25 KiB files spread across unique directories holding 4 thousand files each.
2) 5 million 25 KiB files spread across unique directories holding 2 thousand files each.
3) 5 million 25 KiB files spread across unique directories holding 500 files each.
4) 1 thousand 1 GiB files spread across 10 unique directories holding 100 files each.
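
A minimal shell sketch for generating a tree like test case 1, scaled down so it finishes in seconds (the `ppcp_bench` directory name and the tiny counts are placeholders; the real benchmark would use the 5-million-file, 25 KiB numbers above):

```shell
#!/bin/sh
# Scaled-down stand-in for benchmark tree 1: files_total files of
# size_kib KiB each, per_dir files per directory.
files_total=100
per_dir=20
size_kib=25

root=ppcp_bench
mkdir -p "$root"
i=0
while [ "$i" -lt "$files_total" ]; do
    dir="$root/d$((i / per_dir))"
    mkdir -p "$dir"
    # one random file of size_kib KiB
    dd if=/dev/urandom of="$dir/f$i" bs=1024 count="$size_kib" 2>/dev/null
    i=$((i + 1))
done
echo "created $files_total files under $root"
```

Scale the three variables up for the real runs, and time a sequential copy against a parallel one over the resulting tree.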

Run the test copy sequentially, then do a parallel file copy on 1) one mechanical hard drive, 2) NVMe, 3) the cloud. Then come back and tell me that your sequential copy accomplishes the task in the same amount of time. Your core code could be tweaked to use a parallel iterator with crossbeam, sending the results from the parallel workers to the tui/non-tui progress receiving channel.

I can appreciate that you have a preference for the TUI, but why not offer a non-TUI mode to legacy tool users who deem it better?

omac777 avatar Apr 16 '19 11:04 omac777

I know how to do benchmarking; you don't have to explain it to me. I was hoping you already had some benchmark results (or had read some somewhere); otherwise it sounds like your statements are pulled out of thin air.

I just did a quick test: https://gist.github.com/acidnik/1d418c7851e144ddf5ee89cba279780f

On my machine (a MacBook with an SSD) the numbers are the same (within the margin of error) for any $num_jobs. You can run it yourself and post your results here; maybe that will convince me.

Non-TUI: I firmly believe that legacy tool users can and should use cp and/or rsync for their legacy tasks. Adding this option to ppcp is just not worth the effort.

acidnik avatar Apr 17 '19 12:04 acidnik