hydrus ClientImportFiles.py::FileImportJob.DoWork() performs unnecessary file copy

shutil.move() automatically falls back to its copy_function (a parameter available as of Python 3.5) if the plain move fails, e.g. because src and dst are not on the same volume. So MergeFile() should be called here to perform a move or copy as required. If a client's tmpdir and its storage are on the same volume, this saves a copy.
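A minimal sketch of the move-or-copy behavior described above. The name `merge_file` is hypothetical and is not hydrus's `HydrusPaths.MergeFile()`; it only illustrates what `shutil.move()` does with `copy_function`:

```python
import shutil


def merge_file(source: str, dest: str) -> None:
    """Move source to dest (hypothetical helper, not the hydrus implementation).

    shutil.move() first tries a rename, which is cheap when both paths are on
    the same volume. If the rename fails, it copies the file with
    copy_function and then removes the source.
    """
    # copy2 is already the default copy_function; it is spelled out here
    # only to make the fallback explicit.
    shutil.move(source, dest, copy_function=shutil.copy2)
```

Either way the source is gone afterwards and the destination holds the data, so a caller does not need to know whether the two paths share a volume.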

https://github.com/hydrusnetwork/hydrus/blob/18517d8a73b7fecaa7a5c8dcf61d30ccbfcb9b51/hydrus/client/importing/ClientImportFiles.py#L174 FileImportJob.DoWork() calls ClientFilesManager.AddFile(), which calls HydrusPaths.MirrorFile() instead of HydrusPaths.MergeFile(): https://github.com/hydrusnetwork/hydrus/blob/18517d8a73b7fecaa7a5c8dcf61d30ccbfcb9b51/hydrus/client/ClientFiles.py#L176

To make this modification you must first fix the copy_function setting of MergeFile(), as noted in issue https://github.com/hydrusnetwork/hydrus/issues/989. AddFile() probably also has other callers that may depend on the copying behavior, so maybe FileImportJob should use a different function rather than patching AddFile().

Of course, users with quite large collections who keep their data files on different volumes are unlikely to see an uptick in speed unless they also set the tempdir to that volume, which can confuse sqlite and so is inadvisable without the ability to distinguish the download tmpdir from the system tmpdir. But that could become possible in the future.
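Whether a given setup would benefit can be checked by comparing device IDs. This is a rough sketch with a hypothetical helper name, `on_same_volume`; it assumes both paths already exist and that `st_dev` is a reliable volume identifier on the platform in question:

```python
import os


def on_same_volume(path_a: str, path_b: str) -> bool:
    """Return True if both existing paths report the same device ID.

    Hypothetical helper for illustration: when this is True, a rename
    between the two locations can avoid copying the file data.
    """
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev
```

For example, calling it with the temp directory and the file storage directory would indicate whether a move between them can be a plain rename.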

Oct 17 '21 09:10 bbappserver

I put a bit of time into this this week and did a little refactoring regarding temp files in general. One small issue is my current temp_path routine that I use in many places, like here:

https://github.com/hydrusnetwork/hydrus/blob/18517d8a73b7fecaa7a5c8dcf61d30ccbfcb9b51/hydrus/client/ClientDownloading.py#L310

Along with the temp path, it grabs the os_file_handle, which is a low-level thing that comes atomically with it. Maybe it is safe to close that early, I am not sure, but in either case I don't clean that up until all work is done. So in order to do what you want here and be able to move the temp file to real storage, I'll need to adjust all of the places I do this so that I safely close that handle before the move.
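A rough sketch of the "close the handle, then move" ordering, assuming the temp path and OS-level handle come from something like `tempfile.mkstemp()` (the thread does not say which call hydrus uses). The function name `save_via_temp_file` is hypothetical:

```python
import os
import shutil
import tempfile


def save_via_temp_file(data: bytes, dest_path: str) -> None:
    """Write data to a temp file, close its OS handle, then move it to dest_path."""
    # mkstemp() returns an open OS-level file descriptor together with the path.
    os_file_handle, temp_path = tempfile.mkstemp()
    try:
        # os.fdopen() wraps the descriptor; leaving the with-block closes it,
        # so no handle is still open when the move happens.
        with os.fdopen(os_file_handle, "wb") as f:
            f.write(data)
        shutil.move(temp_path, dest_path)
    finally:
        # Only reached with a leftover file if the write or the move failed.
        if os.path.exists(temp_path):
            os.remove(temp_path)
```

Closing before the move matters most on Windows, where renaming a file that still has an open handle typically fails.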

Rather than hack it all together, I'll do a proper job here and update FileImportJob to take responsibility for eating the temp file, so this will be a little delayed.

Oct 27 '21 00:10 hydrusnetwork

@hydrusnetwork Having moved to Linux, my /tmp (tmpfs) now lives in RAM, so on thinking further this is less of a big deal. It is also not that big a deal on systems that literally write a tmp file, since it should still be in the I/O cache: it will still be written twice and deleted, but at least when you do the copy after writing, it will still be in RAM (or at least mostly be in RAM if it's a big-ass file).

Obviously it makes the network engine more complex if you have to write tmp files to multiple places, and even more complex if you decide to parallelize it. So even though I advocate for not writing needlessly, don't feel like it is that big a priority.

May 13 '22 09:05 bbappserver