nocache
nocache copied to clipboard
Reconsider use of fdatasync()
Nocache is a great way to prevent buffer cache pollution e.g. by rsync.
Unfortunately I just found out the hard way that it also totally destroys rsync performance with small-file workloads due to calling fdatasync() (via sync_if_writable() in free_unclaimed_pages()). Since this happens for every little file (incl. temp files!) on close(), performance becomes completely unacceptable. Not only is it dog slow, it is also not helpful for drive longevity.
I have for now preloaded libeatmydata in addition to nocache, which negates this effect and restores performance; however this should not be necessary. fadvise() works just fine without prior fsync() - buffer pages are flushed out in nice batches, and still marked for reuse.
Unless there is a technical reason that I'm unaware of please consider removing the use of fdatasync().
Hi,
fadvise() works just fine without prior fsync() - buffer pages are flushed out in nice batches, and still marked for reuse.
The posix_fadvise()
man page says regarding POSIX_FADV_DONTNEED
semantics:
Pages that have not yet been written out will be unaffected, so if the
application wishes to guarantee that pages will be released, it should
call fsync(2) or fdatasync(2) first.
My reading of this is that with a slow disk you would potentially get cache thrashing (open
, write
and close
before the disk even seeks to the position)… right?
Maybe we should provide an option/env var to make the fdatasync()
call optional, and set it to “make the call” by default to preserve current behaviour?
Julius
Addendum: Let me quote Tobi Oetiker on this:
As you can see we are calling fdatasync right before calling
posix_fadvise
, this makes sure that all data associated with the file handle has been committed to disk. This is not done because there is any danger of loosing data. But it makes sure that that theposix_fadvise
has an effect. Since theposix_fadvise
function is advisory, the OS will simply ignore it, if it can not comply. At least with Linux, the effect of callingposix_fadvise(fd,0,0,POSIX_FADV_DONTNEED)
is immediate. This means if you write a file and callposix_fadvise
right after writing a chunk of data, it will probably have no effect at all since the data in question has not been committed to disk yet, and therefore can not be released from cache.
Maybe we could add an option to postpone fdatasync()
and posix_fadvise()
right before we get out of the program, by registering files on some list and processing them in a bulk.
atexit()
(and close()
for a specific file) looks like the right candidate to do such a thing.
It’s a nice idea, but I wonder if it helps in this use case (lots of very small files). Do you propose to leave the fd’s open? In that case we will easily overflow somewhere if we just leave open the fd’s when the caller does a close()
on them, e.g. we will have thousands of fd’s but a FD_SETSIZE
is only 1024 for select()
. And if you close the fd on the caller’s request, how will you then “forget” the associated pages later on?
If the user calls close()
, we should indeed close it and not keep stale fds.
I haven't thought this through yet, but the general idea remains. I'm speaking my mind here:
- we keep the filenames not the fds (possible issue with deleted files, maybe we ignore it)
- we process files in batch at the "end"
- what happens when user calls
unlink()
? I suspect we have nothing to do.
ad 1., just considering the possibility – We have no clue about what happens to a file after we close the fd. We might be right in most cases by remembering the file name and then acting on a re-opened version of that. But how do you find the name of the file that was opened? Consider multiple hard links on the same inode etc. – I think this adds a lot of potential for errors or inconsistent behavior, while not really gaining much.
I know there are issues about that in general, but in this specific case (rsync and possibly tar and others), it makes sense. Also, this could be enabled only through some option, not by default.
how do you find the name of the file that was opened?
two ways:
- when we
open()
the file <-- better -
fstat()
on the fd
Regarding hardlinks, it doesn't matter which hardlink you use to reference the corresponding cached pages, they're physically the same. Am I missing something here?
@hhoffstaette : that said, isn't there a dontneed feature for rsync that's already available?
If the user calls close(), we should indeed close it and not keep stale fds.
FWIW: We could close it upon user request, but just before that use fcntl
and F_DUPFD
to duplicate it to a high number and clean it up later (or periodically, since “later” might mean billions of files later, which might be a problem).
@hhoffstaette: Any thoughs on the ideas discussed here…?
@Feh Sorr for the silence. Yes, lots of ideas and even started to hack on it a bit already. :) I needed to verify that a certain system call works as expected and think things through in terms of complexity, downsides etc. Essentially your last comment is on the right track, but we need to bound the accumulated work in terms of size and space. Will provide a summary soon - later today or over the weekend.
@noushi Yes, I had Tobi Oetiker's original --drop-cache patch in rsync, but it had the same behaviour (fsync for every file) and was more or less unmaintained/unhackable. Also the rsync maintainers continue to refuse to merge it - probably a good thing in hindsight, considering the bad performance implications. nocache with its library/wrapper script approach is more versatile.
I needed to verify that a certain system call works as expected and think things through in terms of complexity, downsides etc. Essentially your last comment is on the right track, but we need to bound the accumulated work in terms of size and space.
Of course. Patches and pull requests welcome! :-)
I'd like to use nocache in the server for rsync backups. There rsync will only read files, so it'd be best to use POSIX_FADV_NOREUSE, without any syncs of course. Therefore I think an option to chose the argument to posix_fadvise would be worth it.
@hhoffstaette Any new developments on your part? How would you think about a simple option that’ll just disable calling fsync()
?
Julius Plenz ([email protected]) wrote on 14 July 2014 12:48:
[1]@hhoffstaette Any new developments on your part? How would you think about a simple option that’ll just disable calling fsync()?
And the change to POSIX_FADV_NOREUSE instead of POSIX_FADV_DONTNEED. On the server rsync reads files only once.
And the change to POSIX_FADV_NOREUSE instead of POSIX_FADV_DONTNEED. On the server rsync reads files only once.
AFAICS, this would make the semantics correct, and that’s what nocache
is also trying. But beware:
/* Hint we'll be using this file only once;
* the Linux kernel will currently ignore this */
fadv_noreuse(fd, 0, 0);
As you can see in mm/fadvise.c, POSIX_FADV_NOREUSE
is a no-op. So doing it semantically correct would mean the tool essentially does nothing for the (Linux) user.
While looking at the (unrelated) current issue with glibc 2.28 I just realized I never commented on this. Since it's been >4 years I guess I might as well get over it.. :)
Long story short: I never ended up with the planned "complicated" solution of batching small-file writes together and doing syncfs() for a whole target fs, with a delayed fadvise() for the whole batch. Instead I just applied a patch that simply skips fsync() for small files < 8MB, and that fixed 99% of my performance problem while at the same time cleaning the page cache "well enough" after copying larger files.
If anybody cares, the commit in question is here.
Maybe this can be an inspiration for an official flag to reduce fsync() overhead. Given all my other commitments I don't see myself contributing to this anymore any time soon.