nocache Reconsider use of fdatasync()

Nocache is a great way to prevent buffer cache pollution e.g. by rsync.

Unfortunately I just found out the hard way that it also totally destroys rsync performance with small-file workloads due to calling fdatasync() (via sync_if_writable() in free_unclaimed_pages()). Since this happens for every little file (incl. temp files!) on close(), performance becomes completely unacceptable. Not only is it dog slow, it is also not helpful for drive longevity.

I have for now preloaded libeatmydata in addition to nocache, which negates this effect and restores performance; however this should not be necessary. fadvise() works just fine without prior fsync() - buffer pages are flushed out in nice batches, and still marked for reuse.

Unless there is a technical reason that I'm unaware of please consider removing the use of fdatasync().

Apr 30 '14 12:04 hhoffstaette

Hi,

fadvise() works just fine without prior fsync() - buffer pages are flushed out in nice batches, and still marked for reuse.

The posix_fadvise() man page says regarding POSIX_FADV_DONTNEED semantics:

   Pages that have not yet been written out will be unaffected, so if  the
   application  wishes to guarantee that pages will be released, it should
   call fsync(2) or fdatasync(2) first.

My reading of this is that with a slow disk you would potentially get cache thrashing (open, write and close before the disk even seeks to the position)… right?

Maybe we should provide an option/env var to make the fdatasync() call optional, and set it to “make the call” by default to preserve current behaviour?

Julius

Apr 30 '14 17:04 Feh

Addendum: Let me quote Tobi Oetiker on this:

As you can see we are calling fdatasync right before calling posix_fadvise, this makes sure that all data associated with the file handle has been committed to disk. This is not done because there is any danger of loosing data. But it makes sure that that the posix_fadvise has an effect. Since the posix_fadvise function is advisory, the OS will simply ignore it, if it can not comply. At least with Linux, the effect of calling posix_fadvise(fd,0,0,POSIX_FADV_DONTNEED) is immediate. This means if you write a file and call posix_fadvise right after writing a chunk of data, it will probably have no effect at all since the data in question has not been committed to disk yet, and therefore can not be released from cache.

Apr 30 '14 17:04 Feh

Maybe we could add an option to postpone fdatasync() and posix_fadvise() right before we get out of the program, by registering files on some list and processing them in a bulk. atexit() (and close() for a specific file) looks like the right candidate to do such a thing.

Apr 30 '14 20:04 noushi

It’s a nice idea, but I wonder if it helps in this use case (lots of very small files). Do you propose to leave the fd’s open? In that case we will easily overflow somewhere if we just leave open the fd’s when the caller does a close() on them, e.g. we will have thousands of fd’s but a FD_SETSIZE is only 1024 for select(). And if you close the fd on the caller’s request, how will you then “forget” the associated pages later on?

Apr 30 '14 20:04 Feh

If the user calls close(), we should indeed close it and not keep stale fds.

I haven't thought this through yet, but the general idea remains. I'm speaking my mind here:

we keep the filenames not the fds (possible issue with deleted files, maybe we ignore it)
we process files in batch at the "end"
what happens when user calls unlink()? I suspect we have nothing to do.

Apr 30 '14 21:04 noushi

ad 1., just considering the possibility – We have no clue about what happens to a file after we close the fd. We might be right in most cases by remembering the file name and then acting on a re-opened version of that. But how do you find the name of the file that was opened? Consider multiple hard links on the same inode etc. – I think this adds a lot of potential for errors or inconsistent behavior, while not really gaining much.

Apr 30 '14 21:04 Feh

I know there are issues about that in general, but in this specific case (rsync and possibly tar and others), it makes sense. Also, this could be enabled only through some option, not by default.

how do you find the name of the file that was opened?

two ways:

when we open() the file <-- better
fstat() on the fd

Regarding hardlinks, it doesn't matter which hardlink you use to reference the corresponding cached pages, they're physically the same. Am I missing something here?

@hhoffstaette : that said, isn't there a dontneed feature for rsync that's already available?

Apr 30 '14 21:04 noushi

If the user calls close(), we should indeed close it and not keep stale fds.

FWIW: We could close it upon user request, but just before that use fcntl and F_DUPFD to duplicate it to a high number and clean it up later (or periodically, since “later” might mean billions of files later, which might be a problem).

@hhoffstaette: Any thoughs on the ideas discussed here…?

May 08 '14 22:05 Feh

@Feh Sorr for the silence. Yes, lots of ideas and even started to hack on it a bit already. :) I needed to verify that a certain system call works as expected and think things through in terms of complexity, downsides etc. Essentially your last comment is on the right track, but we need to bound the accumulated work in terms of size and space. Will provide a summary soon - later today or over the weekend.

May 09 '14 08:05 hhoffstaette

@noushi Yes, I had Tobi Oetiker's original --drop-cache patch in rsync, but it had the same behaviour (fsync for every file) and was more or less unmaintained/unhackable. Also the rsync maintainers continue to refuse to merge it - probably a good thing in hindsight, considering the bad performance implications. nocache with its library/wrapper script approach is more versatile.

May 09 '14 09:05 hhoffstaette

I needed to verify that a certain system call works as expected and think things through in terms of complexity, downsides etc. Essentially your last comment is on the right track, but we need to bound the accumulated work in terms of size and space.

Of course. Patches and pull requests welcome! :-)

May 09 '14 09:05 Feh

I'd like to use nocache in the server for rsync backups. There rsync will only read files, so it'd be best to use POSIX_FADV_NOREUSE, without any syncs of course. Therefore I think an option to chose the argument to posix_fadvise would be worth it.

Jul 11 '14 23:07 carlos-carvalho

@hhoffstaette Any new developments on your part? How would you think about a simple option that’ll just disable calling fsync()?

Jul 14 '14 19:07 Feh

Julius Plenz ([email protected]) wrote on 14 July 2014 12:48:

[1]@hhoffstaette Any new developments on your part? How would you think about a simple option that’ll just disable calling fsync()?

And the change to POSIX_FADV_NOREUSE instead of POSIX_FADV_DONTNEED. On the server rsync reads files only once.

Jul 14 '14 22:07 carlos-carvalho

And the change to POSIX_FADV_NOREUSE instead of POSIX_FADV_DONTNEED. On the server rsync reads files only once.

AFAICS, this would make the semantics correct, and that’s what nocache is also trying. But beware:

/* Hint we'll be using this file only once;
 * the Linux kernel will currently ignore this */
fadv_noreuse(fd, 0, 0);

As you can see in mm/fadvise.c, POSIX_FADV_NOREUSE is a no-op. So doing it semantically correct would mean the tool essentially does nothing for the (Linux) user.

Jul 15 '14 05:07 Feh

While looking at the (unrelated) current issue with glibc 2.28 I just realized I never commented on this. Since it's been >4 years I guess I might as well get over it.. :)

Long story short: I never ended up with the planned "complicated" solution of batching small-file writes together and doing syncfs() for a whole target fs, with a delayed fadvise() for the whole batch. Instead I just applied a patch that simply skips fsync() for small files < 8MB, and that fixed 99% of my performance problem while at the same time cleaning the page cache "well enough" after copying larger files.

If anybody cares, the commit in question is here.

Maybe this can be an inspiration for an official flag to reduce fsync() overhead. Given all my other commitments I don't see myself contributing to this anymore any time soon.

Dec 01 '18 14:12 hhoffstaette

nocache nocache copied to clipboard

Reconsider use of fdatasync()

nocache
nocache copied to clipboard