
make system more responsive by using fadvise DONTNEED

Open chrysn opened this issue 9 years ago • 39 comments

while attic is running, file system access is slowed down for the rest of the system. that can be expected, but its effects could be mitigated if attic used posix_fadvise(POSIX_FADV_DONTNEED) on the files it is backing up. this tells the operating system that the "data will not be accessed in the near future" (man 2 posix_fadvise).

this should minimize the amount of disk cache contents dropped from ram to accommodate attic's reads, while not slowing down attic. (it won't read over the same file itself (will it?), but the kernel can't know that without being told, and might keep the files around just in case attic wants to look at them again).
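A minimal sketch of the idea in Python, not attic's actual code: read a file once, then advise the kernel that its pages are no longer needed. The function name `read_once` and the chunk size are assumptions for illustration; `os.posix_fadvise` needs Python 3.3 or later on a Unix platform.

```python
import os


def read_once(path, chunk_size=1024 * 1024):
    """Read a file sequentially once, then mark its cached pages as not needed.

    Returns the number of bytes read.
    """
    total = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        while True:
            data = os.read(fd, chunk_size)
            if not data:
                break
            total += len(data)  # a backup tool would chunk and hash the data here
        if hasattr(os, "posix_fadvise"):
            # offset=0 and len=0 mean "from the start to the end of the file"
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)
    return total
```

The advice is given after the reads because, as the man page wording suggests, it concerns data the application does not expect to touch again.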

chrysn avatar Mar 24 '15 16:03 chrysn

fadvise DONTNEED basically tells the kernel that the data in the cache is not needed anymore, but the data is already in the cache at that point. If the data is actually read only once, I think the solution would be to bypass the kernel cache by opening the file with O_DIRECT.

aurel32 avatar Apr 04 '15 14:04 aurel32

https://github.com/ThomasWaldmann/attic/commits/o_direct I did some O_DIRECT changes there (read the commit comments). Somehow I still see the cache growing rather quickly - I suspect it is due to writes (I only changed input file reads to use O_DIRECT).

Note: I gave up the O_DIRECT route. It is just a pain to use due to the alignment limitations imposed by O_DIRECT and python not supporting that.

ThomasWaldmann avatar Apr 08 '15 21:04 ThomasWaldmann

See PR #279 for posix_fadvise based solution, it works (on linux, py >= 3.3). \o/

Note: With py 3.2, the repo writes will still spoil the cache as 3.2 does not have os.posix_fadvise. The input data reads won't spoil the cache though as that is implemented in C and independent of Python version.
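A rough sketch of what the write side could look like, assuming a hypothetical helper `write_and_forget` (this is not the code from the PR). It guards on the presence of `os.posix_fadvise` so that older Pythons simply skip the advice, and it calls `fsync` first because Linux only drops clean pages on DONTNEED.

```python
import os

# os.posix_fadvise was added in Python 3.3 and only exists on some platforms
HAVE_FADVISE = hasattr(os, "posix_fadvise")


def write_and_forget(path, data):
    """Write data to path, flush it to disk, then advise DONTNEED if possible.

    Returns True if the advice was given, False if fadvise is unavailable.
    """
    with open(path, "wb") as f:
        f.write(data)
        f.flush()
        # Dirty pages are not dropped by DONTNEED, so write them out first.
        os.fsync(f.fileno())
        if HAVE_FADVISE:
            os.posix_fadvise(f.fileno(), 0, 0, os.POSIX_FADV_DONTNEED)
    return HAVE_FADVISE
```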

ThomasWaldmann avatar Apr 11 '15 00:04 ThomasWaldmann

Is POSIX_FADV_DONTNEED really what we want? Just because we know that we will not need a specific piece of data again it is not our business to tell the kernel to remove it from the cache. We have no way of knowing if the data was originally loaded by us or by another process and how actively used it is.

Has anyone checked what (if any) posix_fadvise settings are used by other backup solutions (in the default configuration)?

jborg avatar Apr 13 '15 21:04 jborg

from http://linux.die.net/man/2/posix_fadvise : """ Programs can use posix_fadvise() to announce an intention to access file data in a specific pattern in the future, thus allowing the kernel to perform appropriate optimizations.

The advice applies to a (not necessarily existent) region starting at offset and extending for len bytes (or until the end of the file if len is 0) within the file referred to by fd. The advice is not binding; it merely constitutes an expectation on behalf of the application. """

The phrasing "programs can ... announce an intention" and "constitutes an expectation on behalf of the application" rather clearly means to me that the scope of this is "application", not "system-wide". So the advice of the application "dontneed" is correct in our case.

I didn't do specific performance measurements, but I watched how the cache behaved:

  • without fadvise, attic blows up the cache to occupy almost all memory and I'd bet it kills a lot of cached content of other applications all the time as long as it is running.
  • with fadvise, the cache doesn't grow, impact is minimal, and speed is about the same. I'd expect that if another application running in parallel needed some cache, it would get and keep it most of the time - even if the non-local behaviour you suspect happened for a little while (== after we called fadvise for that application's files).

ThomasWaldmann avatar Apr 13 '15 23:04 ThomasWaldmann

As far as the specification goes, I'd agree with your interpretation. The actual implementation is, however, what ultimately counts. From what I could tell, Linux currently responds to a DONTNEED fadvise by immediately invalidating the pages, regardless of their use by any other process. One could argue that this isn't in the spirit of the specification of posix_fadvise(), but that doesn't change the fact that such a behavior is undesirable for any backup program.

The issue with your observation of cache usage is that you can't infer anything from it. Assume for a moment that DONTNEED does in fact evict cache pages: Would anything change in your observation?

dnnr avatar Apr 14 '15 06:04 dnnr

Rsync does use fadvise.

This page seems relevant: http://insights.oetiker.ch/linux/fadvise.html

adept avatar Apr 14 '15 07:04 adept

Rsync does use fadvise. This page seems relevant: http://insights.oetiker.ch/linux/fadvise.html

This page seems to talk about a patch for rsync. Has this been accepted upstream?

jborg avatar Apr 15 '15 21:04 jborg

I believe that in the middle of that page it says "the patch has been accepted upstream"

adept avatar Apr 15 '15 21:04 adept

Hmm. Maybe I spoke too soon. Looking at https://tobi.oetiker.ch/patches/, there are patches for several rsync versions there, so maybe it remained a patch...

adept avatar Apr 15 '15 21:04 adept

It is pretty trivial to test. Download rsync and check. (apt-get source rsync; grep -R fadvice rsync-3.1.0)

Nope, not in rsync here.

wscott avatar Apr 15 '15 22:04 wscott

you need to grep again for fadvise (with "s").

ThomasWaldmann avatar Apr 15 '15 22:04 ThomasWaldmann

Heh, well that is embarrassing. However the typo was in the email, I had searched for the right string. Still not there.

wscott avatar Apr 16 '15 02:04 wscott

take a look at the bup side before jumping onto that ship. i heard they discovered performance problems where such policies would actually remove good contents from the cache that were unrelated to backups, seriously impacting performance on production servers.

it is quite possible that POSIX_FADV_DONTNEED actually removes good pages from the cache! this could be a serious problem for database servers for example.

anarcat avatar Jun 01 '15 17:06 anarcat

@anarcat do you have some more specific info? fadvise acts on an open file handle (that belongs to the specific file opened by the backup process for reading).

I could imagine that a simplistic fadvise kernel implementation kills the cached blocks of THAT file for all processes, but even that would be better than not using fadvise because of the lower cache flooding pressure of all the files that are not used by any other process and that do not end up / remain in the cache when using fadvise.

ThomasWaldmann avatar Jun 01 '15 17:06 ThomasWaldmann

the #bup people were kind enough to send me a few refs:

https://www.percona.com/blog/2010/04/02/fadvise-may-be-not-what-you-expect/ https://groups.google.com/forum/#!topic/bup-list/7D9b2at3MMc https://groups.google.com/forum/#!topic/bup-list/nQ24WCT1g4E

it's still subject to discussion on the bup mailing list, but please do be careful about this - i don't believe it is process-specific...

it's nice to optimise attic: but if it's done at the expense of the rest of the system that is being backed up, that doesn't sound like a good tradeoff. :)

anarcat avatar Jun 01 '15 18:06 anarcat

apparently, the issue came up in this thread:

https://groups.google.com/d/msg/bup-list/TXfSAgD9-ZM/saofDu1CdxcJ

where bup would trash the sqlite cache of a big file, which had to be reloaded in memory, which was basically breaking the site...

anarcat avatar Jun 01 '15 18:06 anarcat

@anarcat I looked through the first 3 links. Lots of guessing and gut feelings (I can do that too, and I even posted reasons why I think it is good, while they didn't really give reasons why they think it's worse than without).

I didn't find anything about fadvise in the google groups link of the previous post, did you post the wrong URL?

ThomasWaldmann avatar Jun 01 '15 21:06 ThomasWaldmann

the last link was where they discovered the issue apparently.

for me it makes sense that trashing the cache will have a performance impact. when you load a page in the kernel VM and tell the kernel to drop it when you close the FD, it will drop the page - it seems logical to me. the fact that another process was using it at the same time probably doesn't change anything.

but that's just me.

anarcat avatar Jun 01 '15 21:06 anarcat

See my April 14 comment.

ThomasWaldmann avatar Jun 01 '15 22:06 ThomasWaldmann

It shouldn't be too hard to test this experimentally. E.g., on a system with 8GB RAM, create eight 1GB files named file1 through file8 as well as a 4GB file named big. Then do:

sync; echo 3 > /proc/sys/vm/drop_caches; sync
time dd if=big of=/dev/null  [should be slow]
time dd if=big of=/dev/null  [should be fast]
use attic to backup file1 file2 ... file8
time dd if=big of=/dev/null  [might be slow]

Then do exactly the same thing, but using a version of attic patched to use fadvise. Hopefully the last line would remain fast.

To address the concerns about losing data that you want in the cache, run the test a third time, this time including the file big in the attic backup (listed first?). Hopefully the last line would remain fast.

For the record, my instinct is with TW here. Why would the kernel provide this feature if it could trash the cache used by other processes? But the only way to know is to test. And from what I've read, it may depend on kernel version, so multiple testers is probably a good idea. Maybe someone can provide a program that does the analog of cat > /dev/null with a flag indicating whether to use fadvise, to make it easier to test this cleanly?
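One possible shape for such a test program, sketched in Python under stated assumptions: the names `drain` and `main` and the `--fadvise` flag are made up for illustration and are not part of attic. It reads each file and discards the data, like `cat FILE > /dev/null`, and optionally advises DONTNEED afterwards.

```python
import argparse
import os


def drain(path, use_fadvise, chunk_size=1 << 20):
    """Read path to the end, discard the data, and return the byte count."""
    total = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        while True:
            data = os.read(fd, chunk_size)
            if not data:
                break
            total += len(data)
        if use_fadvise and hasattr(os, "posix_fadvise"):
            # len=0 means "until the end of the file"
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)
    return total


def main(argv=None):
    parser = argparse.ArgumentParser(description="cat > /dev/null, optionally with fadvise")
    parser.add_argument("files", nargs="*", help="files to read and discard")
    parser.add_argument("--fadvise", action="store_true",
                        help="call posix_fadvise(POSIX_FADV_DONTNEED) after reading")
    args = parser.parse_args(argv)
    return sum(drain(path, args.fadvise) for path in args.files)


if __name__ == "__main__":
    print(main(), "bytes read")
```

In the procedure above, this would stand in for the attic run: time `dd if=big of=/dev/null` before and after draining file1 through file8, once with and once without `--fadvise`.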

jdchristensen avatar Jun 01 '15 23:06 jdchristensen

This article provides a very good explanation of the complexity of using fadvise on Linux: http://insights.oetiker.ch/linux/fadvise.html

It is indeed the case that FADV_DONTNEED will purge the file from the cache immediately if it is not dirty (and will do nothing if it is dirty).

I agree that this behavior isn't very helpful, but that is how it is.

It seems to me that the mincore hack in the article is not worth using.

jbms avatar Jun 02 '15 03:06 jbms

@jdchristensen that test just shows whether fadvise dontneed removes the file from the cache (or not). while that is a bit interesting, more interesting is comparing the effect of permanently flooding the cache with a lot of data only needed once (attic without fadvise) vs. not flooding the cache (attic with fadvise).

@jbms I've read that article back then (but didn't want to put a lot of [C] code, like shown there, for a maybe negligible effect).

ThomasWaldmann avatar Jun 02 '15 08:06 ThomasWaldmann

@ThomasWaldmann I proposed running attic three times. The difference between run 1 and run 2 would exactly show that not flooding the cache gives an improvement for other applications (dd, in this case). The difference between run 1 and run 3 would show whether fadvise removes a file from the cache that was already there. Both bits of information seem important for this discussion.

If the information at http://insights.oetiker.ch/linux/fadvise.html is correct, it does seem like the mincore hack would be good to use.

jdchristensen avatar Jun 02 '15 14:06 jdchristensen

:-1: We shouldn't DONTNEED the user's files. bup reverted this to fix a reported bug. We still haven't gathered any positive proof, and Linux was fixed in 2011:

  • https://lkml.org/lkml/2011/2/21/56
    • (found via un-merged patch https://lwn.net/Articles/480930/)
  • https://github.com/torvalds/linux/commit/278df9f451dc71dcd002246be48358a473504ad0

It's possible Linux can still be improved - as suggested by the un-merged patch which implements NOREUSE as a gentler alternative. (Or that it's regressed :).

Everyone has this problem. If we don't have the resources to test this properly (or implement the mincore hack), that means we just don't have the resources to do this.

If we had any reason to be worried in the first place, we could keep the DONTNEED on attic files only. It shouldn't hurt anyone else; it'll hurt us but probably not where we care. It could account for about half the cache buildup (when we're not working with virtual machine image files or similar).

sourcejedi avatar Aug 13 '15 12:08 sourcejedi

Sorry, but I can't follow what you wanted to say.

But I am doing the fadvise DONTNEED thing in borg (after practically seeing beneficial effects), so comparing attic vs. borg (or borg with/without that call) should be easy.

I'll change things / accept pull requests for borg as soon as there is practical proof that change is needed / beneficial.

ThomasWaldmann avatar Aug 13 '15 21:08 ThomasWaldmann

@ThomasWaldmann Since you have both attic and borg with fadvise handy, you could run the three tests I proposed and see if there is a problem.

jdchristensen avatar Aug 13 '15 21:08 jdchristensen

@jdchristensen here are the results, completely as expected for me.

http://paste.thinkmo.de/rNXe3Mmm#fadv_test.txt

The problem is just that they are not that helpful in deciding whether fadvise DONTNEED is helpful or not.

With fadvise DONTNEED, they show that the cache isn't killed by the backup process (as expected). Without fadvise DONTNEED, they show that backing up some stuff kills the cache for everything else.

With fadvise DONTNEED, it kills files from the cache which are backed up (as expected, due to the simple implementation in linux). Without fadvise DONTNEED, it does not kill files from the cache which are backed up (also as expected).

So one might think one is as good as the other, but I still think fadvise DONTNEED is way better: it avoids flooding the cache with useless data (potentially for hours), and that benefit is much bigger than the harm done by dropping the currently backed-up file from the cache when that file is in use (it can be cached again a second later).

The tests you proposed can't show that, though (and any simple test might be a bit unrealistic compared to real system behaviour).

ThomasWaldmann avatar Aug 17 '15 19:08 ThomasWaldmann

hmm... well, the problem described earlier is specifically with stuff like mysql databases that get totally flushed out of the cache, having a major performance impact on the whole system. from what i understand, that performance problem is confirmed by those tests?

since this is a corner case (and linux doesn't deal well with this (yet)), maybe we should make DONTNEED optional somehow?

anarcat avatar Aug 17 '15 22:08 anarcat

For what it's worth, I've found rsync also causes linux to fragment memory like there's no tomorrow, creating cache entries for tiny files and then releasing some (but not all) of them over the next few hours. If you're not going to use the data again right away, it makes sense to avoid that.

YMMV, but you might check system memory usage: a significant proportion of free memory plus lots of small blocks in /proc/buddyinfo, which compact_memory can't compact but which do coalesce when you drop_caches, is a sign of fragmentation wastage. Note however that fragmentation can take time to build up - days in some cases, depending on how actively the system is used.
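To make the /proc/buddyinfo check a bit more concrete, here is a small sketch that parses the file's text and estimates how much of the free memory sits in small blocks. The helper names and the `max_order` threshold are assumptions for illustration; what counts as "small" is a judgment call.

```python
def parse_buddyinfo(text):
    """Parse /proc/buddyinfo text into {(node, zone): [free block counts by order]}."""
    result = {}
    for line in text.splitlines():
        fields = line.split()
        # Expected shape: "Node 0, zone Normal 4 2 1 0 ..."
        if len(fields) < 5 or fields[0] != "Node":
            continue
        node = int(fields[1].rstrip(","))
        zone = fields[3]
        result[(node, zone)] = [int(n) for n in fields[4:]]
    return result


def small_block_share(counts, max_order=2):
    """Fraction of free pages held in blocks of order <= max_order.

    A free block of order n spans 2**n pages.
    """
    pages = [count << order for order, count in enumerate(counts)]
    total = sum(pages)
    return sum(pages[: max_order + 1]) / total if total else 0.0
```

On a Linux system the input would come from `open("/proc/buddyinfo").read()`; a share close to 1.0 in a zone means nearly all free memory there is in small blocks.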

GreenReaper avatar Aug 18 '15 17:08 GreenReaper

fadvise SEQUENTIAL is supposed to help as well, since about 2009. Unfortunately not - at least not as well as DONTNEED. Negative results published for any other project looking at this :).

I guess the DB case was a worst-case problem because there's suddenly a lot that needs reading back, and the reads will be very random (lots of disk seeks). Btw with DONTNEED we're purging the entire file after reading each chunk - that "live database" case is really going to hate us. (Not that I think it was a good case; for a database he should really have been using LVM snapshots).

The internet says you can hack O_DIRECT reads from pure python (using mmap and readinto). I think the real annoyance would be needing manual buffering and a readahead thread. If it wasn't for the threading I'd be pretty eager to code it.

mincore() looks pretty ugly especially with the mmap(). Maybe the performance side isn't too bad, if you avoid actually touching any of the pages (and you can batch the calls up a bit).

sourcejedi avatar Aug 19 '15 13:08 sourcejedi

@ThomasWaldmann Thanks for doing the experiment! Now we know for sure how linux handles DONTNEED (at least for your kernel version). It's really unfortunate that DONTNEED kills things that were already in the cache, but that's life.

I suspect that for most uses, using DONTNEED will be much better. But I suspect that some use cases might suffer, so providing a command line option to disable it seems reasonable (as @anarcat suggested).

jdchristensen avatar Aug 19 '15 15:08 jdchristensen

@sourcejedi i also tried using O_DIRECT, but that was a total pain.

ThomasWaldmann avatar Aug 19 '15 19:08 ThomasWaldmann

@ThomasWaldmann, @jdchristensen: My results, running it (Borg, 8cf0ead693) on a HDD via USB3, Intel Core M 5Y10 using Fedora 22: https://gist.github.com/pguth/481980bd67993984eda4

perguth avatar Aug 20 '15 10:08 perguth

Testing this way the newest code ("call fadvise DONTNEED for the byterange we actually have read") I got these results (before/after): https://gist.github.com/pguth/4b436cf15c58549cbc4d/revisions
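For readers wondering what "DONTNEED for the byte range we actually have read" might look like, here is a hedged sketch. It is not the borg code, and `read_chunks` is a hypothetical name. Each chunk is advised away only after the consumer is done with it, and only for the range that was just read.

```python
import os


def read_chunks(fd, chunk_size=1 << 20):
    """Yield chunks from fd, advising DONTNEED for each byte range after use."""
    offset = 0
    while True:
        data = os.read(fd, chunk_size)
        if not data:
            break
        yield data  # the caller processes the chunk before we resume here
        if hasattr(os, "posix_fadvise"):
            # Only the range just read, not the whole file.
            os.posix_fadvise(fd, offset, len(data), os.POSIX_FADV_DONTNEED)
        offset += len(data)
```

Per the Linux man page, requests to discard partial pages are ignored, so a chunk size that is a multiple of the page size keeps the advised ranges effective.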

perguth avatar Aug 20 '15 12:08 perguth

Right, so my concern in #158 wasn't an issue (<1%), and it was tested on a HDD over USB (so low-speed io). Another entry for the journal of negative results, good work everyone :).

I'm surprised. I guess this HDD might actually do enough read-ahead internally, or the kernel code we're using doesn't work how I thought. (I still like not hammering DONTNEED multiple times if someone else is reading the file too :).

sourcejedi avatar Aug 20 '15 13:08 sourcejedi

@sourcejedi yep, just had the same idea. :)

ThomasWaldmann avatar Aug 20 '15 15:08 ThomasWaldmann

I think the "cleaning cache of db files when using DONTNEED" problem arises because people are doing things the wrong way. Why do they back up a running database? If you need a consistent backup, please either:

  • stop the db and perform the backup OR
  • perform a DUMP and back up that one

You don't want to back up the live, working db because of possible inconsistencies. If you do it the right way (dump or db halt), there will be no problems with DONTNEED.

neutrinus avatar Sep 02 '15 08:09 neutrinus

@neutrinus very true! I already thought the same, but did not document it yet.

ThomasWaldmann avatar Sep 02 '15 11:09 ThomasWaldmann