ARROW-14635: [Python][C++] implement fadvise

Open marsupialtail opened this issue 2 years ago • 4 comments

After https://github.com/apache/arrow/pull/13640, it seems O_DIRECT is not a good idea, so let's use posix_fadvise to control the page cache and address the issue described in https://issues.apache.org/jira/browse/ARROW-14635.
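
For context, the write path boils down to something like the following (a minimal Python sketch of the idea, not the PR's actual C++ implementation; note that POSIX_FADV_DONTNEED can only drop pages that have already been written back, which is why a sync step is involved):

import os

CHUNK = 1024 * 1024  # 1 MiB per write
N = 1024 * 10

fd = os.open("bump", os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
buf = b"1" * CHUNK
offset = 0
for _ in range(N):
    os.write(fd, buf)
    # Flush the newly written range, then advise the kernel to evict it
    # from the page cache.
    os.fdatasync(fd)
    os.posix_fadvise(fd, offset, CHUNK, os.POSIX_FADV_DONTNEED)
    offset += CHUNK
os.close(fd)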

To test it, use the simple Python script below:

import pyarrow.fs as fs

SIZE = 1024 * 1024   # 1 MiB per write
N = 1024 * 10        # ~10 GiB total

# da = fs.LocalFileSystem(reuse=False)  # this turns on fadvise and keeps the writes out of the page cache
da = fs.LocalFileSystem(reuse=True)
s = da.open_output_stream("bump")
a = bytes("1", "utf-8") * SIZE
for i in range(N):
    s.write(a)
s.close()

marsupialtail avatar Jul 20 '22 21:07 marsupialtail

https://issues.apache.org/jira/browse/ARROW-14635

github-actions[bot] avatar Jul 20 '22 21:07 github-actions[bot]

The failing checks are because posix_fadvise isn't available on Windows/macOS; I have gated the code accordingly.

marsupialtail avatar Jul 21 '22 01:07 marsupialtail

For the record, I'll be out until the end of next week, but this PR doesn't strike me as high priority ;-)

pitrou avatar Jul 21 '22 17:07 pitrou

Addressed all documentation-related comments.

marsupialtail avatar Aug 04 '22 20:08 marsupialtail

Actually, I misunderstood the requirements. I think synchronous writing is not required in this use case, so it is okay to use O_DIRECT without O_SYNC and let the SSD's write cache speed up the writes. The writes will not be persisted immediately, but they will stay out of the page cache, which achieves the first objective listed in the JIRA: reducing memory usage.

In this case, O_DIRECT without O_SYNC is nearly 20x faster than fadvise + O_SYNC on my system. fadvise without O_SYNC fails to reduce page cache memory usage. (Note: I still sync upon closing the file.)
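
For illustration, here is a rough Python sketch of the O_DIRECT-without-O_SYNC variant (hypothetical, Linux-only, not the Arrow implementation; the "bump_direct" filename is made up): open with O_DIRECT, write block-aligned buffers so the direct I/O constraints are satisfied, and sync only once when closing.

import mmap
import os

CHUNK = 1024 * 1024  # must be a multiple of the device block size for O_DIRECT
N = 1024 * 10

# O_DIRECT also requires an aligned buffer; anonymous mmap memory is page-aligned.
buf = mmap.mmap(-1, CHUNK)
buf.write(b"1" * CHUNK)

fd = os.open("bump_direct", os.O_WRONLY | os.O_CREAT | os.O_TRUNC | os.O_DIRECT, 0o644)
for _ in range(N):
    os.write(fd, buf)  # bypasses the page cache, but is not synchronous
os.fsync(fd)           # persist once, on close
os.close(fd)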

I recommend we revert to this PR: https://github.com/apache/arrow/pull/13640.

marsupialtail avatar Aug 15 '22 19:08 marsupialtail

experiments.zip

Uploaded some code that can be used for benchmarking.

marsupialtail avatar Aug 15 '22 19:08 marsupialtail

fadvise without O_SYNC fails to reduce page cache memory usage

Still the same question: how is that a problem?

pitrou avatar Aug 16 '22 09:08 pitrou

Still the same question: how is that a problem?

@pitrou Reducing the page cache memory usage is the goal of this feature. Per: https://issues.apache.org/jira/browse/ARROW-14635

The goal would be to allow for writing a large dataset without significantly impacting the server (this is important if the server is a desktop / laptop being actively used). Filling the page cache with dirty pages leads to unnecessary swapping of active user processes.

westonpace avatar Aug 16 '22 16:08 westonpace

Filling the page cache with dirty pages leads to unnecessary swapping of active user processes.

The pages shouldn't be dirty if they have been written out, should they? Also, does the swapping also occur with fadvise?

pitrou avatar Aug 16 '22 16:08 pitrou

Let's review our options here (a rough mapping to open flags follows the list):

  1. O_DIRECT with O_SYNC. Horribly slow, with the same problems as O_DIRECT alone; not worth it if you don't need to persist every write.
  2. O_DIRECT without O_SYNC. This is my preferred option. It does not persist each write immediately to the SSD, but it uses the drive's cache to keep data out of the page cache and save memory. It does not offer persistence for fault tolerance, but it saves memory, which is what we need here.
  3. fadvise with O_SYNC. Horribly slow (> 15x slower than option 2 on my machine).
  4. fadvise without O_SYNC. This does not free up the page cache.
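
In terms of how the file would be opened in each case, the options correspond roughly to the following (Linux flag names, illustrative only; "fadvise" here means issuing POSIX_FADV_DONTNEED on the written ranges):

import os

# Rough mapping of the four options above to open(2) flags (illustrative only).
OPTIONS = {
    1: {"flags": os.O_WRONLY | os.O_CREAT | os.O_DIRECT | os.O_SYNC, "fadvise": False},
    2: {"flags": os.O_WRONLY | os.O_CREAT | os.O_DIRECT,             "fadvise": False},
    3: {"flags": os.O_WRONLY | os.O_CREAT | os.O_SYNC,               "fadvise": True},
    4: {"flags": os.O_WRONLY | os.O_CREAT,                           "fadvise": True},
}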

To see swapping occur, run a memory-intensive job alongside the binary compiled with option 2 or option 4 (make sure SIZE and N are the same in direct.cpp and fadvise.cpp). A good choice is SIZE = 1024 * 1024 and N = 1024 * 30, i.e. writing 30 GB in 1 MB chunks. Then, while the write job is running, run this Python script:

import time
import numpy as np

start = time.time()
a = np.random.normal(size=(1024, 1024, 1024))  # allocates ~8 GB of float64
print(time.time() - start)

With option 2 (./direct & python script.py), the write itself takes 12s and the script takes 30s on my machine. With option 4 (./fadvise & python script.py), the write itself takes 25s and the script takes 40s, with the write job using up all free memory.
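
One simple way to watch the page cache fill up (assuming Linux) is to sample /proc/meminfo while the write job runs, e.g.:

import time

def meminfo(*fields):
    # Return the requested /proc/meminfo fields as {name: kB}.
    out = {}
    with open("/proc/meminfo") as f:
        for line in f:
            name, value = line.split(":", 1)
            if name in fields:
                out[name] = int(value.split()[0])
    return out

while True:
    print(meminfo("MemFree", "Cached", "Dirty"))
    time.sleep(1)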

Note that option 1 and option 3 can be made faster if the write size is very large. However I think a write size of around 1MB is maybe representative of what currently happens in Arrow, e.g. with the Parquet writer.

marsupialtail avatar Aug 16 '22 18:08 marsupialtail

With option 2 (./direct & python script.py), the write itself takes 12s and the script takes 30s on my machine. With option 4 (./fadvise & python script.py), the write itself takes 25s and the script takes 40s, with the write job using up all free memory.

Thanks for the results :-) Can you just describe the system you measured this on?

However I think a write size of around 1MB is maybe representative of what currently happens in Arrow, e.g. with the Parquet writer.

Yeah, probably.

pitrou avatar Aug 17 '22 09:08 pitrou

I measured this on my System76 Gazelle laptop with a Samsung 970 EVO Plus NVMe SSD and 32 GB of RAM.

marsupialtail avatar Aug 17 '22 20:08 marsupialtail

Ok, so let's just revert to the O_DIRECT proposal then :-) Thanks for taking the time to move this forward!

pitrou avatar Aug 18 '22 10:08 pitrou