ARROW-14635: [Python][C++] implement fadvise
Following https://github.com/apache/arrow/pull/13640, it seems that O_DIRECT is not a good idea, so let's use posix_fadvise to control the page cache and address the issue described in https://issues.apache.org/jira/browse/ARROW-14635.
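At the OS level, the idea is roughly: write the data, flush it, then advise the kernel that the pages will not be needed again. A minimal Python sketch of that pattern (Linux-only, using plain os calls; not the actual C++ code in this PR):

import os

# Sketch of the fadvise pattern: write, flush, then drop the pages.
fd = os.open("bump", os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
buf = b"1" * (1024 * 1024)
for _ in range(1024):
    os.write(fd, buf)
# POSIX_FADV_DONTNEED only evicts clean pages, so the data must be flushed
# first; otherwise the dirty pages stay in the page cache.
os.fsync(fd)
os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
os.close(fd)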
To test it, use the simple Python script below:

import pyarrow.fs as fs

SIZE = 1024 * 1024
N = 1024 * 10

# da = fs.LocalFileSystem(reuse=False)  # this will turn on fadvise and disable page cache for the writes
da = fs.LocalFileSystem(reuse=True)
s = da.open_output_stream("bump")
a = bytes("1", "utf-8") * SIZE
for i in range(N):
    s.write(a)
s.close()
https://issues.apache.org/jira/browse/ARROW-14635
The checks that fail do so because posix_fadvise doesn't work on Windows/macOS, so I have gated the code accordingly.
For the record, I'll be out until the end of next week, but this PR doesn't strike me as high priority ;-)
Addressed all documentation-related comments.
Actually, I misunderstood the requirements. I think synchronous writing is not required in this use case, and it is okay to use O_DIRECT without O_SYNC, letting the SSD's own cache speed up the write. In this case each write is not immediately persisted, but it stays out of the page cache, achieving the objective listed first in the JIRA: reducing memory usage.
In this case, O_DIRECT without O_SYNC is nearly 20x faster than fadvise + O_SYNC on my system. fadvise without O_SYNC fails to reduce page cache memory usage. (Note: I still sync upon closing the file.)
I recommend we revert to this PR: https://github.com/apache/arrow/pull/13640.
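For illustration, a minimal Python sketch of the O_DIRECT-without-O_SYNC pattern (Linux-only, plain os calls; a sketch of the approach, not the proposed implementation):

import mmap
import os

SIZE = 1024 * 1024
# O_DIRECT requires the buffer address, the write length, and the file offset
# to be aligned to the filesystem block size; an anonymous mmap is page-aligned.
buf = mmap.mmap(-1, SIZE)
buf.write(b"1" * SIZE)
fd = os.open("bump", os.O_WRONLY | os.O_CREAT | os.O_TRUNC | os.O_DIRECT, 0o644)
for _ in range(1024 * 10):
    os.write(fd, buf)
# No per-write sync; only sync once when closing the file, as noted above.
os.fsync(fd)
os.close(fd)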
fadvise without O_SYNC fails to reduce page cache memory usage
Still the same question: how is that a problem?
Still the same question: how is that a problem?
@pitrou Reducing the page cache memory usage is the goal of this feature. Per: https://issues.apache.org/jira/browse/ARROW-14635
The goal would be to allow writing a large dataset without significantly impacting the server (this is important if the server is a desktop/laptop being actively used). Filling the page cache with dirty pages leads to unnecessary swapping of active user processes.
Filling the page cache with dirty pages leads to unnecessary swapping of active user processes.
The pages shouldn't be dirty if they have been written out, should they? Also, does the swapping also occur with fadvise?
Let's review our options here (the corresponding open flags are sketched after the list):
- O_DIRECT with O_SYNC. Horribly slow, with the same problems as plain O_DIRECT; not worth it if you don't need to persist every write.
- O_DIRECT without O_SYNC. This is my preferred option. This does not persist each write immediately onto the SSD but uses SSD cache to move data off the page cache to save memory. This does not offer persistence for fault tolerance but saves memory for our purposes.
- fadvise with O_SYNC. Horribly slow. (> 15x slower than option 2 on my machine)
- fadvise without O_SYNC. This does not free up the page cache.
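Roughly, the flag combinations being compared are (a Python-level sketch, Linux-only; the actual benchmarks direct.cpp and fadvise.cpp are C++, and the variable names here are made up):

import os

base = os.O_WRONLY | os.O_CREAT
option1 = base | os.O_DIRECT | os.O_SYNC  # 1: bypass page cache, sync every write
option2 = base | os.O_DIRECT              # 2: bypass page cache, sync only on close
option3 = base | os.O_SYNC                # 3: plus posix_fadvise(POSIX_FADV_DONTNEED) after each write
option4 = base                            # 4: plus posix_fadvise(POSIX_FADV_DONTNEED) after each write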
To see swapping occur, run a memory-intensive job alongside the binary compiled with option 2 or option 4 (make sure SIZE and N are the same in direct.cpp and fadvise.cpp). A good choice is SIZE = 1024 * 1024 and N = 1024 * 30, i.e. writing 30 GB in 1 MB writes. Then, while the write job is running, run this Python script:
import time
import numpy as np

start = time.time()
# Allocate 1024**3 float64 values, i.e. ~8 GB, to put pressure on memory.
a = np.random.normal(size=(1024, 1024, 1024))
print(time.time() - start)
With option 2 (./direct & python script.py), the write itself takes 12s and the script takes 30s on my machine. With option 4 (./fadvise & python script.py), the write itself takes 25s and the script takes 40s on my machine, with it using up all free memory.
Note that options 1 and 3 can be made faster if the write size is very large. However, I think a write size of around 1 MB is roughly representative of what currently happens in Arrow, e.g. with the Parquet writer.
With option 2 (./direct & python script.py), the write itself takes 12s and the script takes 30s on my machine. With option 4 (./fadvise & python script.py), the write itself takes 25s and the script takes 40s on my machine, with it using up all free memory.
Thanks for the results :-) Can you just describe the system you measured this on?
However, I think a write size of around 1 MB is roughly representative of what currently happens in Arrow, e.g. with the Parquet writer.
Yeah, probably.
I measured this on my System76 Gazelle laptop with a Samsung 970 EVO Plus NVMe SSD and 32 GB of RAM.
Ok, so let's just revert to the O_DIRECT proposal then :-) Thanks for taking the time to move this forward!