trafficserver
TS_USE_MMAP: mmap+memcpy instead of pread+pwrite
apt install libaio-dev
./configure --enable-mmap --enable-experimental-linux-native-aio
./configure --enable-mmap
@cukiernik Is this supposed to be faster than pread/pwrite? Do you have any benchmarks?
The improvement is not large, single-digit percentages. The speedup comes from reducing system calls, but at the cost of increased page faults. I am testing it on a host with a huge amount of RAM. A read from the file happens only on a missing page (instead of a system call every time). Memory is written back to the file when the system runs low on free pages, so the performance of the I/O threads is very close to that of a swap file. Further acceleration can be achieved by optimizing agg_copy: mmap allows aggregating directly into the target memory, which will eliminate the memcpy. I expect this to be a bigger benefit than just switching to mmap.
Should I redo the push to restart the "expected checks" processes?
@cukiernik Have you run all the Au tests on a build with this configured?
Are you currently at Intel, can you talk about the applications you all are using ATS for?
@ywkaras: Yes, I ran the tests, but I'm not sure what you mean by "all"?
To run all the Au tests, you cd to the root directory of your clone of the trafficserver repo, build it with your branch and the new configure option, install it, and then do:
cd tests
./autest.sh --ats-bin <install-root-directory>/bin
It will take 20 - 40 minutes to run.
It would be good to do this twice, with and without the --enable-experimental-linux-native-aio configure option.
@cukiernik, a few percent already sounds significant 👍 this could be very useful. Have you run it without the RAM cache?
I think that for a write into a memory mapped file it has to page the data into memory first (a read), so that adds some overhead. We may be able to get some more performance here by writing with pwrite instead of memcpy. This also keeps write behaviour predictable, which is better if ATS crashes (which it never does of course). With memory mapped files it is not guaranteed that a valid index will have been written after a crash. A msync could solve this, but the order in which the msync writes the dirty pages is not guaranteed, so a bit more orchestration would be needed for msyncing the index + data. Worst case a lot of data will be written at a very inconvenient moment. Linux provides a consistent view between pwrites and memory mapped files, for other systems we'd have to check the behaviour.
You might want to take a look at this: https://github.com/apache/trafficserver/blob/3012ca0f94cf7202e75422757d7498bc7e4f9295/iocore/aio/AIO.cc#L598 I think we're hitting sync I/O here if the pages are not in memory, blocking the thread instead of using the original async nature of the function. I don't know if passing the read to a number of specialized threads would yield better performance, I suspect it will all depend on the workload. Mincore and msync can check whether pages are in memory, so that could be used to set up a fast (serve from memory) path, but the drawback would be the overhead of these functions.
Just checked whether ink_aio_read and ink_aio_write are used by anything else. In the C API we have TSAIOWrite + TSAIORead. This feature will break plugins which expect to be able to write past the end of the file with TSAIOWrite.
As the mmap stuff is experimental, I guess it is not a problem for now?
Hello, back after my long holidays. @ywkaras: I didn't run the Au tests, because pip and PyPI seem like a risk to my network security. @keesspoelstra: Yes, I ran it without the RAM cache. I was also wondering whether to force an msync after each modification using MS_ASYNC, or to add a dedicated thread for MS_SYNC. As a last resort, it is possible to mmap in parallel in traffic_crashlog and msync from there if TS_MAIN fails. I couldn't decide on any of these solutions, so I will propose a solution without them; it is not needed for correct operation. As I wrote above, I tested with a gigantic amount of RAM, and after modifying the whole cache, an msync with MS_SYNC takes more than a few minutes. For my use case, I don't need to keep the content cache across a shutdown; in that case MAP_ANONYMOUS starts faster. That was also suggested by @SolidWallOfCode, but with MAP_SHARED it is easier for me to trace and debug.
@moonchen is going to review this
I'm still reading through all the changes. One thing I've noticed is that there doesn't seem to be a bounds check for the memory-mapped I/O. This may introduce crashes or even security issues. I'm interested in hearing how others feel about this.
[approve ci autest]
This pull request has been automatically marked as stale because it has not had recent activity. Marking it stale to flag it for further consideration by the community.
@cukiernik do we still need this? If so, I can ask about it in the weekly meeting.