mat Memory mapped files for parsing storage (proposal for comment)

Open eclipsewebmaster opened this issue 9 months ago • 24 comments

| | |
| --- | --- |
| Bugzilla Link | 572512 |
| Status | NEW |
| Importance | P3 normal |
| Reported | Apr 01, 2021 01:18 EDT |
| Modified | May 19, 2022 12:37 EDT |
| Version | 1.11 |
| See also | Gerrit change 180543, Git commit 039199c8, Gerrit change 180553, Git commit 542be78a |
| Reporter | Jason Koch |

Description

Created attachment 286016
mmap_index_writer_base.patch

Parsing currently requires reading large arrays into heap storage, which puts a lower bound on the heap footprint of Pass1/Pass2 parsing.

This patch moves the parsing to off-heap storage, specifically to a file-backed region. This allows parsing larger files within a significantly smaller heap footprint.
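To make the idea concrete, here is a minimal sketch of a file-backed array of longs. It is not taken from the attached patch: the class name `MappedLongArray`, the `mat-index-` file prefix and the single-mapping size limit are all assumptions made for illustration.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.LongBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical illustration only; the patch's real classes may look different.
final class MappedLongArray implements AutoCloseable {
    private final Path file;
    private final RandomAccessFile raf;
    private final LongBuffer longs;

    MappedLongArray(int length) throws IOException {
        long bytes = (long) length * Long.BYTES;
        // A single FileChannel.map call cannot cover more than Integer.MAX_VALUE bytes.
        if (length < 0 || bytes > Integer.MAX_VALUE) {
            throw new IllegalArgumentException("length out of range for one mapping: " + length);
        }
        // Files.createTempFile without a directory uses java.io.tmpdir.
        file = Files.createTempFile("mat-index-", ".tmp");
        raf = new RandomAccessFile(file.toFile(), "rw");
        // Size the file first so the mapped region lies entirely inside it.
        raf.setLength(bytes);
        longs = raf.getChannel()
                .map(FileChannel.MapMode.READ_WRITE, 0, bytes)
                .asLongBuffer();
    }

    long get(int index) {
        return longs.get(index);
    }

    void set(int index, long value) {
        longs.put(index, value);
    }

    int length() {
        return longs.capacity();
    }

    @Override
    public void close() throws IOException {
        raf.close();
        try {
            Files.deleteIfExists(file);
        } catch (IOException e) {
            // Some platforms refuse to delete a file that is still mapped.
            file.toFile().deleteOnExit();
        }
    }
}
```

The array contents live in the page cache and the backing file rather than on the Java heap, so the heap cost is a small wrapper object regardless of `length`. A real implementation would need several mappings to go beyond the 2 GiB limit of a single one.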

Observations:

There is a choice here between on-heap and off-heap storage. On-heap is convenient in that the JVM's memory use is clearly bounded, but the downside of course is that it is clearly bounded. Off-heap relaxes this, especially when file-backed, as it allows the OS to spill pages to disk as needed and so handle vastly larger files with a given heap size.
A nice feature of off-heap storage is that sparse regions/arrays do not take up any physical space. For example, if an array is allocated that is N entries long but only 1/4 of those are used, a file system with sparse-file support (most modern OSes) will only allocate physical resources for what the application actually uses. For anonymous regions, only the pages that are used will be mapped. This differs from on-heap storage, where the memory is allocated/paged in even if it is not required.
There is also a choice between off-heap anonymous (using available RAM, spilled via swap) and off-heap named file (using available FS cache, or paged to a file on the file system). The current implementation picks a named file, but places it in Java's default /tmp. There is no guarantee this is fast or appropriate for the user's device.
Following comments on the list, I assume that most users are on SSDs, which support fast random access. Spinning-disk performance likely suffers if any spill is required -> in this case we should switch to anonymous off-heap, which keeps the storage in memory (subject to OS swap).
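The anonymous versus named-file choice above can be sketched with standard NIO calls. This is an assumption-laden illustration, not the patch's code: `OffHeapRegions`, `anonymous` and `named` are hypothetical names, and `ByteBuffer.allocateDirect` is used as the nearest standard stand-in for an anonymous mapping. Direct buffers are zero-initialised, so they should not be assumed to show the lazily committed behaviour described for anonymous regions.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper; names are illustrative and not taken from the patch.
final class OffHeapRegions {
    private OffHeapRegions() {
    }

    // Off-heap memory with no named backing file; it can only spill through OS swap.
    static ByteBuffer anonymous(int bytes) {
        return ByteBuffer.allocateDirect(bytes);
    }

    // Off-heap memory backed by a file in a caller-chosen directory,
    // instead of whatever java.io.tmpdir happens to point at.
    static MappedByteBuffer named(Path directory, int bytes) throws IOException {
        Path file = Files.createTempFile(directory, "mat-region-", ".tmp");
        file.toFile().deleteOnExit();
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "rw")) {
            // Whether the unwritten part of the file stays sparse depends on the file system.
            raf.setLength(bytes);
            // A mapping remains valid after the channel that created it is closed.
            return raf.getChannel().map(FileChannel.MapMode.READ_WRITE, 0, bytes);
        }
    }
}
```

Passing the directory explicitly is one way to let a user point the backing file at a fast local disk.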

Comments:

I haven't performed any speed-of-parsing performance testing. The difference between "cannot parse due to not enough RAM" and "can parse" is an easy call, but it is less obvious to me whether this model carries a penalty for heaps that fit in RAM. This warrants further testing if the approach looks sensible to others.
I have specifically written this so that, I think, it should be easy to swap in different implementations, such as off-heap file, off-heap anonymous, on-heap, etc., and stick to the interface. It's possible we could provide this as an option to the user, or even switch implementations dynamically?
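As a rough sketch of what swapping implementations behind one interface could look like: the interface `LongIndexStore`, its two implementations and the `Kind` enum below are hypothetical and are not the types defined in the patch.

```java
import java.nio.ByteBuffer;
import java.nio.LongBuffer;

// Hypothetical interface; the patch's actual abstraction may differ.
interface LongIndexStore {
    long get(int index);

    void set(int index, long value);

    int length();
}

// On-heap implementation: a plain long[].
final class HeapLongIndexStore implements LongIndexStore {
    private final long[] values;

    HeapLongIndexStore(int length) {
        values = new long[length];
    }

    public long get(int index) {
        return values[index];
    }

    public void set(int index, long value) {
        values[index] = value;
    }

    public int length() {
        return values.length;
    }
}

// Off-heap implementation over any ByteBuffer, direct or memory-mapped.
final class BufferLongIndexStore implements LongIndexStore {
    private final LongBuffer values;

    BufferLongIndexStore(ByteBuffer backing) {
        values = backing.asLongBuffer();
    }

    public long get(int index) {
        return values.get(index);
    }

    public void set(int index, long value) {
        values.put(index, value);
    }

    public int length() {
        return values.capacity();
    }
}

final class LongIndexStores {
    enum Kind { ON_HEAP, OFF_HEAP_ANONYMOUS }

    private LongIndexStores() {
    }

    // Callers depend only on LongIndexStore, so the kind could come from a user preference.
    static LongIndexStore create(Kind kind, int length) {
        switch (kind) {
            case ON_HEAP:
                return new HeapLongIndexStore(length);
            case OFF_HEAP_ANONYMOUS:
                int bytes = Math.multiplyExact(length, Long.BYTES);
                return new BufferLongIndexStore(ByteBuffer.allocateDirect(bytes));
            default:
                throw new IllegalArgumentException("unknown kind: " + kind);
        }
    }
}
```

Because a `MappedByteBuffer` is also a `ByteBuffer`, a file-backed variant could reuse `BufferLongIndexStore` unchanged. Switching implementations dynamically would additionally need a way to copy existing entries from one store to another.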

:notepad_spiral: mmap_index_writer_base.patch

May 08 '24 19:05 eclipsewebmaster
