ImHex Memory usage scales poorly with array size

Memory usage scales poorly with array size

Open jam1garner opened this issue 2 years ago • 1 comments

Minimum Reproduction

Create a sufficiently large file

head -c 500M < /dev/zero > test.bin

Open in imhex

imhex test.bin

Create a somewhat large (only 8 MiB) byte array to

u8 data[0x800000] @ 0;

Hit run (wait for it to evaluate)

On my system this hits roughly 4 GiB of memory, doing a larger size (example: 0x10000000) will cause too much memory pressure and triggering oomkiller.

Further thoughts

It seems like this means approximately 512 bytes of memory per u8 (8 MiB of data on disk becomes 4 GiB in memory). From some further experimentation, this scales roughly linearly (2 MiB became ~1GiB).

(Link to previous Discord discussion)

Investigating Root Cause

PatternDataArray allocates an std::vector<PatternData*> which will result in:
- n * 216 bytes - PatternData itself
- n * 8 bytes - pointer to PatternData
- 240 bytes - PatternDataArray itself
- n * len(m_variableName) - (Not sure how this is initialized for arrays?)
- n * len(m_typeName) - (very good candidate for string interning, frankly)
- n * ~8 * len(m_highlightedAddresses) - not sure if this is used for elements of arrays
Total: minumum 224 bytes per array item, reality is ~258? (this is excluding a bit of metadata for both the allocator and std::map internals, which bump this up a notable amount)

(All sizes tested for compiler g++-10 (Ubuntu 10.3.0-1ubuntu1~20.04) 10.3.0 on x64)

Possible Solutions/Mitigations

Introduce PatternDataArrayEntry and PatternDataBase, with the goal of splitting PatternData into two classes: PatternDataBase, the minimum features (i.e. everything that's applicable for both array entries and normal variables) and PatternData (which includes everything a normal variable would need that an array doesn't need). PatternDataArrayEntry is a thin subclass of PatternDataBase that would pretty much only be used for overriding virtual methods with array-entry-specific behavior. See graph below about what a potential field split would look like.

This would maybe also allow for reducing memory fragmentation of arrays by removing the level of indirection in the vector (i.e. std::vector<PatternData*> becomes std::vector<PatternDataArrayEntry> (notice: entries stored inline in the vector, not a separate allocation each))

Edit: some mistakes in the above graph--m_color should be in PatternData, m_endian proably(?) shouldn't (as I would imagine arrays have homogenous endianesses (following similar thinking to why m_typeName should be removed from it--arrays have homogenous types))

Testing above solution

After testing the above to see what the size differences would look like (not a functional implementation, just observing how struct changes would affect memory layout), this would result in a lower bound of 7-10x reduction in size per entry, with an upper bound of ~20x per entry. However some form of sparse representation for arrays (i.e. entries are calculated on-the-fly from per-array data, rather than per-entry data being held at all times) would result in greater improvements.

Aug 01 '21 00:08 jam1garner

My current observations show that an array of your size, if applied to a 523'417'318 bytes file, results in a memory increase of ~512 Mb. This means that we've got ~64 bytes per u8 now.

Nov 09 '23 18:11 Nemoumbra

ImHex ImHex copied to clipboard

Memory usage scales poorly with array size

Minimum Reproduction

Further thoughts

Investigating Root Cause

Possible Solutions/Mitigations

Testing above solution

ImHex
ImHex copied to clipboard