garnet icon indicating copy to clipboard operation
garnet copied to clipboard

GC Hole for SpanByte and other ones

Open hamarb123 opened this issue 1 year ago • 8 comments
trafficstars

https://github.com/microsoft/garnet/blob/1e5c7ad3d80a57a58e233e76267e5a21f9d2113f/libs/storage/Tsavorite/cs/src/core/VarLen/SpanByte.cs#L61

this member is called here

https://github.com/microsoft/garnet/blob/1e5c7ad3d80a57a58e233e76267e5a21f9d2113f/libs/client/ClientSession/GarnetClientSessionMigrationExtensions.cs#L403

which this API is called from the public API TryWriteKeyValueSpanByte here

https://github.com/microsoft/garnet/blob/1e5c7ad3d80a57a58e233e76267e5a21f9d2113f/libs/client/ClientSession/GarnetClientSessionMigrationExtensions.cs#L296

which takes ref SpanByte.

It is a GC hole since it's never pinned or otherwise fixed.

This is also a GC hole, since the memory is not fixed (you can also just use Unsafe.SizeOf here): https://github.com/microsoft/garnet/blob/1e5c7ad3d80a57a58e233e76267e5a21f9d2113f/libs/client/Utility.cs#L28

This is also seems to be a GC hole, since the memory is never fixed on this public API: https://github.com/microsoft/garnet/blob/1e5c7ad3d80a57a58e233e76267e5a21f9d2113f/libs/storage/Tsavorite/cs/src/core/VarLen/SpanByteComparer.cs#L25

(note this list is not exhaustive)

hamarb123 avatar Mar 19 '24 08:03 hamarb123

This is a GC hole:

https://github.com/microsoft/garnet/blob/1e5c7ad3d80a57a58e233e76267e5a21f9d2113f/libs/storage/Tsavorite/cs/src/core/Index/Interfaces/CallbackInfos.cs#L106

which is called by

https://github.com/microsoft/garnet/blob/1e5c7ad3d80a57a58e233e76267e5a21f9d2113f/libs/storage/Tsavorite/cs/src/core/Index/Interfaces/CallbackInfos.cs#L94

which is a public API

(found by @DaZombieKiller)

hamarb123 avatar Mar 19 '24 09:03 hamarb123

None of these seem like real issues. We know SpanByte structs are simply a view over pre-pinned memory buffers (in Tsavorite) or pinned pre-pinned network buffers (in Garnet's network layer). Garnet is being used as a server. At best, it might be a case of things being marked public when they shouldn't have?

badrishc avatar Mar 20 '24 00:03 badrishc

None of these seem like real issues.

This project seems to use Unsafe.AsPointer a lot and it's difficult to trace down whether input is guaranteed to be on stack or not - a lot of methods/APIs seem to have a hidden "contract" for their ref X parameters and it's easy to misuse in future. GC Holes are a serious issue - they're extremely difficult to diagnose and may randomly manifest as NRE or even AVE crashing the whole application down once a week. I haven't investigated all of the listed uses, but e.g. this one: https://github.com/microsoft/garnet/blob/c16e1dd0160f0874134a19cd0fc31b2f1cea93a8/libs/client/Utility.cs#L27-L28 is a classic GC hole - GC may interupt in the middle of the method, move arr[0] pointer to a new location and, as the result, this function may return something negative or huge (pretty much, random).

PS: the project seems to also use GC.AllocateArray(, pinned: true) a lot - if it is used for short-lived objects it may create a terrible fragmentation in POH (when mixed with long-lived ones) as POH was designed specifically only for long-lived objects.

EgorBo avatar Mar 20 '24 00:03 EgorBo

Got it, the GetSize happens rarely and only at load so it's possible we did not encounter the issue in practice, that should be fixed in a linked PR.

GC.AllocateArray is being used only for large pages that need to be pooled and pinned forever. In fact I think in some paths we unpin when returning to the pool. We should review all uses and improve this over time.

badrishc avatar Mar 20 '24 01:03 badrishc

Note that in order to match and beat competing C++ caches, particularly for 99.9th percentile latencies, we had to do all the memory management ourselves when we could - which implied keeping pinned pages around for reuse over the lifetime of the server.

badrishc avatar Mar 20 '24 01:03 badrishc

a lot of methods/APIs seem to have a hidden "contract" for their ref X parameters and it's easy to misuse in future

This is correct. It was cumbersome/impossible to use byte* everywhere, particularly as C#'s type system prevented it being a generic parameter. Thankfully this is internal to the codebase, where we know exactly what we are doing. We are definitely open to other ways of handling this, that do not involve performance impact.

badrishc avatar Mar 20 '24 01:03 badrishc

particularly as C#'s type system prevented it being a generic parameter

This particular pain point can be worked around with a Pointer<T> struct, for example:

public readonly unsafe struct Pointer<T>
    where T : unmanaged // not needed in C# 11
{
    public readonly T* Value;
    public Pointer(T* value) => Value = value;
    public static implicit operator void*(Pointer<T> pointer) => pointer.Value;
    public static implicit operator T*(Pointer<T> pointer) => pointer.Value;
    // etc...
}

The main thing to keep in mind with this is that it can't be used as a return type in interop code, because it being a struct means it may be passed differently in the ABI.

I'd strongly suggest using a solution like this to handle generics instead of using refs to represent fixed addresses in the API. The usage of refs here is just so much more dangerous and much easier to get wrong despite how "safe" refs "feel" in comparison to pointers.

Even if you're as careful as possible, there is no safeguard from the language, runtime or type system that ensures you don't accidentally AsPointer an unfixed reference (something that is very easy to do by mistake when it's common to do in the codebase), and then you have a ticking GC hole time bomb.

This constructor is a good example: https://github.com/microsoft/garnet/blob/c16e1dd0160f0874134a19cd0fc31b2f1cea93a8/libs/storage/Tsavorite/cs/src/core/VarLen/UnmanagedMemoryManager.cs#L26-L37 IMO it just shouldn't exist and callers should be required to use the pointer + length constructor, making it much less likely that they will accidentally pass in dangerous values. The convenience of just being able to pass a Span<T> in comes at a great safety cost that is only really mitigated by a small remark comment. Luckily, this constructor does not currently have any usages.

DaZombieKiller avatar Mar 20 '24 01:03 DaZombieKiller

Thanks for the details. Yes, it should be possible to use wrapper structs throughout and avoid using refs to represent fixed addresses. We will be happy address this soon. The UnmanagedMemoryManager should be easily adjusted as well.

Just wanted to mention that Garnet has been running for many months in production use cases without issues, so this is not an immediate concern for users. Since Garnet (and definitely none of these above-mentioned APIs) is not programmatically accessed, there isn't a danger of users misusing it, just we system developers and future contributors. Yet, it makes sense to avoid where possible without loss of performance.

badrishc avatar Mar 20 '24 21:03 badrishc

We are experiencing a memory leak issue in production and need help resolving. Might just be a configuration issue, but not sure since I can easily reproduce it with Garnet's defaults.

https://github.com/microsoft/garnet/issues/1097

Hard to believe that Garnet is considered production-ready (according to https://github.com/microsoft/garnet/issues/114), given that checkpoint and log formats still break across versions (according to https://microsoft.github.io/garnet/docs/welcome/faq)

plhester avatar Mar 13 '25 14:03 plhester

We are experiencing a memory leak issue in production and need help resolving. Might just be a configuration issue, but not sure since I can easily reproduce it with Garnet's defaults.

The referenced issue is specifically related to expired objects, so handling on that thread.

checkpoint and log formats still break across versions

Garnet's checkpoint format changes very rarely. The formats have been remarkably stable until now, and going forward we expect only two breaking changes to storage (which will increment the minor version). We have to balance important new capabilities with format evolution, so as a research-oriented group we tend to favor the former. Our production partners invest to make sure they are able to migrate checkpoints and logs across versions if and when they break.

As far as the observations made in this original thread many months back re: unsafe code, these have been significantly improved over time and there are no known issues these days.

badrishc avatar Mar 13 '25 17:03 badrishc