garnet
garnet copied to clipboard
GC Hole for SpanByte and other ones
https://github.com/microsoft/garnet/blob/1e5c7ad3d80a57a58e233e76267e5a21f9d2113f/libs/storage/Tsavorite/cs/src/core/VarLen/SpanByte.cs#L61
this member is called here
https://github.com/microsoft/garnet/blob/1e5c7ad3d80a57a58e233e76267e5a21f9d2113f/libs/client/ClientSession/GarnetClientSessionMigrationExtensions.cs#L403
which this API is called from the public API TryWriteKeyValueSpanByte here
https://github.com/microsoft/garnet/blob/1e5c7ad3d80a57a58e233e76267e5a21f9d2113f/libs/client/ClientSession/GarnetClientSessionMigrationExtensions.cs#L296
which takes ref SpanByte.
It is a GC hole since it's never pinned or otherwise fixed.
This is also a GC hole, since the memory is not fixed (you can also just use Unsafe.SizeOf here):
https://github.com/microsoft/garnet/blob/1e5c7ad3d80a57a58e233e76267e5a21f9d2113f/libs/client/Utility.cs#L28
This is also seems to be a GC hole, since the memory is never fixed on this public API: https://github.com/microsoft/garnet/blob/1e5c7ad3d80a57a58e233e76267e5a21f9d2113f/libs/storage/Tsavorite/cs/src/core/VarLen/SpanByteComparer.cs#L25
(note this list is not exhaustive)
This is a GC hole:
https://github.com/microsoft/garnet/blob/1e5c7ad3d80a57a58e233e76267e5a21f9d2113f/libs/storage/Tsavorite/cs/src/core/Index/Interfaces/CallbackInfos.cs#L106
which is called by
https://github.com/microsoft/garnet/blob/1e5c7ad3d80a57a58e233e76267e5a21f9d2113f/libs/storage/Tsavorite/cs/src/core/Index/Interfaces/CallbackInfos.cs#L94
which is a public API
(found by @DaZombieKiller)
None of these seem like real issues. We know SpanByte structs are simply a view over pre-pinned memory buffers (in Tsavorite) or pinned pre-pinned network buffers (in Garnet's network layer). Garnet is being used as a server. At best, it might be a case of things being marked public when they shouldn't have?
None of these seem like real issues.
This project seems to use Unsafe.AsPointer a lot and it's difficult to trace down whether input is guaranteed to be on stack or not - a lot of methods/APIs seem to have a hidden "contract" for their ref X parameters and it's easy to misuse in future. GC Holes are a serious issue - they're extremely difficult to diagnose and may randomly manifest as NRE or even AVE crashing the whole application down once a week. I haven't investigated all of the listed uses, but e.g. this one: https://github.com/microsoft/garnet/blob/c16e1dd0160f0874134a19cd0fc31b2f1cea93a8/libs/client/Utility.cs#L27-L28 is a classic GC hole - GC may interupt in the middle of the method, move arr[0] pointer to a new location and, as the result, this function may return something negative or huge (pretty much, random).
PS: the project seems to also use GC.AllocateArray(, pinned: true) a lot - if it is used for short-lived objects it may create a terrible fragmentation in POH (when mixed with long-lived ones) as POH was designed specifically only for long-lived objects.
Got it, the GetSize happens rarely and only at load so it's possible we did not encounter the issue in practice, that should be fixed in a linked PR.
GC.AllocateArray is being used only for large pages that need to be pooled and pinned forever. In fact I think in some paths we unpin when returning to the pool. We should review all uses and improve this over time.
Note that in order to match and beat competing C++ caches, particularly for 99.9th percentile latencies, we had to do all the memory management ourselves when we could - which implied keeping pinned pages around for reuse over the lifetime of the server.
a lot of methods/APIs seem to have a hidden "contract" for their ref X parameters and it's easy to misuse in future
This is correct. It was cumbersome/impossible to use byte* everywhere, particularly as C#'s type system prevented it being a generic parameter. Thankfully this is internal to the codebase, where we know exactly what we are doing. We are definitely open to other ways of handling this, that do not involve performance impact.
particularly as C#'s type system prevented it being a generic parameter
This particular pain point can be worked around with a Pointer<T> struct, for example:
public readonly unsafe struct Pointer<T>
where T : unmanaged // not needed in C# 11
{
public readonly T* Value;
public Pointer(T* value) => Value = value;
public static implicit operator void*(Pointer<T> pointer) => pointer.Value;
public static implicit operator T*(Pointer<T> pointer) => pointer.Value;
// etc...
}
The main thing to keep in mind with this is that it can't be used as a return type in interop code, because it being a struct means it may be passed differently in the ABI.
I'd strongly suggest using a solution like this to handle generics instead of using refs to represent fixed addresses in the API. The usage of refs here is just so much more dangerous and much easier to get wrong despite how "safe" refs "feel" in comparison to pointers.
Even if you're as careful as possible, there is no safeguard from the language, runtime or type system that ensures you don't accidentally AsPointer an unfixed reference (something that is very easy to do by mistake when it's common to do in the codebase), and then you have a ticking GC hole time bomb.
This constructor is a good example:
https://github.com/microsoft/garnet/blob/c16e1dd0160f0874134a19cd0fc31b2f1cea93a8/libs/storage/Tsavorite/cs/src/core/VarLen/UnmanagedMemoryManager.cs#L26-L37
IMO it just shouldn't exist and callers should be required to use the pointer + length constructor, making it much less likely that they will accidentally pass in dangerous values. The convenience of just being able to pass a Span<T> in comes at a great safety cost that is only really mitigated by a small remark comment. Luckily, this constructor does not currently have any usages.
Thanks for the details. Yes, it should be possible to use wrapper structs throughout and avoid using refs to represent fixed addresses. We will be happy address this soon. The UnmanagedMemoryManager should be easily adjusted as well.
Just wanted to mention that Garnet has been running for many months in production use cases without issues, so this is not an immediate concern for users. Since Garnet (and definitely none of these above-mentioned APIs) is not programmatically accessed, there isn't a danger of users misusing it, just we system developers and future contributors. Yet, it makes sense to avoid where possible without loss of performance.
We are experiencing a memory leak issue in production and need help resolving. Might just be a configuration issue, but not sure since I can easily reproduce it with Garnet's defaults.
https://github.com/microsoft/garnet/issues/1097
Hard to believe that Garnet is considered production-ready (according to https://github.com/microsoft/garnet/issues/114), given that checkpoint and log formats still break across versions (according to https://microsoft.github.io/garnet/docs/welcome/faq)
We are experiencing a memory leak issue in production and need help resolving. Might just be a configuration issue, but not sure since I can easily reproduce it with Garnet's defaults.
The referenced issue is specifically related to expired objects, so handling on that thread.
checkpoint and log formats still break across versions
Garnet's checkpoint format changes very rarely. The formats have been remarkably stable until now, and going forward we expect only two breaking changes to storage (which will increment the minor version). We have to balance important new capabilities with format evolution, so as a research-oriented group we tend to favor the former. Our production partners invest to make sure they are able to migrate checkpoints and logs across versions if and when they break.
As far as the observations made in this original thread many months back re: unsafe code, these have been significantly improved over time and there are no known issues these days.