
Efficiently storing messages to disk in v8

Open barclayadam opened this issue 3 years ago • 5 comments

I'm investigating an upgrade from v7 to v8. We have an IMessageStore that stores the TO and FROM addresses and some header information to disk, and then streams the body as-is.

Given the changes to Pipelines and ReadOnlySequence<byte> I'm trying to figure out the most efficient way of now achieving this.

I've tried:

  • A naive loop over the buffer's segments with a call to WriteAsync for each. This is terrible
  • Wrapping the FileStream in a BufferedStream. This helps
  • Loading everything into a MemoryStream and making one call to WriteAsync on the FileStream. This is the fastest, but it defeats the purpose of streaming

I've not done a deep dive into exactly why streaming directly is so inefficient, but I suspect it's because the buffer has so many segments. For a file that ends up being 6,074 KB, there are 6,219,045 segments in the buffer, which seems horribly inefficient. For larger files this becomes a huge number.

I think this is because of https://github.com/cosullivan/SmtpServer/blob/master/Src/SmtpServer/IO/PipeReaderExtensions.cs#L153 in particular; if I'm reading it correctly, it slices the incoming data into 3-byte segments.

Would be interested to hear your thoughts on this, and whether there is something that can be improved in SmtpServer itself

For reference the 3 implementations from above are:

await using var fileStream = new FileStream(
    dataFilePath,
    FileMode.Create,
    FileAccess.Write,
    FileShare.None,
    2048,
    FileOptions.WriteThrough | FileOptions.Asynchronous);

Direct write

var position = buffer.GetPosition(0);
while (buffer.TryGet(ref position, out var memory))
{
    await fileStream.WriteAsync(memory, cancellationToken);
}

await fileStream.FlushAsync(cancellationToken);

Buffered write

var output = new BufferedStream(fileStream, 2048);

foreach (var x in buffer)
{
    await output.WriteAsync(x, cancellationToken);
}

await output.FlushAsync(cancellationToken);
await fileStream.FlushAsync(cancellationToken);

MemoryStream write

var memoryStream = new MemoryStream();

foreach (var x in buffer)
{
    await memoryStream.WriteAsync(x, cancellationToken);
}

memoryStream.Position = 0;

await memoryStream.CopyToAsync(fileStream, cancellationToken);
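A fourth option worth considering (my sketch, not part of the original report): keep the single large write of the MemoryStream approach, but rent the intermediate buffer from ArrayPool&lt;byte&gt; so no large array lands on the GC heap. This assumes `buffer` is the ReadOnlySequence&lt;byte&gt; handed to the message store, `fileStream` and `cancellationToken` are as above, and the message size fits in an int.

```csharp
// Requires: using System.Buffers;
// Linearise the whole sequence into one pooled buffer, then issue a
// single large write. Avoids the MemoryStream's heap allocations.
var length = (int)buffer.Length;               // assumes message size fits in an int
var rented = ArrayPool<byte>.Shared.Rent(length);
try
{
    buffer.CopyTo(rented);                     // one copy across all segments
    await fileStream.WriteAsync(rented.AsMemory(0, length), cancellationToken);
}
finally
{
    ArrayPool<byte>.Shared.Return(rented);
}
```

Note that Rent may return an array larger than requested, which is why the write is sliced to `length`.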

barclayadam avatar Jan 28 '21 13:01 barclayadam

Hi,

The change to using Pipelines was to avoid (or, more accurately, delay) as many of the memory allocations as possible. Version 7 used to work by reading the data from the socket and copying it into a MemoryStream, which was then passed through to the IMessageStore.

Using Pipelines means that SmtpServer can pass through a reference to the data directly from the socket's buffer and avoid allocations. Whilst that looks good for the benchmarking of SmtpServer, I envisaged that in some cases it would just shift the cost, in that the consumer of the package might need to copy that buffer to a stream anyway, making overall performance similar to v7.

If you are seeing that many segments then this is likely to cause some performance issues. The dot stuffing shouldn't be occurring all the time unless your message content has a lot of lines that start with a "." character, or there is a bug in the code somewhere.
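For context (my illustration, not part of the original comment): SMTP dot stuffing, per RFC 5321 section 4.5.2, doubles a leading "." on any body line so the body cannot contain the bare "." line that terminates DATA, and the receiver strips the extra dot back off. A minimal per-line sketch of the receiving side:

```csharp
// RFC 5321 §4.5.2: the sender transmits ".." for a body line that starts
// with ".", so the receiver removes one leading dot to recover the line.
static string Unstuff(string wireLine) =>
    wireLine.StartsWith(".") ? wireLine.Substring(1) : wireLine;

// Unstuff("..hidden") == ".hidden"
// Unstuff("hello")    == "hello"
```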

Are you able to share a copy of the message you are testing this with so I can look at what the issues might be?

Thanks, Cain.

cosullivan avatar Jan 29 '21 03:01 cosullivan

Hi @cosullivan

Thanks for getting back so quickly. If I must read fully into memory, and that is the same as v7, that would be OK, but from limited testing I believe we should be able to get much closer to streaming directly and avoid that (I'm not interested in parsing the email, I just want to store it to disk as quickly as possible).

I have tried with a few files and they all seem to end up with thousands of segments.

As an example I grabbed the first image file from Unsplash:

[attached image: unsplash-xps.png]

That file is 4,387,297 bytes and results in 6,012,802 segments (that count is simply from outputting buffer.Length in the IMessageStore). For reference I use MailKit to create the test email:

var message = new MimeMessage();

var from = new MailboxAddress("Admin", "[email protected]");
message.From.Add(from);

var to = new MailboxAddress("User", "[email protected]");
message.To.Add(to);

message.Subject = "This is email subject";

var bodyBuilder = new BodyBuilder
{
    HtmlBody = "<h1>Hello World!</h1>",
};

bodyBuilder.Attachments.Add("D:\\unsplash-xps.png");

message.Body = bodyBuilder.ToMessageBody();

barclayadam avatar Feb 03 '21 12:02 barclayadam

Hi @barclayadam,

The number you are seeing here is not actually the number of segments in the buffer, but rather the total size of the buffer in bytes.

I created the MimeMessage using MailKit and then saved that file to disk, and Windows Explorer reports the file size as 6,012,854 bytes (MIME encoding of an image expands the file size by quite some margin).


For me there are only 1,468 segments in the buffer, you can count that by enumerating the items in the buffer;

var count = 0;
var size = 0;
foreach (var memory in buffer)
{
    count++;
    size += memory.Length;
}
Console.WriteLine("Segments={0}", count);
Console.WriteLine("Total Segment Length={0}", size);
Console.WriteLine("Expected Size from Disk={0}", 6012856);
Console.WriteLine("buffer.Length={0}", buffer.Length);

which gives the following output;

Segments=1468
Total Segment Length=6012854
Expected Size from Disk=6012856
buffer.Length=6012854

All but one of those segments are 4096 bytes in length, as that is the size that the network socket is reading. I added that new message to the benchmark and got the following, which doesn't indicate anything out of the ordinary.

| Method |        Mean |     Error |      StdDev |     Gen 0 |    Gen 1 | Gen 2 |  Allocated |
|------- |------------:|----------:|------------:|----------:|---------:|------:|-----------:|
|  Send1 |    302.6 us |   5.68 us |     5.31 us |    8.7891 |        - |     - |   26.52 KB |
|  Send2 |  1,524.7 us |  14.65 us |    12.99 us |   13.6719 |        - |     - |   42.26 KB |
|  Send3 | 23,940.3 us | 476.51 us | 1,304.44 us |  625.0000 | 187.5000 |     - | 1911.54 KB |
|  Send4 | 50,694.5 us | 982.15 us | 1,470.03 us | 2000.0000 | 900.0000 |     - | 6159.67 KB |

I suspect the issue is purely around the performance of writing to disk, which would make sense in that a single write from a memory stream would be the quickest option.

I don't think there are any .NET APIs yet that allow file IO using Pipelines, but I will see if there is a way I can add an option to increase the internal buffer size of 4096 bytes.

Thanks, Cain.

cosullivan avatar Feb 03 '21 13:02 cosullivan

@barclayadam

Actually, try writing with your first option (the Direct Write) but increase your FileStream buffer from 2048 to 32768 and see if that makes a difference, as that should give close to the same effect as increasing any internal buffers.
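Concretely (my sketch of the suggestion, reusing the FileStream setup from the snippet earlier in the thread), that would look like:

```csharp
// Same Direct Write, but with the FileStream's internal buffer raised from
// 2048 to 32768 bytes so the 4096-byte pipeline segments coalesce into
// fewer physical writes. (WriteThrough is kept only to match the original
// snippet; dropping it may speed things up further.)
await using var fileStream = new FileStream(
    dataFilePath,
    FileMode.Create,
    FileAccess.Write,
    FileShare.None,
    32768,
    FileOptions.WriteThrough | FileOptions.Asynchronous);

foreach (var memory in buffer)
{
    await fileStream.WriteAsync(memory, cancellationToken);
}

await fileStream.FlushAsync(cancellationToken);
```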

cosullivan avatar Feb 03 '21 13:02 cosullivan

@cosullivan

Sorry for the delay getting back to you. I did some further testing and, you are right, increasing the FileStream buffer does, of course, improve performance.

It does not get on par with reading everything into memory and then copying across to the FileStream, but it does fare slightly better on peak memory consumption.

Results below are sending 100 emails that are just over 8MB on disk when saved (note that this is full send and ack times from another local program):

  • Direct Write w/ 2048 buffer: avg 1174ms
  • Direct Write w/ 32,768 buffer: avg 160ms
  • Direct Write w/ 65,536 buffer: avg 136ms
  • Direct Write w/ 131,072 buffer: avg 110ms
  • Direct Write w/ 196,608 buffer: avg 110ms

  • Full in-memory copy w/ 32,768 buffer: avg 115ms
  • Full in-memory copy w/ 196,608 buffer: avg 103ms

For an email that results in a 121,454 KB file on disk (our service supports emails up to 150 MB; not the average size we'll see, but we need to test those numbers to ensure nothing wild is happening):

  • Direct Write w/ 131,072 buffer: avg 1486ms
  • Direct Write w/ 196,608 buffer: avg 1594ms
  • Full in-memory copy w/ 196,608 buffer: avg 1491ms

I think the much larger buffers end up with a high GC cost, because I notice that the time it takes to save to disk slows down over time, i.e. the first emails save in around 300ms, but by the end I'm seeing >600ms (note this is the time to just save to disk, which is why it's smaller than the values above). I'm not sure if this is a symptom of the OS and hardware throttling, or something within the application itself.
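One possible explanation for the slowdown (my note, not from the thread): .NET allocates arrays of 85,000 bytes or more on the Large Object Heap, which is only collected during Gen 2 collections, so a 131,072- or 196,608-byte FileStream buffer per message adds LOH/Gen 2 pressure that could show up as a gradual slowdown. A quick way to see which heap a buffer lands on:

```csharp
// Arrays of >= 85,000 bytes go to the Large Object Heap (LOH), which the
// runtime reports as generation 2; small buffers start in generation 0.
var small = new byte[2048];
var large = new byte[131072];
Console.WriteLine(GC.GetGeneration(small));   // typically 0
Console.WriteLine(GC.GetGeneration(large));   // 2 (LOH)
```

Pooling the buffer (e.g. via ArrayPool&lt;byte&gt;) is the usual way to avoid allocating on the LOH per message.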

Something interesting I see, which is probably nothing of note, is a 1 byte difference in email sizes when writing to disk (as reported by buffer.Length). The exact same message is sent 100 times, but the buffer length flips between 124,368,199 and 124,368,198 in no particular pattern. Thought it was an odd thing to see.

TL;DR: given the current implementation there's probably not much more that can be done with the available .NET APIs, but it would be interesting to see whether having control over the internal buffer size can help.

barclayadam avatar Feb 23 '21 18:02 barclayadam