EDI.Net
Performance & memory usage
I'm having to parse 500+ MB EDIFACT INVOIC files, but I'm running into issues even with small files that are only 10 MB. I see the sample files are all in the 1-5 KB range (I unfortunately cannot include a sample file, as it contains sensitive information).
I would like to know whether you have had any success with this library on files of this size.
This is interesting. I have used files of about 5-15 MB for my own clients in the past. I think it will be easy to reproduce and find the bottleneck if we create a test for this by copy-pasting a loop into one of my existing tests. One last question: which framework and version are you using (net45, net47, netcoreapp2.1, netcoreapp2.2, netcoreapp3.1), x32 or x64?
One more thing: are you experiencing a `System.OutOfMemoryException` (you can check your EventLog for that)? What exactly is the issue?
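To make the two questions above concrete, here is a minimal sketch of how one could check process bitness, catch an out-of-memory condition, and report the peak working set around a workload. The allocation loop is only a stand-in for the deserialization call; nothing here is EDI.Net-specific.

```csharp
using System;
using System.Diagnostics;

class MemoryCheck
{
    static void Main()
    {
        // A 32-bit process caps out around 2-3 GB of address space,
        // which is why x32 vs x64 matters for large files.
        Console.WriteLine("64-bit process: {0}", Environment.Is64BitProcess);

        try
        {
            // Stand-in workload: replace with the actual deserialization.
            for (int i = 0; i < 1000; i++)
            {
                var tmp = new byte[64 * 1024];
                GC.KeepAlive(tmp);
            }
        }
        catch (OutOfMemoryException ex)
        {
            Console.WriteLine("Out of memory: {0}", ex.Message);
            return;
        }

        using (var proc = Process.GetCurrentProcess())
        {
            Console.WriteLine("Peak working set: {0:N0} bytes", proc.PeakWorkingSet64);
        }
    }
}
```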
I just tested on an average machine on .NET Core 3.1 and it took about 3 minutes to load & deserialize a file of:
- about 22 Megs
- a million rows
- 78880 Messages in the transaction
Memory consumption peaked at around 150 MB. In my simple test I did not notice any unusual memory behavior or any indication of a memory leak.
To help further, you need to provide us with a sample transmission containing only a couple of messages (whose content you could obfuscate by hand, of course) along with your model, and I will bloat it into a huge file and see where the problem is.
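Bloating a small sample into a stress-test file can be sketched as below: take the segments between the `UNB` header and `UNZ` trailer and repeat them. The sample interchange string, the output filename, and the segment terminator `'` are all illustrative assumptions; a real generator would also renumber each message's `UNH` reference.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text;

class Bloat
{
    static void Main()
    {
        const int copies = 3; // bump to thousands for a real stress file
        var sample = "UNB+UNOA:1+SENDER+RECEIVER+200101:0000+1'" +
                     "UNH+1+INVOIC:D:96A:UN'BGM+380+42'UNT+3+1'" +
                     "UNZ+1+1'";

        // Split on the segment terminator (assumes no release characters
        // in the envelope) and re-append the terminator to each segment.
        var segments = sample.Split('\'')
                             .Where(s => s.Length > 0)
                             .Select(s => s + "'")
                             .ToList();

        var header  = segments.First();                             // UNB
        var message = segments.Skip(1).Take(segments.Count - 2);    // UNH..UNT

        var sb = new StringBuilder(header);
        for (int i = 0; i < copies; i++)
        {
            foreach (var seg in message) sb.Append(seg);
        }
        // The first UNZ element is the message count; rebuild it to match.
        sb.AppendFormat("UNZ+{0}+1'", copies);

        File.WriteAllText("bloated.edi", sb.ToString());
        Console.WriteLine("Wrote {0:N0} characters", sb.Length);
    }
}
```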
Thanks,
C.
Maybe my models are not defined correctly? I've attached a sample project that contains the models (interchange.cs) and a small sample EDI file.
Note that the actual files I'm processing have ratios as follows (names match the model classes): 1 Interchange containing 1 Group; the Group containing 5,000 Messages; each Message containing 1 Invoice; each Invoice containing 500,000 Line Items.
Another point related to this: you said it took 3 minutes to load & deserialize a 22 MB file. That sounds rather slow, considering that with some basic code it's possible to iterate over every Segment/Element/Value in a 300+ MB file in less than 20 seconds. Obviously there is more to it than just reading the file, but I would not expect the mapping to have such a massive impact on processing.
Below is some sample code that reads each part of an EDIFACT file using the `System.Buffers` & `System.IO.Pipelines` features of .NET:
```csharp
using System;
using System.Buffers;
using System.IO;
using System.IO.Pipelines;
using System.Text;
using System.Threading.Tasks;

class Program
{
    // EDIFACT default service characters (a UNA segment can override these).
    const byte SegmentTerminator = (byte)'\'';
    const byte DataElementSeparator = (byte)'+';
    const byte ComponentDataElementSeparator = (byte)':';
    const byte ReleaseCharacter = (byte)'?';

    static async Task Main(string[] args)
    {
        if (args.Length < 1)
        {
            args = new[] { Console.ReadLine() };
        }

        int segmentCount = 0;
        var stopwatch = System.Diagnostics.Stopwatch.StartNew();
        var encoding = Encoding.UTF8;

        using (var filestream = new FileStream(args[0], FileMode.Open, FileAccess.Read))
        {
            var reader = PipeReader.Create(filestream);
            while (true)
            {
                var result = await reader.ReadAsync();
                var buffer = result.Buffer;

                while (TryReadSegment(ref buffer, out var segment))
                {
                    segmentCount++;
                    // To print the segment:
                    // Console.WriteLine(encoding.GetString(segment.ToArray()));
                    while (TryReadComponent(ref segment, out var component))
                    {
                        while (TryReadElement(ref component, out var element))
                        {
                        }
                    }
                }

                // Everything before buffer.Start is consumed; the rest is examined.
                reader.AdvanceTo(buffer.Start, buffer.End);
                if (result.IsCompleted) break;
            }
            await reader.CompleteAsync();
            stopwatch.Stop();
        }

        Console.WriteLine("Segments: {0:N0}", segmentCount);
        Console.WriteLine("Duration: {0}ms", stopwatch.ElapsedMilliseconds);
        Console.ReadKey();
    }

    // Slices off the next segment, honoring '?' as the release (escape) character.
    static bool TryReadSegment(ref ReadOnlySequence<byte> buffer, out ReadOnlySequence<byte> segment)
    {
        if (buffer.Length == 0)
        {
            segment = ReadOnlySequence<byte>.Empty;
            return false;
        }

        var reader = new SequenceReader<byte>(buffer);
        if (reader.TryReadTo(out segment, SegmentTerminator, delimiterEscape: ReleaseCharacter, advancePastDelimiter: true))
        {
            buffer = buffer.Slice(segment.Length + 1); // +1 skips the terminator
            return true;
        }
        return false; // incomplete segment; wait for more data
    }

    // Splits a segment on '+' (the data element separator); the last piece
    // has no trailing separator, so a failed TryReadTo still yields it.
    static bool TryReadComponent(ref ReadOnlySequence<byte> segment, out ReadOnlySequence<byte> component)
    {
        if (segment.Length == 0)
        {
            component = ReadOnlySequence<byte>.Empty;
            return false;
        }

        var reader = new SequenceReader<byte>(segment);
        if (!reader.TryReadTo(out component, DataElementSeparator, delimiterEscape: ReleaseCharacter, advancePastDelimiter: true))
        {
            component = segment.Slice(0);
            segment = segment.Slice(component.Length);
        }
        else
        {
            segment = segment.Slice(component.Length + 1);
        }
        return true;
    }

    // Splits a data element on ':' (the component data element separator).
    static bool TryReadElement(ref ReadOnlySequence<byte> component, out ReadOnlySequence<byte> element)
    {
        if (component.Length == 0)
        {
            element = ReadOnlySequence<byte>.Empty;
            return false;
        }

        var reader = new SequenceReader<byte>(component);
        if (!reader.TryReadTo(out element, ComponentDataElementSeparator, delimiterEscape: ReleaseCharacter, advancePastDelimiter: true))
        {
            element = component.Slice(0);
            component = component.Slice(element.Length);
        }
        else
        {
            component = component.Slice(element.Length + 1);
        }
        return true;
    }
}
```
@adminnz thanks for this! Feel free to add an additional EdiTextReader via pull request. That is what I was planning to do sometime in the future. I should create a specific version for netcoreapp2.0 or netstandard2.1, though.
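For the multi-targeting mentioned above, a hypothetical project-file fragment could look like the following. `SequenceReader<T>` and the in-box pipeline types require netstandard2.1 (or netcoreapp3.0+), so the pipeline-based reader would be compiled only for that target; the package version shown is illustrative.

```xml
<Project Sdk="Microsoft.NET.Sdk">
  <PropertyGroup>
    <TargetFrameworks>netstandard2.0;netstandard2.1</TargetFrameworks>
  </PropertyGroup>
  <!-- SequenceReader<T> is unavailable before netstandard2.1, so the
       fast reader would sit behind a TargetFramework condition. -->
  <ItemGroup Condition="'$(TargetFramework)' == 'netstandard2.1'">
    <PackageReference Include="System.IO.Pipelines" Version="4.7.0" />
  </ItemGroup>
</Project>
```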
That said, I am sure there is much room for improvement, but I suspect the bottleneck is not in the EdiTextReader but in the logic of the Deserializer. There are forward lookups taking place, especially to support EdiConditions.
I took a look at your sample models and, although they work for your transmission, there are many things that can be simplified. I will look into it this week if I find some time.
Regards,
C.
@adminnz I ran your data and models through another test with the following:
- 10 million lines file
- 200 MB size
- one Group
- 174549 invoices
The memory consumption was steady and peaked at 620 MB, which is good and suggests there is no memory leak. But the run took about 40 minutes, which is not good. I will try to find some help profiling this, as I am not a profiling expert, and see where the hot parts are (probably the GC).
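Before reaching for a full profiler, a rough way to confirm a GC-bound workload is to compare wall time against allocation volume and per-generation collection counts. The string-concatenation loop below is only a stand-in for the deserialization call; `GC.GetTotalAllocatedBytes` requires .NET Core 3.0 or later.

```csharp
using System;
using System.Diagnostics;

class GcProbe
{
    static void Main()
    {
        int gen0 = GC.CollectionCount(0);
        int gen1 = GC.CollectionCount(1);
        int gen2 = GC.CollectionCount(2);
        long allocated = GC.GetTotalAllocatedBytes(precise: true);
        var sw = Stopwatch.StartNew();

        // Stand-in workload: lots of short-lived strings, the classic
        // pattern behind heavy Gen0 collection counts.
        string s = string.Empty;
        for (int i = 0; i < 10_000; i++)
        {
            s = i.ToString() + ":" + i.ToString();
        }

        sw.Stop();
        Console.WriteLine("Elapsed: {0} ms", sw.ElapsedMilliseconds);
        Console.WriteLine("Allocated: {0:N0} bytes",
            GC.GetTotalAllocatedBytes(precise: true) - allocated);
        Console.WriteLine("Gen0/1/2 collections: {0}/{1}/{2}",
            GC.CollectionCount(0) - gen0,
            GC.CollectionCount(1) - gen1,
            GC.CollectionCount(2) - gen2);
        GC.KeepAlive(s);
    }
}
```

If the allocated-bytes figure dwarfs the file size and Gen0 counts climb into the thousands, GC pressure is the likely culprit and a profiler run can then pinpoint the allocating call sites.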