EDI.Net
Performance & memory usage
I'm having to parse 500+ MB EDIFACT INVOIC files, but I'm running into issues even with small files that are only 10 MB. I see the sample files are all in the 1-5 KB range (I unfortunately cannot include a sample file, as it contains sensitive information).
I would like to know whether you have had any success with this library on files of this size.
This is interesting. I have used files of about 5-15 MB for my own clients in the past. I think it will be easy to reproduce and find the bottleneck if we create a test for this by copy-pasting a loop into one of my existing tests. One last question: which framework and version are you using (net45, net47, netcoreapp2.1, netcoreapp2.2, netcoreapp3.1), x32 or x64?
One more thing: are you experiencing a `System.OutOfMemoryException` (you can check your EventLog for that)? What exactly is the issue?
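To make the two questions above concrete, here is a minimal sketch of how one could check process bitness, catch an out-of-memory condition, and report the peak working set around a workload. The allocation loop is only a stand-in for the deserialization call; nothing here is EDI.Net-specific.

```csharp
using System;
using System.Diagnostics;

class MemoryCheck
{
    static void Main()
    {
        // A 32-bit process caps out around 2-3 GB of address space,
        // which is why x32 vs x64 matters for large files.
        Console.WriteLine("64-bit process: {0}", Environment.Is64BitProcess);

        try
        {
            // Stand-in workload: replace with the actual deserialization.
            for (int i = 0; i < 1000; i++)
            {
                var tmp = new byte[64 * 1024];
                GC.KeepAlive(tmp);
            }
        }
        catch (OutOfMemoryException ex)
        {
            Console.WriteLine("Out of memory: {0}", ex.Message);
            return;
        }

        using (var proc = Process.GetCurrentProcess())
        {
            Console.WriteLine("Peak working set: {0:N0} bytes", proc.PeakWorkingSet64);
        }
    }
}
```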
I just tested on an average machine on .NET Core 3.1 and it took about 3 minutes to load & deserialize a file of:
- about 22 Megs
- a million rows
- 78880 Messages in the transaction
Memory consumption peaked at around 150 MB. In my simple test I did not notice any unusual memory behavior or any indication of a memory leak.
To help further, you need to provide us with a sample transmission containing only a couple of messages (whose content you could obfuscate by hand, of course) along with your model, and I will bloat it into a huge file and see where the problem is.
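Bloating a small sample into a stress-test file can be sketched as below: take the segments between the `UNB` header and `UNZ` trailer and repeat them. The sample interchange string, the output filename, and the segment terminator `'` are all illustrative assumptions; a real generator would also renumber each message's `UNH` reference.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text;

class Bloat
{
    static void Main()
    {
        const int copies = 3; // bump to thousands for a real stress file
        var sample = "UNB+UNOA:1+SENDER+RECEIVER+200101:0000+1'" +
                     "UNH+1+INVOIC:D:96A:UN'BGM+380+42'UNT+3+1'" +
                     "UNZ+1+1'";

        // Split on the segment terminator (assumes no release characters
        // in the envelope) and re-append the terminator to each segment.
        var segments = sample.Split('\'')
                             .Where(s => s.Length > 0)
                             .Select(s => s + "'")
                             .ToList();

        var header  = segments.First();                             // UNB
        var message = segments.Skip(1).Take(segments.Count - 2);    // UNH..UNT

        var sb = new StringBuilder(header);
        for (int i = 0; i < copies; i++)
        {
            foreach (var seg in message) sb.Append(seg);
        }
        // The first UNZ element is the message count; rebuild it to match.
        sb.AppendFormat("UNZ+{0}+1'", copies);

        File.WriteAllText("bloated.edi", sb.ToString());
        Console.WriteLine("Wrote {0:N0} characters", sb.Length);
    }
}
```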
Thanks,
C.
Maybe my models are not defined correctly? I've attached a sample project that contains the models (interchange.cs) and a small sample EDI file.
Note that the actual files I'm processing have ratios as follows (names match the model classes): 1 Interchange containing 1 Group; the Group containing 5,000 Messages; each Message containing 1 Invoice; each Invoice containing 500,000 Line Items.
Another point related to this: you said it took 3 minutes to load & deserialize a 22 MB file. That sounds rather slow, considering that with some basic code it's possible to iterate over every Segment/Element/Value in a 300+ MB file in less than 20 seconds. Obviously there is more to it than just reading the file, but I would not expect the mapping to have such a massive impact on processing.
Below is some sample code that reads each part of an EDIFACT file using the `System.Buffers` & `System.IO.Pipelines` features of .NET:
```csharp
using System;
using System.Buffers;
using System.IO;
using System.IO.Pipelines;
using System.Text;
using System.Threading.Tasks;

class Program
{
    // EDIFACT default service characters (a UNA segment can override these).
    const byte SegmentTerminator = (byte)'\'';
    const byte DataElementSeparator = (byte)'+';
    const byte ComponentDataElementSeparator = (byte)':';
    const byte ReleaseCharacter = (byte)'?';

    static async Task Main(string[] args)
    {
        if (args.Length < 1)
        {
            args = new[] { Console.ReadLine() };
        }

        int segmentCount = 0;
        var stopwatch = System.Diagnostics.Stopwatch.StartNew();
        var encoding = Encoding.UTF8;

        using (var filestream = new FileStream(args[0], FileMode.Open, FileAccess.Read))
        {
            var reader = PipeReader.Create(filestream);
            while (true)
            {
                var result = await reader.ReadAsync();
                var buffer = result.Buffer;

                while (TryReadSegment(ref buffer, out var segment))
                {
                    segmentCount++;
                    // To print the segment:
                    // Console.WriteLine(encoding.GetString(segment.ToArray()));
                    while (TryReadComponent(ref segment, out var component))
                    {
                        while (TryReadElement(ref component, out var element))
                        {
                        }
                    }
                }

                // Everything before buffer.Start is consumed; the rest is examined.
                reader.AdvanceTo(buffer.Start, buffer.End);
                if (result.IsCompleted) break;
            }
            await reader.CompleteAsync();
            stopwatch.Stop();
        }

        Console.WriteLine("Segments: {0:N0}", segmentCount);
        Console.WriteLine("Duration: {0}ms", stopwatch.ElapsedMilliseconds);
        Console.ReadKey();
    }

    // Slices off the next segment, honoring '?' as the release (escape) character.
    static bool TryReadSegment(ref ReadOnlySequence<byte> buffer, out ReadOnlySequence<byte> segment)
    {
        if (buffer.Length == 0)
        {
            segment = ReadOnlySequence<byte>.Empty;
            return false;
        }

        var reader = new SequenceReader<byte>(buffer);
        if (reader.TryReadTo(out segment, SegmentTerminator, delimiterEscape: ReleaseCharacter, advancePastDelimiter: true))
        {
            buffer = buffer.Slice(segment.Length + 1); // +1 skips the terminator
            return true;
        }
        return false; // incomplete segment; wait for more data
    }

    // Splits a segment on '+' (the data element separator); the last piece
    // has no trailing separator, so a failed TryReadTo still yields it.
    static bool TryReadComponent(ref ReadOnlySequence<byte> segment, out ReadOnlySequence<byte> component)
    {
        if (segment.Length == 0)
        {
            component = ReadOnlySequence<byte>.Empty;
            return false;
        }

        var reader = new SequenceReader<byte>(segment);
        if (!reader.TryReadTo(out component, DataElementSeparator, delimiterEscape: ReleaseCharacter, advancePastDelimiter: true))
        {
            component = segment.Slice(0);
            segment = segment.Slice(component.Length);
        }
        else
        {
            segment = segment.Slice(component.Length + 1);
        }
        return true;
    }

    // Splits a data element on ':' (the component data element separator).
    static bool TryReadElement(ref ReadOnlySequence<byte> component, out ReadOnlySequence<byte> element)
    {
        if (component.Length == 0)
        {
            element = ReadOnlySequence<byte>.Empty;
            return false;
        }

        var reader = new SequenceReader<byte>(component);
        if (!reader.TryReadTo(out element, ComponentDataElementSeparator, delimiterEscape: ReleaseCharacter, advancePastDelimiter: true))
        {
            element = component.Slice(0);
            component = component.Slice(element.Length);
        }
        else
        {
            component = component.Slice(element.Length + 1);
        }
        return true;
    }
}
```
@adminnz thanks for this! Feel free to add an additional EdiTextReader via pull request. That is what I was planning to do sometime in the future. I should create a specific version for netcoreapp2.0 or netstandard2.1, though.
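For the multi-targeting mentioned above, a hypothetical project-file fragment could look like the following. `SequenceReader<T>` and the in-box pipeline types require netstandard2.1 (or netcoreapp3.0+), so the pipeline-based reader would be compiled only for that target; the package version shown is illustrative.

```xml
<Project Sdk="Microsoft.NET.Sdk">
  <PropertyGroup>
    <TargetFrameworks>netstandard2.0;netstandard2.1</TargetFrameworks>
  </PropertyGroup>
  <!-- SequenceReader<T> is unavailable before netstandard2.1, so the
       fast reader would sit behind a TargetFramework condition. -->
  <ItemGroup Condition="'$(TargetFramework)' == 'netstandard2.1'">
    <PackageReference Include="System.IO.Pipelines" Version="4.7.0" />
  </ItemGroup>
</Project>
```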
That said, I am sure there is much room for improvement, but I suspect the bottleneck is not in the EdiTextReader but in the logic of the Deserializer. There are forward lookups taking place, especially to support EdiConditions.
I took a look at your sample models and, although they work for your transmission, there are many things that can be simplified. I will look into it this week if I find some time.
Regards,
C.
@adminnz I ran your data and models through another test with the following:
- 10 million lines file
- 200 MB size
- one Group
- 174549 invoices
The memory consumption was steady and peaked at 620 MB, which is good and suggests there is no memory leak. But the run took about 40 minutes, which is not good. I will try to find some help profiling this, as I am not a profiling expert, and see where the hot parts are (probably the GC).
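Before reaching for a full profiler, a rough way to confirm a GC-bound workload is to compare wall time against allocation volume and per-generation collection counts. The string-concatenation loop below is only a stand-in for the deserialization call; `GC.GetTotalAllocatedBytes` requires .NET Core 3.0 or later.

```csharp
using System;
using System.Diagnostics;

class GcProbe
{
    static void Main()
    {
        int gen0 = GC.CollectionCount(0);
        int gen1 = GC.CollectionCount(1);
        int gen2 = GC.CollectionCount(2);
        long allocated = GC.GetTotalAllocatedBytes(precise: true);
        var sw = Stopwatch.StartNew();

        // Stand-in workload: lots of short-lived strings, the classic
        // pattern behind heavy Gen0 collection counts.
        string s = string.Empty;
        for (int i = 0; i < 10_000; i++)
        {
            s = i.ToString() + ":" + i.ToString();
        }

        sw.Stop();
        Console.WriteLine("Elapsed: {0} ms", sw.ElapsedMilliseconds);
        Console.WriteLine("Allocated: {0:N0} bytes",
            GC.GetTotalAllocatedBytes(precise: true) - allocated);
        Console.WriteLine("Gen0/1/2 collections: {0}/{1}/{2}",
            GC.CollectionCount(0) - gen0,
            GC.CollectionCount(1) - gen1,
            GC.CollectionCount(2) - gen2);
        GC.KeepAlive(s);
    }
}
```

If the allocated-bytes figure dwarfs the file size and Gen0 counts climb into the thousands, GC pressure is the likely culprit and a profiler run can then pinpoint the allocating call sites.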