NCsvPerf
Add quoted values to the test CSV and possibly some larger blocks.
Just a suggestion, but having a column with quoted values and quote escapes would round out the tests. I know it might break someone's code, but I realized when diving in that a 'row' is not a 'line' in the file: a 'row' can span multiple lines if a line break falls inside a quoted value. So adding a newline character inside a quoted value somewhere would be a good addition to the tests.
I might be looking at the wrong CSV for your tests, if so, please close.
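To make the row-vs-line point concrete, here is a minimal sketch (the sample data is invented for illustration) showing how a newline inside a quoted value defeats naive line splitting:

```csharp
using System;

// A row is not a line: a quoted field may contain the row delimiter.
var csv = "id,name,notes\n1,\"Smith, Jane\",\"line one\nline two\"\n2,Doe,plain\n";

// Naive line splitting sees 4 physical lines (ignoring the trailing empty entry)...
int naiveLines = csv.Split('\n').Length - 1;

// ...but a quote-aware scan finds only 3 logical rows (header + 2 records).
int CountRows(string text)
{
    int rows = 0;
    bool inQuotes = false;
    for (int i = 0; i < text.Length; i++)
    {
        if (text[i] == '"') inQuotes = !inQuotes;
        else if (text[i] == '\n' && !inQuotes) rows++;
    }
    return rows;
}

Console.WriteLine($"lines: {naiveLines}, rows: {CountRows(csv)}"); // lines: 4, rows: 3
```

(Doubled-quote escapes toggle the flag twice, so the count stays correct for RFC-style `""` escaping too.)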
You're right: the current tests don't include multi-line data or even quote-escaped fields. I don't remember my original motivation for this (perhaps it was just a coincidence), but I think the likelihood of diverging behavior/support across the various libraries goes up as more esoteric CSV features are used.
Performance aside, it would be cool to test the various libraries' handling and support of some different nasty CSV files. I have been thinking about a big table with "CSV features" as columns and "CSV libraries" as rows, where each cell would be a red x or a green check. A data set we could use is https://github.com/maxogden/csv-spectrum.
Great idea :). If you're interested in adding a benchmark for these additional cases, you're more than welcome to it.
I think it's important to show the performance differences because you cannot simply capture a `Span<char>` of an escaped sequence; you have to reconstruct it instead. You might have one lib that smokes when nothing is escaped but is crushed when dealing with a column that has quotes.
> you cannot simply capture a `Span<char>` of an escaped sequence, you have to reconstruct it as a result
This is true, but one can use the `ArrayPoolBufferWriter<>` class to reconstruct the result and eliminate most, if not all, allocations.
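For a sense of what that split looks like, here is a minimal sketch using only stdlib types, with `ArrayPool<char>` scratch space standing in for the toolkit's `ArrayPoolBufferWriter<char>`; the field contents are invented for illustration:

```csharp
using System;
using System.Buffers;

// Fast path: a field with no escapes is yielded as a zero-copy window into the
// source. Slow path: doubled quotes ("") force a copy into pooled scratch space,
// collapsing each escape pair into a single quote.
ReadOnlySpan<char> UnescapeField(ReadOnlySpan<char> field, char[] scratch)
{
    if (field.IndexOf("\"\"".AsSpan()) < 0)
        return field; // zero-copy fast path

    int written = 0;
    for (int i = 0; i < field.Length; i++)
    {
        scratch[written++] = field[i];
        if (field[i] == '"' && i + 1 < field.Length && field[i + 1] == '"')
            i++; // collapse "" into a single quote
    }
    return scratch.AsSpan(0, written);
}

var pool = ArrayPool<char>.Shared;
char[] scratch = pool.Rent(256);
try
{
    var plain = UnescapeField("no escapes here".AsSpan(), scratch);
    var stitched = UnescapeField("she said \"\"hi\"\"".AsSpan(), scratch);
    Console.WriteLine(plain.ToString());    // no escapes here
    Console.WriteLine(stitched.ToString()); // she said "hi"
}
finally
{
    pool.Return(scratch);
}
```

The scratch buffer is rented once and reused across fields, which is the same "reconstruct without allocating per field" idea.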
Related to this is benchmarking OpenSerDe definitions. I hate Amazon Athena / DynamoDB serialization, but it is what it is.
@Kittoes0124 I'm curious how you are using `ArrayPoolBufferWriter<char>`. I'm not seeing how I would implement it if I'm already using a memory buffer, and I even experimented with replacing `StringBuilder` for merging a string of characters... but `StringBuilder` (if reused) is fast for concatenating chars together.
@electricessence Like so. The idea is that, when possible, we're always yielding a window into the original buffer. When "stitching" is required we write into the `ArrayPoolBufferWriter<>`, yield its buffer, clear it, and then reuse it. My implementation is not strictly compatible with CSVs as defined in the RFC; even so, neither is `string.Split`, and it appears in the benchmarks. `ReadDelimitedAsync` is intended to be used as a building block for handling the concept of delimited files more generally.

As the kids are saying these days it uh... slaps.
```csharp
using System.IO;
using System.IO.Pipelines;
using System.Threading.Tasks;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Engines;

[MemoryDiagnoser]
[SimpleJob(RunStrategy.Throughput)]
public class ReaderBenchmark
{
    const string InputFileName = ".../PackageAssets.csv";

    [Params(4096, 16384, 65536)]
    public int BufferSize { get; set; }

    [Benchmark(Baseline = true)]
    public async Task Simple() {
        // BufferSize = 1 disables FileStream's own buffering; the
        // PipeReader below handles buffering instead.
        using var inputStream = new FileStream(
            options: new FileStreamOptions {
                Access = FileAccess.Read,
                BufferSize = 1,
                Mode = FileMode.Open,
                Options = FileOptions.SequentialScan,
                Share = FileShare.Read,
            },
            path: InputFileName
        );

        // EnumerateAsync/DecodeAsync/ReadDelimited2dAsync are the custom
        // extensions discussed in this thread, not built-in APIs.
        await foreach (var line in PipeReader.Create(inputStream, new StreamPipeReaderOptions(bufferSize: BufferSize))
            .EnumerateAsync()
            .DecodeAsync()
            .ReadDelimited2dAsync(
                xDelimiter: (','),
                yDelimiter: ('\n')
            )
            .WithCancellation(cancellationToken: default)
            .ConfigureAwait(continueOnCapturedContext: false)
        ) {
        }
    }
}
```
Also, adding massive CSVs with degenerate properties might be beneficial. I use the latest version of the NPPES downloadable file to really stress test my stuff because everything about it is unreasonable: it's 8GB unzipped, has 330 fields per row, every single field is quoted, and most of them are sparsely populated. Oh, and it's real-world data, which is a nice bonus!
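If it helps, a throwaway generator can approximate that shape without shipping an 8GB file. The column count, row count, and sparsity below are illustrative stand-ins, not taken from the real NPPES data:

```csharp
using System;
using System.Linq;
using System.Text;

// Builds an NPPES-shaped stress CSV: every field quoted, hundreds of columns
// per row, most fields empty.
static string BuildDegenerateCsv(int rows, int columns)
{
    var sb = new StringBuilder();
    var rng = new Random(42); // fixed seed so runs are reproducible
    for (int r = 0; r < rows; r++)
    {
        for (int c = 0; c < columns; c++)
        {
            if (c > 0) sb.Append(',');
            sb.Append('"');
            if (rng.NextDouble() < 0.1) // ~90% of fields left empty
                sb.Append("value ").Append(r).Append('-').Append(c);
            sb.Append('"');
        }
        sb.Append('\n');
    }
    return sb.ToString();
}

var csv = BuildDegenerateCsv(rows: 3, columns: 330);
Console.WriteLine(csv.Split('\n')[0].Count(ch => ch == ',')); // 329 delimiters per row
```

Scaling the row count up (and streaming to disk instead of a `StringBuilder`) gets you an arbitrarily large file with the same unreasonable properties.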
@Kittoes0124 where's this `.EnumerateAsync()` extension?
Maybe we should continue this discussion somewhere else. @Kittoes0124 Did you write those extensions, or get them from somewhere else? If you wrote them, I'm impressed! I would have written them exactly the same. :) I'm confused as to why they're not built in, or why there isn't a published lib for them. If you are a fan of Channels: https://github.com/Open-NET-Libraries/Open.ChannelExtensions. Maybe there needs to be a similar extension lib for Pipelines.
That said... Pipeline overhead is not zero: https://www.reddit.com/r/dotnet/comments/q0lxxl/benchmark_obsession_pipereader_overhead/
To add insult to injury: the overhead is not zero, and maybe not even great under ideal conditions.
| Method | BufferSize | UseAsync | Mean | Error | StdDev | Median | Ratio | RatioSD |
|---|---|---|---|---|---|---|---|---|
| StreamReader_Read | 4096 | False | 143.19 ms | 2.651 ms | 2.604 ms | 142.73 ms | 1.41 | 0.15 |
| StreamReader_ReadAsync | 4096 | False | 234.40 ms | 4.021 ms | 3.761 ms | 234.32 ms | 2.32 | 0.23 |
| PipeReader_EnumerateAsync | 4096 | False | 252.06 ms | 4.658 ms | 4.575 ms | 252.21 ms | 2.48 | 0.25 |
| FileStream_Read | 4096 | False | 105.96 ms | 3.356 ms | 9.736 ms | 104.35 ms | 1.00 | 0.00 |
| FileStream_ReadAsync | 4096 | False | 156.22 ms | 3.113 ms | 5.769 ms | 156.54 ms | 1.50 | 0.13 |
| StreamReader_Read | 4096 | True | 384.48 ms | 10.109 ms | 29.647 ms | 384.46 ms | 1.01 | 0.10 |
| StreamReader_ReadAsync | 4096 | True | 401.89 ms | 12.897 ms | 37.416 ms | 407.07 ms | 1.05 | 0.11 |
| PipeReader_EnumerateAsync | 4096 | True | 441.99 ms | 8.837 ms | 24.190 ms | 440.23 ms | 1.17 | 0.09 |
| FileStream_Read | 4096 | True | 382.73 ms | 8.951 ms | 26.110 ms | 380.21 ms | 1.00 | 0.00 |
| FileStream_ReadAsync | 4096 | True | 359.61 ms | 7.141 ms | 17.110 ms | 361.55 ms | 0.96 | 0.07 |
| StreamReader_Read | 16384 | False | 58.95 ms | 2.092 ms | 6.136 ms | 58.61 ms | 1.71 | 0.15 |
| StreamReader_ReadAsync | 16384 | False | 93.71 ms | 1.845 ms | 3.971 ms | 94.38 ms | 2.61 | 0.15 |
| PipeReader_EnumerateAsync | 16384 | False | 125.33 ms | 2.882 ms | 8.270 ms | 123.85 ms | 3.62 | 0.33 |
| FileStream_Read | 16384 | False | 34.78 ms | 0.764 ms | 2.180 ms | 34.41 ms | 1.00 | 0.00 |
| FileStream_ReadAsync | 16384 | False | 49.56 ms | 1.313 ms | 3.681 ms | 48.65 ms | 1.43 | 0.12 |
| StreamReader_Read | 16384 | True | 88.48 ms | 1.758 ms | 4.693 ms | 88.05 ms | 1.18 | 0.11 |
| StreamReader_ReadAsync | 16384 | True | 139.60 ms | 2.732 ms | 4.411 ms | 140.68 ms | 1.85 | 0.17 |
| PipeReader_EnumerateAsync | 16384 | True | 171.27 ms | 3.242 ms | 5.763 ms | 172.72 ms | 2.26 | 0.17 |
| FileStream_Read | 16384 | True | 75.93 ms | 2.369 ms | 6.603 ms | 73.44 ms | 1.00 | 0.00 |
| FileStream_ReadAsync | 16384 | True | 101.14 ms | 1.927 ms | 4.106 ms | 102.07 ms | 1.35 | 0.10 |
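For context on what the `FileStream_Read` baseline in the table above is measuring, a minimal sketch (with a `MemoryStream` standing in for the benchmark's file so it runs anywhere):

```csharp
using System;
using System.IO;

// What the raw-read baselines measure: draining a stream through one reusable
// buffer, with no decoding, no line splitting, and no pipeline machinery on top.
static long DrainStream(Stream stream, byte[] buffer)
{
    long total = 0;
    int read;
    while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
        total += read;
    return total;
}

using var stream = new MemoryStream(new byte[100_000]); // stand-in for the CSV file
Console.WriteLine(DrainStream(stream, new byte[4096])); // 100000
```

Everything the other rows add (StreamReader decoding, async state machines, PipeReader bookkeeping) is overhead relative to this loop, which is why the ratios grow as the buffer gets larger and raw I/O gets cheaper.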
> Maybe we should continue this discussion somewhere else. @Kittoes0124 Did you write those extensions or get them from somewhere else? If you wrote them, I'm impressed! I would have written them exactly the same. :) I'm confused why they're not built in! Or why there isn't a published lib for them. If you are a fan of Channels: https://github.com/Open-NET-Libraries/Open.ChannelExtensions Maybe there needs to be a similar extension lib for Pipelines.
>
> That said... Pipeline overhead is not zero: https://www.reddit.com/r/dotnet/comments/q0lxxl/benchmark_obsession_pipereader_overhead/
Thanks mate, that means a lot as I did indeed write them myself! Also feel the same way about their lack of inclusion in the base lib. Will continue our discussion further in your Reddit post.