NCsvPerf
Add quoted values to the test CSV and possibly some larger blocks.
Just a suggestion, but having a column with quoted values and quote escapes would round out the tests. I know it might break someone's code, but I realized when diving in that a 'row' is not a 'line' in the file: a 'row' can span multiple lines if a line break falls inside a quoted value. So adding a newline character inside a quoted value somewhere would be a good addition to the tests.
I might be looking at the wrong CSV for your tests, if so, please close.
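To make the row-vs-line point concrete, here is a minimal sketch (the sample data is invented for illustration) showing how a newline inside a quoted value defeats naive line splitting:

```csharp
using System;

// A row is not a line: a quoted field may contain the row delimiter.
var csv = "id,name,notes\n1,\"Smith, Jane\",\"line one\nline two\"\n2,Doe,plain\n";

// Naive line splitting sees 4 physical lines (ignoring the trailing empty entry)...
int naiveLines = csv.Split('\n').Length - 1;

// ...but a quote-aware scan finds only 3 logical rows (header + 2 records).
int CountRows(string text)
{
    int rows = 0;
    bool inQuotes = false;
    for (int i = 0; i < text.Length; i++)
    {
        if (text[i] == '"') inQuotes = !inQuotes;
        else if (text[i] == '\n' && !inQuotes) rows++;
    }
    return rows;
}

Console.WriteLine($"lines: {naiveLines}, rows: {CountRows(csv)}"); // lines: 4, rows: 3
```

(Doubled-quote escapes toggle the flag twice, so the count stays correct for RFC-style `""` escaping too.)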
You're right: the current tests don't include multi-line data or even quote-escaped fields. I don't remember my original motivation for this (perhaps it was just a coincidence), but I think the likelihood of diverging behavior/support across the various libraries goes up as more esoteric CSV features are used.
Performance aside, it would be cool to test the various libraries' handling and support of some different nasty CSV files. I have been thinking about a big table with "CSV features" as columns and "CSV libraries" as rows, where each cell would be a red x or a green check. A data set we could use is https://github.com/maxogden/csv-spectrum.
Great idea :). If you're interested in adding a benchmark for these additional cases, you're more than welcome to it.
I think it's important to show the performance differences because you cannot simply capture a `Span<char>` of an escaped sequence; you have to reconstruct it instead. You might have one lib that smokes when nothing is escaped but is crushed when dealing with a column that has quotes.
> you cannot simply capture a `Span<char>` of an escaped sequence, you have to reconstruct it as a result
This is true, but one can use the `ArrayPoolBufferWriter<>` class to reconstruct the result and eliminate most, if not all, allocations.
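For a sense of what that split looks like, here is a minimal sketch using only stdlib types, with `ArrayPool<char>` scratch space standing in for the toolkit's `ArrayPoolBufferWriter<char>`; the field contents are invented for illustration:

```csharp
using System;
using System.Buffers;

// Fast path: a field with no escapes is yielded as a zero-copy window into the
// source. Slow path: doubled quotes ("") force a copy into pooled scratch space,
// collapsing each escape pair into a single quote.
ReadOnlySpan<char> UnescapeField(ReadOnlySpan<char> field, char[] scratch)
{
    if (field.IndexOf("\"\"".AsSpan()) < 0)
        return field; // zero-copy fast path

    int written = 0;
    for (int i = 0; i < field.Length; i++)
    {
        scratch[written++] = field[i];
        if (field[i] == '"' && i + 1 < field.Length && field[i + 1] == '"')
            i++; // collapse "" into a single quote
    }
    return scratch.AsSpan(0, written);
}

var pool = ArrayPool<char>.Shared;
char[] scratch = pool.Rent(256);
try
{
    var plain = UnescapeField("no escapes here".AsSpan(), scratch);
    var stitched = UnescapeField("she said \"\"hi\"\"".AsSpan(), scratch);
    Console.WriteLine(plain.ToString());    // no escapes here
    Console.WriteLine(stitched.ToString()); // she said "hi"
}
finally
{
    pool.Return(scratch);
}
```

The scratch buffer is rented once and reused across fields, which is the same "reconstruct without allocating per field" idea.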
Related to this is benchmarking OpenSerDe definitions. I hate Amazon Athena / DynamoDB serialization, but it is what it is.
@Kittoes0124 I'm curious how you are using `ArrayPoolBufferWriter<char>`. I'm not seeing how I would implement it if I'm already using a memory buffer, and I even experimented with replacing `StringBuilder` for merging a string of characters... but `StringBuilder` (if reused) is fast for concatenating chars together.
@electricessence Like so. The idea is that, when possible, we're always yielding a window into the original buffer. When "stitching" is required we write into the `ArrayPoolBufferWriter<>`, yield its buffer, clear it, and then reuse it. My implementation is not strictly compatible with CSVs as defined in the RFC; even so, neither is `string.Split`, and it appears in the benchmarks. `ReadDelimitedAsync` is intended to be used as a building block for handling the concept of delimited files more generally.

As the kids are saying these days it uh... slaps.
```csharp
using System.IO;
using System.IO.Pipelines;
using System.Threading.Tasks;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Engines;

[MemoryDiagnoser]
[SimpleJob(RunStrategy.Throughput)]
public class ReaderBenchmark
{
    const string InputFileName = ".../PackageAssets.csv";

    [Params(4096, 16384, 65536)]
    public int BufferSize { get; set; }

    [Benchmark(Baseline = true)]
    public async Task Simple() {
        // BufferSize = 1 disables FileStream's own buffering; the
        // PipeReader below handles buffering instead.
        using var inputStream = new FileStream(
            options: new FileStreamOptions {
                Access = FileAccess.Read,
                BufferSize = 1,
                Mode = FileMode.Open,
                Options = FileOptions.SequentialScan,
                Share = FileShare.Read,
            },
            path: InputFileName
        );

        // EnumerateAsync/DecodeAsync/ReadDelimited2dAsync are the custom
        // extensions discussed in this thread, not built-in APIs.
        await foreach (var line in PipeReader.Create(inputStream, new StreamPipeReaderOptions(bufferSize: BufferSize))
            .EnumerateAsync()
            .DecodeAsync()
            .ReadDelimited2dAsync(
                xDelimiter: (','),
                yDelimiter: ('\n')
            )
            .WithCancellation(cancellationToken: default)
            .ConfigureAwait(continueOnCapturedContext: false)
        ) {
        }
    }
}
```
Also, adding massive CSVs with degenerate properties might be beneficial. I use the latest version of the NPPES downloadable file to really stress test my stuff because everything about it is unreasonable: it's 8GB unzipped, has 330 fields per row, every single field is quoted, and most of them are sparsely populated. Oh, and it's real-world data, which is a nice bonus!
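If it helps, a throwaway generator can approximate that shape without shipping an 8GB file. The column count, row count, and sparsity below are illustrative stand-ins, not taken from the real NPPES data:

```csharp
using System;
using System.Linq;
using System.Text;

// Builds an NPPES-shaped stress CSV: every field quoted, hundreds of columns
// per row, most fields empty.
static string BuildDegenerateCsv(int rows, int columns)
{
    var sb = new StringBuilder();
    var rng = new Random(42); // fixed seed so runs are reproducible
    for (int r = 0; r < rows; r++)
    {
        for (int c = 0; c < columns; c++)
        {
            if (c > 0) sb.Append(',');
            sb.Append('"');
            if (rng.NextDouble() < 0.1) // ~90% of fields left empty
                sb.Append("value ").Append(r).Append('-').Append(c);
            sb.Append('"');
        }
        sb.Append('\n');
    }
    return sb.ToString();
}

var csv = BuildDegenerateCsv(rows: 3, columns: 330);
Console.WriteLine(csv.Split('\n')[0].Count(ch => ch == ',')); // 329 delimiters per row
```

Scaling the row count up (and streaming to disk instead of a `StringBuilder`) gets you an arbitrarily large file with the same unreasonable properties.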
@Kittoes0124 where's this `.EnumerateAsync()` extension?
Maybe we should continue this discussion somewhere else. @Kittoes0124 Did you write those extensions, or get them from somewhere else? If you wrote them, I'm impressed! I would have written them exactly the same. :) I'm confused as to why they're not built in, or why there isn't a published lib for them. If you are a fan of Channels: https://github.com/Open-NET-Libraries/Open.ChannelExtensions. Maybe there needs to be a similar extension lib for Pipelines.
That said... Pipeline overhead is not zero: https://www.reddit.com/r/dotnet/comments/q0lxxl/benchmark_obsession_pipereader_overhead/
To add insult to injury: the overhead is not zero, and maybe not even great under ideal conditions.
| Method | BufferSize | UseAsync | Mean | Error | StdDev | Median | Ratio | RatioSD |
|---|---|---|---|---|---|---|---|---|
| StreamReader_Read | 4096 | False | 143.19 ms | 2.651 ms | 2.604 ms | 142.73 ms | 1.41 | 0.15 |
| StreamReader_ReadAsync | 4096 | False | 234.40 ms | 4.021 ms | 3.761 ms | 234.32 ms | 2.32 | 0.23 |
| PipeReader_EnumerateAsync | 4096 | False | 252.06 ms | 4.658 ms | 4.575 ms | 252.21 ms | 2.48 | 0.25 |
| FileStream_Read | 4096 | False | 105.96 ms | 3.356 ms | 9.736 ms | 104.35 ms | 1.00 | 0.00 |
| FileStream_ReadAsync | 4096 | False | 156.22 ms | 3.113 ms | 5.769 ms | 156.54 ms | 1.50 | 0.13 |
| StreamReader_Read | 4096 | True | 384.48 ms | 10.109 ms | 29.647 ms | 384.46 ms | 1.01 | 0.10 |
| StreamReader_ReadAsync | 4096 | True | 401.89 ms | 12.897 ms | 37.416 ms | 407.07 ms | 1.05 | 0.11 |
| PipeReader_EnumerateAsync | 4096 | True | 441.99 ms | 8.837 ms | 24.190 ms | 440.23 ms | 1.17 | 0.09 |
| FileStream_Read | 4096 | True | 382.73 ms | 8.951 ms | 26.110 ms | 380.21 ms | 1.00 | 0.00 |
| FileStream_ReadAsync | 4096 | True | 359.61 ms | 7.141 ms | 17.110 ms | 361.55 ms | 0.96 | 0.07 |
| StreamReader_Read | 16384 | False | 58.95 ms | 2.092 ms | 6.136 ms | 58.61 ms | 1.71 | 0.15 |
| StreamReader_ReadAsync | 16384 | False | 93.71 ms | 1.845 ms | 3.971 ms | 94.38 ms | 2.61 | 0.15 |
| PipeReader_EnumerateAsync | 16384 | False | 125.33 ms | 2.882 ms | 8.270 ms | 123.85 ms | 3.62 | 0.33 |
| FileStream_Read | 16384 | False | 34.78 ms | 0.764 ms | 2.180 ms | 34.41 ms | 1.00 | 0.00 |
| FileStream_ReadAsync | 16384 | False | 49.56 ms | 1.313 ms | 3.681 ms | 48.65 ms | 1.43 | 0.12 |
| StreamReader_Read | 16384 | True | 88.48 ms | 1.758 ms | 4.693 ms | 88.05 ms | 1.18 | 0.11 |
| StreamReader_ReadAsync | 16384 | True | 139.60 ms | 2.732 ms | 4.411 ms | 140.68 ms | 1.85 | 0.17 |
| PipeReader_EnumerateAsync | 16384 | True | 171.27 ms | 3.242 ms | 5.763 ms | 172.72 ms | 2.26 | 0.17 |
| FileStream_Read | 16384 | True | 75.93 ms | 2.369 ms | 6.603 ms | 73.44 ms | 1.00 | 0.00 |
| FileStream_ReadAsync | 16384 | True | 101.14 ms | 1.927 ms | 4.106 ms | 102.07 ms | 1.35 | 0.10 |
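For context on what the `FileStream_Read` baseline in the table above is measuring, a minimal sketch (with a `MemoryStream` standing in for the benchmark's file so it runs anywhere):

```csharp
using System;
using System.IO;

// What the raw-read baselines measure: draining a stream through one reusable
// buffer, with no decoding, no line splitting, and no pipeline machinery on top.
static long DrainStream(Stream stream, byte[] buffer)
{
    long total = 0;
    int read;
    while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
        total += read;
    return total;
}

using var stream = new MemoryStream(new byte[100_000]); // stand-in for the CSV file
Console.WriteLine(DrainStream(stream, new byte[4096])); // 100000
```

Everything the other rows add (StreamReader decoding, async state machines, PipeReader bookkeeping) is overhead relative to this loop, which is why the ratios grow as the buffer gets larger and raw I/O gets cheaper.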
> Maybe we should continue this discussion somewhere else. @Kittoes0124 Did you write those extensions or get them from somewhere else? If you wrote them, I'm impressed! I would have written them exactly the same. :) I'm confused why they're not built in! Or why there isn't a published lib for them. If you are a fan of Channels: https://github.com/Open-NET-Libraries/Open.ChannelExtensions Maybe there needs to be a similar extension lib for Pipelines.
>
> That said... Pipeline overhead is not zero: https://www.reddit.com/r/dotnet/comments/q0lxxl/benchmark_obsession_pipereader_overhead/
Thanks mate, that means a lot as I did indeed write them myself! Also feel the same way about their lack of inclusion in the base lib. Will continue our discussion further in your Reddit post.