
Use pipelines?

Open ajcvickers opened this issue 3 years ago • 19 comments

There has been some discussion about whether use of data pipelines is the way to go for highly async, highly performant binary communication such as Tabular Data Stream (TDS) to and from SQL Server.

@davidfowl Thoughts? (I believe you discussed this with @roji already.)

@JamesNK We were wondering what gRPC uses?

ajcvickers avatar Apr 23 '21 15:04 ajcvickers

Yes 😄

davidfowl avatar Apr 23 '21 15:04 davidfowl

Yes.

Wraith2 avatar Apr 23 '21 15:04 Wraith2

I really hope and wish this can come true 👍🏻

In a production app (closed source for work) I see huge potential to further improve things just with such a change*. Further, a lot of boxing (I need to store and read lots of doubles) could go away by writing them to the pipe directly.

* I almost started to prototype such a change, but didn't have the time for it.

gfoidl avatar Apr 23 '21 15:04 gfoidl
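
A minimal sketch of the "write doubles to the pipe directly" idea, assuming System.IO.Pipelines; the helper name is hypothetical. The bytes go straight into the pipe's pooled buffer, so no boxing is involved:

    using System;
    using System.Buffers.Binary;
    using System.IO.Pipelines;

    static void WriteDouble(PipeWriter writer, double value)
    {
        // Serialize the 8 bytes directly into the pipe's pooled buffer:
        // no object[] parameter list, no boxed double.
        Span<byte> span = writer.GetSpan(sizeof(double));
        BinaryPrimitives.WriteDoubleLittleEndian(span, value);
        writer.Advance(sizeof(double));
    }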

Further, a lot of boxing (I need to store and read lots of doubles) could go away by writing them to the pipe directly.

I got rid of a lot of boxing of primitives in M.D.SqlClient, even for guids. Unless you're passing them through parameters they shouldn't be too bad. If you can provide a test repro that shows where they're boxing, open an issue on M.D.SqlClient and I'll take a look.

Wraith2 avatar Apr 23 '21 16:04 Wraith2

gRPC uses pipelines in the ASP.NET Core server and Stream in the client.

JamesNK avatar Apr 23 '21 19:04 JamesNK

@JamesNK am curious, was there a particular reason not to use pipelines in the client too?

roji avatar Apr 24 '21 19:04 roji

HttpClient uses Stream in its APIs. Pipelines would be another layer of abstraction that wouldn't add any value.

One of the nice things about pipelines is it manages buffer pooling for you. gRPC messages are prefixed by length so it isn't difficult to rent a buffer of the right size yourself.

JamesNK avatar Apr 24 '21 20:04 JamesNK
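
A minimal sketch of that rent-by-length approach, assuming a plain 4-byte big-endian length prefix; readExactAsync and parseMessage are hypothetical stand-ins for the transport read and the message parser:

    using System;
    using System.Buffers;
    using System.Buffers.Binary;
    using System.Threading.Tasks;

    static async ValueTask ProcessMessageAsync(
        Func<Memory<byte>, ValueTask> readExactAsync,  // hypothetical: fills the given buffer from the transport
        Action<ReadOnlyMemory<byte>> parseMessage)     // hypothetical: consumes one complete message
    {
        byte[] prefix = new byte[4];
        await readExactAsync(prefix);
        int length = BinaryPrimitives.ReadInt32BigEndian(prefix);

        // Rent may return a larger array; only the first `length` bytes are valid.
        byte[] buffer = ArrayPool<byte>.Shared.Rent(length);
        try
        {
            await readExactAsync(buffer.AsMemory(0, length));
            parseMessage(buffer.AsMemory(0, length));
        }
        finally
        {
            ArrayPool<byte>.Shared.Return(buffer);
        }
    }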

Do we rent the exact length? That might not always be the best strategy...

davidfowl avatar Apr 24 '21 21:04 davidfowl

Currently yes. Agree that won't be best for big messages. Might look at using an IBufferWriter<byte> implementation that splits big messages into buffer segments.

JamesNK avatar Apr 24 '21 21:04 JamesNK
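
A sketch of that splitting idea, under assumptions: an IBufferWriter<byte> that rents fixed-size blocks from ArrayPool instead of one oversized array. Reassembling the segments into a ReadOnlySequence<byte> is omitted, and the type is illustrative rather than gRPC's actual implementation:

    using System;
    using System.Buffers;
    using System.Collections.Generic;

    sealed class SegmentedBufferWriter : IBufferWriter<byte>, IDisposable
    {
        private const int SegmentSize = 4096;
        private readonly List<byte[]> _segments = new();
        private byte[] _current = Array.Empty<byte>();
        private int _position;

        public void Advance(int count) => _position += count;

        public Memory<byte> GetMemory(int sizeHint = 0)
        {
            int needed = Math.Max(sizeHint, 1);
            if (_current.Length - _position < needed)
            {
                // Current block is exhausted (or too small): rent a fresh one.
                _current = ArrayPool<byte>.Shared.Rent(Math.Max(needed, SegmentSize));
                _segments.Add(_current);
                _position = 0;
            }
            return _current.AsMemory(_position);
        }

        public Span<byte> GetSpan(int sizeHint = 0) => GetMemory(sizeHint).Span;

        public void Dispose()
        {
            // Return every rented block to the pool.
            foreach (byte[] segment in _segments)
                ArrayPool<byte>.Shared.Return(segment);
            _segments.Clear();
        }
    }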

SqlClient uses SslStream and in general uses Stream as an abstraction because the supported transports are TCP and NamedPipes, both of which have stream abstractions but nothing lower.

If we decide not to support NamedPipes, so that we can go down as far as sockets, we'll need to find a way to support SSL at the socket level (I haven't seen a way to do that).

Even having Pipelines wrapping streams will have code-level benefits even if it doesn't allow perfect array pooling. The TDS protocol has 3 layers to each packet and we might not want to deal with all the complexity at every level. For example, a MARS packet starts with an SMUX header, then it has the packet header, and then it has the payload. If you're reading an int from payload bytes you'd really want to ask a payload data pipeline for 4 bytes and not care whether they cross packet boundaries or not. To do that we'd have to hide the SMUX and packet headers by passing a memory segment up to the higher-level pipe and then consume it from the lower pipe when the upper consumes it all.

Wraith2 avatar Apr 24 '21 21:04 Wraith2
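
For the layering described above, a hedged sketch (assuming MARS is off, so no SMUX header, and the standard 8-byte TDS packet header with a big-endian length at offset 2): a relay reads whole packets from the transport PipeReader and copies only the payload into a second Pipe, so the upper layer reads values without ever seeing packet boundaries. A production design might hand the lower pipe's memory segments upward instead of copying:

    using System;
    using System.Buffers;
    using System.Buffers.Binary;
    using System.IO.Pipelines;
    using System.Threading.Tasks;

    static async Task RelayPayloadAsync(PipeReader transport, PipeWriter payload)
    {
        const int HeaderSize = 8; // TDS packet header; SMUX would add another layer
        while (true)
        {
            ReadResult result = await transport.ReadAsync();
            ReadOnlySequence<byte> buffer = result.Buffer;

            // Peel complete packets off the front of the buffered bytes.
            while (TryReadPacketLength(buffer, out int packetLength)
                   && buffer.Length >= packetLength)
            {
                ReadOnlySequence<byte> body = buffer.Slice(HeaderSize, packetLength - HeaderSize);
                foreach (ReadOnlyMemory<byte> chunk in body)
                    await payload.WriteAsync(chunk);
                buffer = buffer.Slice(packetLength);
            }

            // Consume the packets we relayed; mark the remainder examined.
            transport.AdvanceTo(buffer.Start, buffer.End);
            if (result.IsCompleted) break;
        }
        await payload.CompleteAsync();
    }

    static bool TryReadPacketLength(ReadOnlySequence<byte> buffer, out int length)
    {
        length = 0;
        if (buffer.Length < 4) return false;
        // Bytes 2-3 of the TDS packet header hold the whole-packet length, big-endian.
        Span<byte> header = stackalloc byte[4];
        buffer.Slice(0, 4).CopyTo(header);
        length = BinaryPrimitives.ReadUInt16BigEndian(header.Slice(2));
        return true;
    }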

SqlClient uses SslStream and in general uses Stream as an abstraction because the supported transports are TCP and NamedPipes, both of which have stream abstractions but nothing lower.

If we decide not to support NamedPipes, so that we can go down as far as sockets, we'll need to find a way to support SSL at the socket level (I haven't seen a way to do that).

Both sockets and named pipes are eventually accessed via Stream implementations (over which SslStream can be layered), so I'm not sure why supporting named pipes or not is important here.

Even having Pipelines wrapping streams will have code-level benefits even if it doesn't allow perfect array pooling. The TDS protocol has 3 layers to each packet and we might not want to deal with all the complexity at every level.

I don't know anything about TDS/SMUX or anything like that, but when looking some time ago, I indeed got the impression that Pipelines are especially useful when the message length isn't known in advance (e.g. when reading HTTP data, which is newline-terminated rather than length-prefixed). I might be totally wrong about that - @davidfowl is the guy who knows. I'm not sure the multiple layering of protocols within TDS is important here - if at some level there's the notion of a length-prefixed packet or frame, then at that level it can be read and buffered (if needed), even if other protocols are nested within it.

I guess I'm asking whether we're sure that the benefits of Pipelines in the particular case of TDS are worth what we'd potentially pay for the additional I/O layer (including locking, IIRC). Again, @davidfowl is the guy :)

roji avatar Apr 25 '21 13:04 roji
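
On the open SSL question: SslStream only requires an inner Stream, and NetworkStream provides one over a raw Socket, so a sockets-based transport can keep TLS whether or not named pipes stay. A minimal client-side sketch:

    using System.IO;
    using System.Net.Security;
    using System.Net.Sockets;
    using System.Threading.Tasks;

    static async Task<Stream> WrapSocketWithTlsAsync(Socket connectedSocket, string host)
    {
        // SslStream layers over any Stream; NetworkStream adapts the raw socket.
        var ssl = new SslStream(new NetworkStream(connectedSocket, ownsSocket: true));
        await ssl.AuthenticateAsClientAsync(host);
        return ssl;
    }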

There are complications, but overall the benefits of pipelines are well worth any effort we have to put into using them. I think they fit well with the way we want to think about and work with the TDS data. We don't want to think in packets; we want to abstract that away and focus on what we're doing with the contents.

I've written some code with pipelines to parse a text-based network protocol and it's low-level work, but well worth it for the perf benefits. In fact, that project was what started me working on SqlClient: I wrote a really high-performance implementation of the parsing and object layers and then found that the moment I started trying to access the database it absolutely destroyed perf, and I thought I might be able to fix that. Still working on it 😀

Wraith2 avatar Apr 25 '21 14:04 Wraith2

We don't want to think in packets; we want to abstract that away and focus on what we're doing with the contents.

If you're referring to TCP packets, then that's already abstracted away at a lower level by the OS - the Stream abstraction allows you to be oblivious of packets and treat everything as a linear stream of bytes. If you're referring to packets in TDS (or one of its subprotocols or something), then that's something that's going to have to happen in a TDS driver regardless, no? The only question is whether Pipelines is the technology you want to use or not.

I've written some code with pipelines to parse a text-based network protocol and it's low-level work, but well worth it for the perf benefits.

Pipelines definitely does seem to shine for parsing text-based protocols; but here we're discussing a length-prefixed one, which is why I'm raising the question.

roji avatar Apr 25 '21 14:04 roji

Not quite. @Wraith2 hit the nail on the head with the use case for pipelines and this is the ideal solution, length prefixed or not (HTTP/2 isn't text based). See github.com/davidfowl/BedrockFramework/ if you want to see more scenarios.

davidfowl avatar Apr 25 '21 15:04 davidfowl
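
For reference, the canonical PipeReader loop for a length-prefixed frame (a sketch assuming a 4-byte big-endian prefix): even with the length known up front, the pipe still buffers partial frames, tracks consumed-versus-examined positions, and pools memory on the caller's behalf:

    using System;
    using System.Buffers;
    using System.IO.Pipelines;
    using System.Threading.Tasks;

    static async ValueTask<byte[]?> ReadFrameAsync(PipeReader reader)
    {
        while (true)
        {
            ReadResult result = await reader.ReadAsync();
            ReadOnlySequence<byte> buffer = result.Buffer;

            if (TryParseFrame(ref buffer, out byte[]? frame))
            {
                // Consume the frame; leave any following bytes unexamined.
                reader.AdvanceTo(buffer.Start);
                return frame;
            }

            // Not enough data yet: consume nothing, mark everything examined
            // so the next ReadAsync waits for more bytes to arrive.
            reader.AdvanceTo(buffer.Start, buffer.End);
            if (result.IsCompleted) return null;
        }
    }

    static bool TryParseFrame(ref ReadOnlySequence<byte> buffer, out byte[]? frame)
    {
        frame = null;
        var seq = new SequenceReader<byte>(buffer);
        if (!seq.TryReadBigEndian(out int length)) return false; // prefix incomplete
        if (seq.Remaining < length) return false;                // frame incomplete

        frame = buffer.Slice(4, length).ToArray();
        buffer = buffer.Slice(4 + length);
        return true;
    }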

If you're referring to TCP packets

Nope. [MS-TDS]: Tabular Data Stream Protocol is a packetized binary protocol. It is sometimes contained within [MC-SMP]: Session Multiplex Protocol, which is sometimes called SMUX.

At login the client requests a max packet size and the server responds with the max packet size it will allow. That negotiated packet size then applies for the lifetime of the connection. Thereafter, data sent from either side will be packed together in blocks no larger than that max packet size, and only the final packet of a message is allowed to be smaller than that size.

These TDS packets are typically 8K but can go up to almost 16K when SSL is used. Each packet will start with a small number of bytes for the SMUX header if it is in use, then the packet header, then the payload. Any data type larger than a byte can span multiple packet payloads with no additional metadata: you'll simply get 3 bytes at the end of one payload and then the next byte as the first in the next payload.

Because of that structure, what you really want to be able to do is request data from the payload(s) and not have to care about the mechanics of parsing out the SMUX or packet headers. I want to ask for 4 bytes and get an OperationStatus enum value back telling me if it succeeded and, if not, why not.

In the existing SqlClient codebase you've got this:

        internal bool TryReadInt32(out int value)
        {
            TdsParser.ReliabilitySection.Assert("unreliable call to ReadInt32");  // you need to setup for a thread abort somewhere before you call this method
            Debug.Assert(_syncOverAsync || !_asyncReadWithoutSnapshot, "This method is not safe to call when doing sync over async");
            if (((_inBytesUsed + 4) > _inBytesRead) || (_inBytesPacket < 4))
            {
                // If the int isn't fully in the buffer, or if it isn't fully in the packet,
                // then use ReadByteArray since the logic is there to take care of that.

                if (!TryReadByteArray(_bTmp, 0, 4))
                {
                    value = 0;
                    return false;
                }

                AssertValidState();
                value = BitConverter.ToInt32(_bTmp, 0);
                return true;
            }
            else
            {
                // The entire int is in the packet and in the buffer, so just return it
                // and take care of the counters.

                value = BitConverter.ToInt32(_inBuff, _inBytesUsed);

                _inBytesUsed += 4;
                _inBytesPacket -= 4;

                AssertValidState();
                return true;
            }
        }

Where you can see that it's dealing with both data and packet semantics at the same level. What isn't obvious is that the TryReadByteArray call is going to do blocking IO in sync mode, or possibly outright fail in async mode. IO, packet header decode, and data decode are all handled at the same layer everywhere, and it's really confusing as well as requiring lots of intricate, interlinked state management.

With an input pipeline you'd reduce that complexity through proper abstraction and make each layer more usable, readable, and testable. We want that.

I'm getting into a lot of detail, but I've given this quite a lot of consideration over the last couple of years and I believe that pipelines are something that we want to use. We may not be able to use them at the IO layer because of the SSL requirements, but even if we can't, they're usable beyond that and I think we should.

Wraith2 avatar Apr 25 '21 17:04 Wraith2
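
To make the desired shape concrete, a hedged sketch of the "ask for 4 bytes, get an OperationStatus back" API described above. It assumes a lower layer (like the relay sketched earlier) has already stripped the SMUX and packet headers, so the sequence holds payload bytes only; TDS payload integers are little-endian:

    using System.Buffers;

    static OperationStatus TryReadInt32(ref ReadOnlySequence<byte> payload, out int value)
    {
        var reader = new SequenceReader<byte>(payload);
        if (!reader.TryReadLittleEndian(out value))
        {
            // Fewer than 4 bytes buffered; the caller awaits more from the pipe.
            value = 0;
            return OperationStatus.NeedMoreData;
        }

        // Succeeds even when the 4 bytes straddle two packet payload segments.
        payload = payload.Slice(reader.Position);
        return OperationStatus.Done;
    }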

@Wraith2 it sounds to me like you want a Stream implementation over your multiple packets, which takes care of that layer and exposes the payload. The question is what happens in the payload, and whether messages in there are length-prefixed or not, etc.

But it seems both @Wraith2 and @davidfowl are saying Pipelines are the way to go, so let's move on to something else.

roji avatar Apr 25 '21 19:04 roji

Once you've used it, it'll feel like a no-brainer 😉

davidfowl avatar Apr 25 '21 19:04 davidfowl

... I wrote a really high-performance implementation of the parsing and object layers and then found that the moment I started trying to access the database it absolutely destroyed perf, and I thought I might be able to fix that. Still working on it 😀

@Wraith2 Are you planning to try again with in-memory tables? I'd love to see the initial or early focus of this effort be SqlBulkInsert to SQL in-memory tables. That scenario is currently a confusing mess and could use some love from the .NET layer, especially on allocations (DataTable is SOP, or object[] with boxed longs & doubles in them). :)

Any pointers to a better place to discuss such scenarios?

yzorg avatar May 24 '21 16:05 yzorg

Open an issue with details and a repro case on https://github.com/dotnet/SqlClient

Wraith2 avatar May 24 '21 18:05 Wraith2