SqlServer.Core: Performance-oriented SQL Server .NET driver
An experiment, in collaboration with the community, to determine what potential there is in using modern .NET features in a highly performant SQL Server driver.
@ErikEJ @wraith2 @NickCraver @mgravell @Drawaes: Let's get this party started!
@smitpatel, @roji, and I discussed how to get started with Woodstar, and we have created some initial issues for the ideas we came up with. We want this to be collaborative, so please add your ideas, either by commenting on issues or in new issues as appropriate.
Initial ideas:
- Set up build and perf testing infrastructure
- Implement a simple query scenario using Microsoft.Data.SqlClient
- Hack simple query scenario to get upper bound for TDS/SQL Server
/cc @davidfowl
Would it be useful to list current pain points, especially w.r.t. performance in the current driver structure? Discussing connection pooling and what we are/aren't able to see or get metrics on today due to what's exposed and what's internal is something that doesn't hurt so much per connection but is a pain at scale in various ways. Today, we're wrapping connections to make best-guess workarounds for how pooling is behaving, which commands are in flight when a stall happens, etc.
I know our goal is performance, but I think it's worth considering the driver/implementation surface area along these lines as we go, and maybe current pain points are a good starting point on that front?
- Async strings and blobs. The current internal structure makes their performance non-linear. Fixing it is a long project because it's complicated, and there is no mitigation other than using SequentialAccess mode and constructing the string yourself.
- The Task-based internal architecture uses almost entirely imperative task classes (Tasks, TaskCompletionSources, ContinueWith closures, etc.) rather than language-supported async/await, and this generates a lot of garbage. It could all be language- and ValueTask-based, and perform faster and cleaner.
- The codebase mixes sync and async paths liberally and confusingly. Along with lots of `lock`ing, this makes logical deductions about safety hard. Atomic state changes would be a lot easier to work with.
- The code is complex and hard to follow, with very little explanation of internal choices. E.g., when connecting to a named address, why does it try the last IPv4 address returned from DNS first? If we change it, do we break someone?
- Parameters being non-generic forces the use of boxing object containers and leads to reflection-based runtime type discovery and conversion.
- Connection pool keys being canonicalized and stored as strings incurs overhead on every connection, because the string has to be generated each time. A dedicated immutable type would be better.
- The SNI (SQL Network Interface) layer and the TDS layer are disconnected loops, and every packet received is copied at least twice. At least half of the byte arrays used are dropped to the GC because their lifetime isn't tracked, so they aren't known to be safe to re-use.
Plus a whole load of other random bits and pieces that, when/if we encounter the same problem in a new implementation, I will hopefully spot and warn about so we can avoid the same implementation choice. In general it's about what you'd expect from a 20-year-old codebase at this point: decisions that were correct at the time are now outdated, but the weight of the codebase is so high that changing direction carries unacceptably high risk. Writing from a high-level, async-first perspective will give us an easier codebase to work with.
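The async-strings mitigation mentioned in the list above (SequentialAccess plus constructing the string yourself) can be sketched as follows. This is only an illustration: `ChunkedRead` is a made-up helper, and the delegate mirrors the shape of `DbDataReader.GetChars` so the loop can be exercised without a database.

```csharp
using System;
using System.Text;

// Sketch of the SequentialAccess mitigation: instead of GetString(ordinal)
// (which buffers the whole value), read the column in fixed-size chunks via a
// GetChars-style callback and assemble the string yourself. The delegate has
// the same shape as DbDataReader.GetChars(ordinal, dataOffset, buffer,
// bufferOffset, length), minus the ordinal.
static class ChunkedRead
{
    public static string ReadString(Func<long, char[], int, int, long> getChars, int chunkSize = 4096)
    {
        var buffer = new char[chunkSize];
        var sb = new StringBuilder();
        long offset = 0;
        long read;
        // Keep asking for the next chunk until the source reports 0 chars.
        while ((read = getChars(offset, buffer, 0, buffer.Length)) > 0)
        {
            sb.Append(buffer, 0, (int)read);
            offset += read;
        }
        return sb.ToString();
    }
}
```

Against a real reader opened with `CommandBehavior.SequentialAccess`, the callback would be `(off, buf, idx, len) => reader.GetChars(ordinal, off, buf, idx, len)`.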
Should async-only be a design goal here, like HttpClient in .NET Core before...the happening?
I'd prefer to support as much of the existing external surface area as possible, because that will enable the widest adoption if we get to production stages. I'd say async-first but not async-only, if that's possible; I think it is, because Npgsql manages it AFAIK.
I'd hope we'd build the core engine with a different API shape than what ADO.NET wants. We can write a slow adapter for that layer but the new APIs shouldn't be bound by it. Also take a look at this issue https://github.com/dotnet/runtime/issues/28882 as a way to avoid boxing for DbParameters.
The problem with using another shape is knowing what that shape is: there can be no consumers until we write one, so we don't know what's required without a specific, well-defined set of consumer requirements. Who's the first consumer?
Well, you have SO people commenting right here. The point, I thought, is to push the boundaries of performance, not to support the widest number of use cases. As David has said, slower compatibility could be built on top. In reality, though, in the case of sync you are just hiding what is really going on, as all DB operations are async in nature if there is network or disk access involved.
Ok, pure perf it is and then we'll see. That'll allow ignoring all the existing public surface area defined in terms of Task and using ValueTask throughout which will be lovely.
You say that all DB operations are async. All IO operations are async, and if you meant that then I agree. However, because TDS is packetized there can be multiple columns and even rows in a single packet, so a single data fetch may not be async; it could just be fetched from memory. At the moment in SqlClient you either have to allocate a Task every time, or you have to use sync and live with the thread blocking sometimes. `ValueTask<T> GetFieldValueAsync(int index)` will make that problem go away.
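A minimal sketch of what that buys, assuming a hypothetical reader type (`BufferedReaderSketch` below is invented for illustration): when the value is already in a parsed packet buffer, the method returns a synchronously completed `ValueTask<int>` with no Task, TaskCompletionSource, or ContinueWith allocation; only the genuinely asynchronous path pays for async machinery.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical reader showing the ValueTask fast path / slow path split.
class BufferedReaderSketch
{
    private int[] _buffered;            // stand-in for an already-parsed row
    private Task<int[]> _pendingFetch;  // stand-in for an in-flight network read

    public ValueTask<int> GetInt32Async(int ordinal)
    {
        if (_buffered is not null)
            return new ValueTask<int>(_buffered[ordinal]);  // allocation-free fast path
        return AwaitFetchAsync(ordinal);                    // slow path: real async
    }

    private async ValueTask<int> AwaitFetchAsync(int ordinal)
    {
        _buffered = await _pendingFetch;  // await the simulated network fetch
        return _buffered[ordinal];
    }

    // Construction helpers for the demonstration:
    public static BufferedReaderSketch FromBuffer(int[] row) => new() { _buffered = row };
    public static BufferedReaderSketch FromFetch(Task<int[]> fetch) => new() { _pendingFetch = fetch };
}
```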
We may also want to think about being clever with pre-dispatching packet requests. Currently you run through the packet until you run out of data and then request more and wait for the response. If you can guess based on reading, a string for example, that you won't have enough data you could pre dispatch the next packet request and then it might be ready when you need it. A fundamental limitation of any library like this is network latency and we might be able to hide some of that.
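A rough illustration of that pre-dispatch idea, using a plain `Stream` as a stand-in for the packet source (`PrefetchingReader` is invented for this sketch; a real driver would do this at the TDS packet layer): the next read is issued before the current buffer is parsed, so network latency overlaps with parsing time.

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

static class PrefetchingReader
{
    // Double-buffered consumption: while one buffer is being "parsed"
    // (summed here, as a trivial stand-in for TDS parsing), the read for
    // the other buffer is already in flight.
    public static async Task<long> SumAllBytesAsync(Stream source, int bufferSize = 8)
    {
        var current = new byte[bufferSize];
        var next = new byte[bufferSize];
        long total = 0;

        int read = await source.ReadAsync(current, 0, bufferSize);
        while (read > 0)
        {
            // Dispatch the next read *before* consuming the current buffer.
            Task<int> pending = source.ReadAsync(next, 0, bufferSize);

            for (int i = 0; i < read; i++)   // "parse" the current buffer
                total += current[i];

            read = await pending;
            (current, next) = (next, current);  // swap buffers
        }
        return total;
    }
}
```

Only one read is ever in flight at a time, which keeps the pattern legal for ordinary streams while still hiding latency behind the parse loop.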
My vote here would be to only do a non-ADO.NET provider if figures clearly show ADO.NET is a serious blocker for performance. My Npgsql experience tells me that ADO.NET in itself shouldn't be a (significant) blocker for high-perf scenarios, and doing a non-ADO.NET driver basically means it's unusable by all existing .NET layers (Dapper, EF, LLBLGen Pro...). Of course we can always build an ADO.NET adapter on top of a non-ADO.NET driver, but I wouldn't necessarily go in the direction of "break the ecosystem" until we see a good reason to.
Note that while doing optimization work on EF Core, I've recently seen that saving an allocation here and there has less of an impact on end RPS perf than expected (by me at least), and in at least some cases ADO.NET can be improved upon too (e.g. https://github.com/dotnet/runtime/issues/17446 for parameter boxing). In other words, if some aspects of ADO.NET do turn out to block perf, I'd rather this effort helped us identify and fix them, rather than doing something completely different.
However, because TDS is packetized there can be multiple columns and even rows in a single packet, so a single data fetch may not be async; it could just be fetched from memory. At the moment in SqlClient you either have to allocate a Task every time, or you have to use sync and live with the thread blocking sometimes. `ValueTask<T> GetFieldValueAsync(int index)` will make that problem go away.
This is a good example. If the Task allocation from DbDataReader.GetFieldValue turns out to be significant (and let's see that it's the case, rather than assuming it is), we could simply add a new ValueTask-returning API alongside the existing one. We could then promote that into ADO.NET.
Regardless of the above, I think an async-only driver definitely makes sense, at least as a first phase (but maybe in general). Needing to support sync already creates various problems in today's world (e.g. no Pipelines support, at least last time I checked).
If the Task allocation from DbDataReader.GetFieldValue turns out to be significant (and let's see that it's the case, rather than assuming it is)
Well, I think so. This is from the TE Fortunes benchmark test project that I've been using for perf improvements over the last couple of years.
And since you're a bunch of people who'll know what you're looking at here's the dotTrace file for that so you can explore. BenchmarkDb.zip
The thing we actually want is the Fortune object and the strings it contains; look how far down the list it is. I'm pretty sure the entire query result is inside a single packet response, so it's all in memory at the point when fields are being fetched. Allocating a Task for every int32 read from an in-memory packet isn't the worst thing happening, but because of the public surface area, even if I managed to make the internals ValueTask-based, I'd still have the allocation at the user surface. I've considered proposing `ValueTask<T> ValueGetFieldValueAsync(int i)` a couple of times, but the name gives me second thoughts, let alone the approval process and the years it would take to get it adopted.
Anyway. You raise another good point @roji. What do we mean by high performance? The reason I started on this was that I wrote a really nicely performing server app and then the perf got destroyed by simply trying to save data to the database. I'd like to be able to plug in the database layer without it being the worst performing thing in the application.
In my experience so far query speed is dominated by network latency and we can't really do a lot about that. This is why memory makes a very small difference quite a lot of the time. What we miss in this scenario is that there will be higher layers doing other work like layout etc.
I think we're looking to get a driver which is as cpu and memory clean in as many aspects as possible so that many machine resources can be used simultaneously, so don't block threads don't waste memory causing time to be lost to GC. Get work done and get out of the way so higher layers can get higher throughput per unit hardware per time.
This is from the TE Fortunes benchmark test project that I've been using for perf improvements over the last couple of years.
We're getting down into the little details, but... I can't see any reason for GetFieldValueAsync to be used in TE Fortunes (or generally a big majority of ADO.NET usage). Unless one is reading very big rows and using CommandBehavior.SequentialAccess, the row is pretty much guaranteed to be buffered in memory; after all, without SequentialAccess one can read columns in any order. In that scenario, using GetFieldValueAsync is pretty much useless, and GetFieldValue should be used instead.
But if we want to add another API just to remove the Task allocation for when huge rows and SequentialAccess are being used, then it really isn't that complicated - mostly a matter of finding the right name. I think it's a pretty low-priority scenario (if you're reading huge rows, that extra allocation isn't likely to matter), but it's trivial to push through.
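The pattern being described (advance rows asynchronously, then fetch columns synchronously once the row is buffered) can be sketched like this. `FortuneReader` is illustrative, and the `Fortune` type is assumed from the TE benchmark discussed above; the calls used are the standard `DbDataReader` ones.

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using System.Data.Common;
using System.Threading.Tasks;

record Fortune(int Id, string Message);

static class FortuneReader
{
    // Without CommandBehavior.SequentialAccess the row is buffered once
    // ReadAsync completes, so per-column access can be synchronous and
    // avoids a Task allocation per field.
    public static async Task<List<Fortune>> ReadAllAsync(DbDataReader reader)
    {
        var results = new List<Fortune>();
        while (await reader.ReadAsync())            // async: may hit the network
        {
            results.Add(new Fortune(
                reader.GetFieldValue<int>(0),       // sync: row already buffered
                reader.GetFieldValue<string>(1)));
        }
        return results;
    }
}
```

The sketch can be exercised against any `DbDataReader`, e.g. a `DataTableReader` over an in-memory table, without a live SQL Server.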
Re the rest... I do hope Npgsql shows that very good perf is possible while sticking to ADO.NET.
This is why memory makes a very small difference quite a lot of the time.
I'd want to see some backing for that assumption. Of course, I don't mean egregious useless allocations inside the driver (which is what SqlClient does in some cases); that should simply be avoided or cleaned up. But again, my recent experience with EF Core shows that hunting down every single reasonable allocation has quickly diminishing returns, and therefore probably doesn't justify dumping ADO.NET.
To summarize, .NET does have a standard database API. While it's not the best it could be, there's an entire ecosystem around it, and very good performance has been achieved without dumping it. Could a bit more perf be squeezed by saving a couple allocations that ADO.NET forces us to do? Maybe. Does that justify doing something completely non-standard, where maybe ADO.NET itself could be fixed? I think that remains to be shown.
I don't understand the point of this repo if you just want to improve the sql client under ado.net? You might as well work on fixing that under the hood. SslStream was a similar age, very risky, a mix of sync and async and promise based code and we fixed it from the inside. There is no point just making another ado.net sql lib with all those restrictions.
I'm on @Wraith2 's side, that profile shows so much waste, like nuke all of those allocations😄
In MySqlConnector, I've considered providing a "direct" API that exposes the raw MySQL protocol through a .NET API to see how much overhead ADO.NET contributes. However, it always seemed extremely unlikely that it would be used anywhere except the TE benchmarks due to the established ADO.NET ecosystem.
I have no objections to this project pursuing a focused SQL Server solution to see what the maximum possible performance is, but would personally appreciate the effort going into directions that could benefit all ADO.NET providers/users. But please just take this as my 2c, and not a strong attempt to change the project direction.
Some ADO.NET pain points I'd like to see fixed:
- No first-class "connection pool" concept. Every ADO.NET user has to create (and throw away) a `DbConnection` object that behind the scenes borrows an actual connection from a pool (and returns it when `DbConnection.Dispose` is called). This results in a lot of temporary garbage objects when it would be nicer to write `pool.BorrowConnection` (or similar) instead. Obviously that would be a big change in how ADO.NET is consumed.
- Having a common, well-tested, low-overhead, lock-free connection pool available to use as an ADO.NET driver implementer would be very nice.
- Connection strings. Related to the above, the pool is usually implicitly selected by setting `DbConnection.ConnectionString`. This is frequently parsed with `DbConnectionStringBuilder`, which adds some unavoidable overhead: https://github.com/mysql-net/MySqlConnector/issues/238. If a "connection pool" object were created, connection string parsing would only have to happen once, instead of per-`DbConnection`. (Drivers can do some optimisations right now based on connection string equality, of course.)
- `DbCommand`, `DbParameterCollection`, etc. are a lot of ephemeral objects created for simple SQL statements. Dapper provides an `ExecuteScalarAsync<T>` extension method on `DbConnection`; it might be nice to have something similar as part of ADO.NET that could do some optimised parameter binding, execute the SQL, and return the single scalar result.
- `DbCommand.Prepare` doesn't make a lot of sense on an ephemeral object if it's intended to create a reusable server-side object. Usually this ends up storing some state on the underlying object that represents an open server connection.
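As a sketch of the `pool.BorrowConnection` shape suggested in the first bullet (all names here are hypothetical, and `ConcurrentBag` stands in for a real lock-free pool implementation): callers borrow and return pooled items directly instead of allocating a throwaway `DbConnection` wrapper per use.

```csharp
using System;
using System.Collections.Concurrent;

// Minimal borrow/return pool; a production version would add limits,
// idle pruning, health checks, and async borrowing under contention.
sealed class Pool<T> where T : class
{
    private readonly ConcurrentBag<T> _idle = new();
    private readonly Func<T> _factory;   // in a driver, opens a real connection

    public Pool(Func<T> factory) => _factory = factory;

    // Borrow an idle item, or create a new one if the pool is empty.
    public T Borrow() => _idle.TryTake(out var item) ? item : _factory();

    public void Return(T item) => _idle.Add(item);
}
```

The key property is that the pool itself is the public concept: the connection string is parsed once when the pool is built, not on every borrow.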
OTOH, if a lot of these (and similar) suggestions get implemented, maybe it's "not ADO.NET" anymore, and trying to preserve ADO.NET was a false goal? 😀
This is my 2c as well. But I see it as an opportunity to write a fresh new DB surface area that, while focused on SQL Server initially, could just as well have a MySQL etc. version later. Connection pooling is a very good example of something that I think would be extra pain to shoehorn into existing ADO.NET.
I would love to see a break from tradition, following the example of Pipelines vs. Streams: initially built on their own, with bridging code eventually added to help you use Pipelines with Stream APIs. But for pure perf you want Pipelines all the way through.
To be clear, I'm not the boss here or anything - am just stating my opinions!
I don't understand the point of this repo if you just want to improve the sql client under ado.net? You might as well work on fixing that under the hood. SslStream was a similar age, very risky, a mix of sync and async and promise based code and we fixed it from the inside. There is no point just making another ado.net sql lib with all those restrictions.
There are various reasons why this isn't feasible - lots of backwards-compatibility, legacy and stability constraints simply don't make this possible. As a very simple example, SqlClient needs to continue supporting .NET Framework.
But I see it as an opportunity to write a new fresh DB surface area that while focused on Sql Server initially there is no reason there couldn't be a MySql etc version.
So there really are two efforts being discussed here:
- Writing a new, modern and efficient driver for SQL Server; this indeed resembles rewriting SslStream while retaining the same public API. There are indications that this would be of immense value in the ecosystem - and this is the original/main thing discussed here.
- Writing a new .NET database API, as an alternative to ADO.NET.
I see these as two very separate and orthogonal efforts, each being quite huge. For example, discussing a first-class, cross-database pooling component as @bgrainger mentioned (see https://github.com/dotnet/runtime/issues/24856) is just on its own a potentially huge debate. I believe keeping focus on one of the above two is important (trying to take on too much frequently backfires), but again, that's just my opinion.
I guess I'm also a bit confused about why we're proposing to do a new database API. Are we convinced ADO.NET is a significant blocker to database driver perf, and if so, that we can't fix those problems incrementally? Do we have any data to support that? Or do we just hate the API and want a modern, new API (also legitimate)? Or do we just want to play around without any constraints (also legitimate)? The important point is that any new DB API breaks the ecosystem in both directions: DB drivers would need to be rewritten to support it (Npgsql, MySqlConnector, Oracle), and upper layers would have to do the work to support it as well (EF Core, Dapper, LLBLGen Pro...). The effort would be immense, and I'm not clear on exactly why we'd be proposing that.
@bgrainger FWIW I agree with a lot of your points above, though some things can be handled without API changes. For example, Npgsql avoids parsing the connection string more than once; `DbCommand` instances can be recycled by their owning `DbConnection`, and similarly `DbDataReader` instances can be recycled by their owning `DbCommand` (Npgsql does these things); we can possibly do something similar.
But to reiterate - I'm just a guy expressing my opinions. If we end up with a non-ADO.NET driver, and an ADO.NET shim over that, maybe that's fine too.
I'm on @Wraith2's side, that profile shows so much waste, like nuke all of those allocations 😄
FWIW I'm pretty sure 90+% of these allocations are the SqlClient implementation, rather than anything necessarily having to do with ADO.NET.
and doing a non-ADO.NET driver basically means it's unusable by all existing .NET layers (Dapper, EF, LLBLGen Pro...)
FWIW, the resurrection of this thread a few days ago is part of what prompted me to get started on DapperAOT. Pretty sure I'll have code working in the flavor of Dapper (although using the new DapperAOT) in lock-step with any-and-all progress here.
The current SqlClient codebase is too dangerous for a "move fast and break things" approach. The number of users, and the fact that we want people to move from the System to the Microsoft version, requires a slow and measured approach. The unfortunate side effect is that the speed of change is slow. I've been tinkering for two years and that trace is still abysmal compared to modern code. It is also important to note that the codebase is difficult to work with.
The existing SqlClient library will not go away, but it will not change fast enough to improve performance for the majority of users. I'm fairly sure the performance of SqlClient lags behind that of other database providers; even if it doesn't, I'm sure there can be a better-performing implementation.
In writing an SqlCore library we can:
- identify if there are new patterns that would be better than existing ADO approaches for high throughput, low overhead, high concurrency, etc scenarios
- provide a good non-surprising (no async over sync) async way to access Sql Server data
- provide a viable mitigation for existing problems with high string and binary use scenarios
I think all of these things have value.
Things that aren't current goals as I see it:
- replace or deprecate ADO, just not practical nor demonstrably of value
- create an ADO capable SqlCore driver
- create a driver which implements the entire breadth of the existing SqlClient driver (SqlNotification? not now).
We do not have to choose to ignore the existing ADO api surface because we can probably provide most of it built on a new implementation. This doesn't have to be an explicit goal but it may be something to be aware of as a long term nice-to-have so we can flag up changes which would prevent it being possible.
The existing ADO surface has problems, such as `Task<T> GetFieldValue<T>(int i)`, which can be worked around trivially with a new implementation built using ValueTasks where appropriate. So perhaps we should work on a new implementation regardless of the API shape, and then, when we have something that works for trivial cases, review what we've got and discuss further what direction to take?
I'm happy with working towards specific use cases from StackOverflow if they can be provided. It's a point to work towards and learn about the problem space. Once we have something that is of worth or we collectively decide that the goal is impossible or improbable we can re-assess.
I don't see any problem with the general pattern of access which ADO provides. As a sketch:

```csharp
public async IAsyncEnumerable<Fortune> GetFortunes()
{
    await using (var connection = GetSqlCoreConnection())
    {
        using (var command = GetSqlCoreCommand(connection /*, command properties? */))
        {
            using (var reader = await command.GetReader(/* reader properties */))
            {
                while (await reader.GetNextRow())
                {
                    yield return new Fortune
                    {
                        Id = await reader.GetValue<int>(0),
                        Message = await reader.GetValue<string>(1)
                    };
                }
            }
        }
    }
}
```
but that doesn't mean we have to start off with implementing all the interfaces that are in the System.Data namespace to do it.
I'm on @Wraith2's side, that profile shows so much waste, like nuke all of those allocations 😄
FWIW I'm pretty sure 90+% of these allocations are the SqlClient implementation, rather than anything necessarily having to do with ADO.NET.
Yup, but the `Task<int>` and `Task<string>` are directly due to `Task<T> GetFieldValue<T>(int i)` not being ValueTask-based. You should always use the async call when getting a string because it can easily cross a packet boundary, but if you do that you pay for the Task even if the string is in memory and doesn't cross a packet boundary.
I'm on @Wraith2 's side, that profile shows so much waste, like nuke all of those allocations😄
Working on it, slowly... This is the good version; it used to GC every 16ms when I started. The waste is, as @roji said, largely due to implementation issues, but it will take me at least another 2 years to make a dent in them. I'm going to do that work, but while I do, we can also investigate better ways to approach the required functionality with modern techniques. Things learnt on each project may improve the other.
One more possible idea... If we don't want to be constrained by ADO.NET, but also aren't aiming to create a new abstraction, we could just write a driver with a public API surface that fits SQL Server, TDS, etc. This would be a low-level API which doesn't pretend or aim to be a general database API in any way (and once again, a shim could be built on top of that). It's all really a question of what we're trying to achieve here.
I think that would be a good place to start.
Great suggestion @roji !
Bottom line is that we need to measure this. If ADO.NET is adding significant overhead, then we either address that or don't ultimately use ADO.NET. Both of these options are on the table.
However, our current .NET/PostgreSQL solution on TechEmpower is #12 overall and uses ADO.NET. Analysis of this (which @roji has done extensively) indicates that, at least for queries, ADO.NET is not limiting. We have also found this with other experiments done in the past. So even though a few years ago my gut feeling was that we would have to drop ADO.NET, I now think I was wrong about that because the data we have just doesn't show it.
The idea behind issue #12 is to get the upper limit. Once we have that we can put that code into ADO.NET and get an initial read of the overhead. Until we have that data I think we're all guessing.
@Wraith2 I should point out again that the goal of this project is not to replace Microsoft.Data.SqlClient. Rather, it is to complement it with a fast driver that most likely won't support many of Microsoft.Data.SqlClient's features. Microsoft.Data.SqlClient will always be an option for those who need ADO.NET or backwards compatibility.
While playing with NativeAOT and source generators, I would like to provide some feedback:
- I notice that when the DataReader reads from TDS, it reads data in its native format and then places it in an `object[]`, and the typed `GetValue<T>` versions read from that object, causing boxing/unboxing. I'm not sure that I want direct TDS access, but maybe that would work for my scenario.
- SqlConnection covers a lot of different scenarios that not everybody needs in a specific application. Maybe it would be possible to have an API that controls which features are used through the API itself, not the connection string. For example, you would opt in to connection pooling, Azure, local instances, AD support, etc.
Hopefully this is not far off the goals of this project. Just my 2c.
I notice that DataReader when read from TDS read data in native format and place them then in object[] and when typed versions of GetValue read from that object, cause boxing/unboxing. I'm not sure that I want direct TDS access, but maybe that would work for my scenario
This is an old .NET 1.1 API, and because it uses `object[]`, the boxing of value types is unavoidable. Direct access through `GetFieldValue<T>` should not box for any supported value type. I don't expect GetValues to be something we would use in a high-perf scenario.
Thanks for the explanation. I'll go home and do my homework, then.