MySqlConnector
Support for streaming to a MySQL parameter from System.IO.Stream?
Working on porting Rebus to MySqlConnector, and there is support for streaming from a source directly into the database via a database parameter, but this appears not to be supported by MySqlConnector. Is this a limitation of MySqlConnector, or a limitation of the MySQL wire protocol itself? Streaming the result set works; I just cannot find a way to stream data on insert.
System.NotSupportedException : Parameter type MemoryStream (DbType: Binary) not currently supported. Value: System.IO.MemoryStream
I am using it like this (stolen directly from the SQL Server version):
command.Parameters.Add("data", MySqlDbType.VarBinary, MathUtil.GetNextPowerOfTwo((int)source.Length)).Value = source;
I am getting it working by simply turning it into a byte array and submitting that, but that is not particularly ideal if the data is large.
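Roughly, the workaround looks like this (a minimal sketch using the command and source from the snippet above; it buffers the entire stream in memory before handing it to the parameter):
// Buffer the source stream into a byte array first (fine for small payloads,
// but memory-hungry for large ones).
using (var buffer = new MemoryStream())
{
    source.CopyTo(buffer);
    command.Parameters.Add("data", MySqlDbType.VarBinary, (int)buffer.Length).Value = buffer.ToArray();
}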
Code in question is here:
https://github.com/kendallb/Rebus.MySqlConnector/blob/master/Rebus.MySqlConnector/MySqlConnector/DataBus/MySqlDataBusStorage.cs#L113
is this a limitation of MySqlConnector, or a limitation of the MySQL wire protocol itself?
A little bit of both. 😀
MySqlConnector doesn't have support for it, hence the exception.
It could certainly be added (as a convenience method for this use case). However, the MySQL text protocol requires the entire INSERT SQL statement to be sent as a length-prefixed packet. Thus, MySqlConnector would have to convert the entire Stream to an in-memory blob; it couldn't really be streamed. (Various reserved bytes in the Stream have to be escaped, so the length can't be known ahead of time just by checking Stream.Length. MySqlConnector could read the Stream twice, once to compute the length and once to send it, but this might be inefficient, or simply not supported for forward-only Streams.)
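To illustrate why Stream.Length alone isn't enough, here is a rough sketch (not MySqlConnector's actual code) of the kind of byte escaping the text protocol needs; the escaped length depends on the content:
using System.Collections.Generic;

// Illustrative only: escape raw bytes for embedding inside a quoted SQL literal.
// Bytes like NUL, backslash and single quote must be escaped, so the final
// length depends on the content and can't be derived from Stream.Length alone.
static byte[] EscapeForSqlLiteral(byte[] input)
{
    var output = new List<byte>(input.Length + 16);
    foreach (var b in input)
    {
        if (b == 0x00)
        {
            output.Add((byte)'\\');
            output.Add((byte)'0');
        }
        else if (b == (byte)'\\' || b == (byte)'\'')
        {
            output.Add((byte)'\\');
            output.Add(b);
        }
        else
        {
            output.Add(b);
        }
    }
    return output.ToArray();
}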
Still, this is probably worth adding, if only for API compatibility with other ADO.NET providers, even if it doesn't support actual streaming, at least initially.
Right, it would be nice to support it just to make it more compatible with other ADO.NET connectors like SQL Server, even if under the hood it just does what I did and converts the stream into a byte array. That does mean it would crash out if someone tries to insert a massive blob, but most folks don't have the max packet size for MySQL set to a massive number anyway (in our case it's 16MB for production), so unless the MySQL wire protocol supported proper streaming, it would still be limited to whatever that size is. Clearly SQL Server supports massive blobs, as the Rebus unit test for the code in question inserts a 100MB file, which naturally crapped itself on my test system with a 16MB max packet size :)
At the end of the day folks shouldn't be inserting massive blobs into MySQL anyway, as it's the wrong tool for the job. If you need to store massive blobs, they should be stored in a storage bucket somewhere (Amazon S3, Google, etc.) and a link to the resource stored in MySQL, so I don't think it's a big limitation.
But it would be nice when porting code like this for it to just 'work', at least until you actually try to insert something massive :).
That does mean it would crash out if someone tries to insert a massive blob, but most folks don't have the max packet size for MySQL set to a massive number anyway (in our case it's 16MB for production)
That's a good point: usually one is quite limited in the maximum blob size that can be inserted, due to this limit.
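(For reference, the effective limit can be checked at runtime; a minimal sketch, assuming an already-open MySqlConnection named connection:)
// Query the server's max_allowed_packet setting, in bytes.
using (var cmd = new MySqlCommand("SELECT @@max_allowed_packet;", connection))
{
    var maxAllowedPacket = Convert.ToInt64(cmd.ExecuteScalar());
    Console.WriteLine($"max_allowed_packet = {maxAllowedPacket} bytes");
}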
Updated ADO.NET tests; it appears that SqlClient is the only other library that supports using a Stream as a parameter value.
Npgsql has decided not to support it (and throws a NotSupportedException): https://github.com/npgsql/npgsql/issues/550
Most other libraries "succeed" but write "System.IO.MemoryStream" to the database column, introducing silent data corruption.
Yeah, it's six of one, half a dozen of the other. When you get a nice exception at least you know stuff is not working and can fix it. In some ways I suppose having the developer specifically write the code like I did to turn it into a byte array means you are fully aware of the consequences (memory usage). That might be better than having someone think they can stream a 1GB file into a database blob only to find it runs out of memory :)
I'm adding support for MemoryStream. If TryGetBuffer succeeds, the underlying buffer will be written directly to the packet (which should be efficient). Otherwise, ToArray() will be called, which will duplicate all the data before copying it (inefficient).
A different approach (e.g., one that reads repeatedly into a 64 KiB buffer) could be implemented, which would work for all Stream types. That could replace the ToArray() code path in the future if there were a pressing need.
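A rough sketch of those two MemoryStream paths, plus the chunked alternative for arbitrary streams (illustrative only, not MySqlConnector's actual internals; packetWriter is a stand-in for the outgoing packet):
using System;
using System.IO;

// MemoryStream fast/slow paths.
static ReadOnlyMemory<byte> GetMemoryStreamPayload(MemoryStream stream)
{
    // Fast path: the underlying buffer is exposable, so it can be written
    // to the packet without an extra copy.
    if (stream.TryGetBuffer(out ArraySegment<byte> buffer))
        return buffer.AsMemory();

    // Slow path: ToArray() duplicates the data before it gets copied again
    // into the outgoing packet.
    return stream.ToArray();
}

// Alternative that works for any Stream (including forward-only ones):
// read repeatedly into a 64 KiB buffer and hand each chunk to the writer.
static void CopyInChunks(Stream source, Stream packetWriter)
{
    var chunk = new byte[64 * 1024];
    int bytesRead;
    while ((bytesRead = source.Read(chunk, 0, chunk.Length)) > 0)
        packetWriter.Write(chunk, 0, bytesRead);
}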
(A concern, as discussed above, with supporting any arbitrary Stream is that it might make people think they can stream a very large binary file to MySQL Server, but that's not actually possible.)
OK, I think supporting any stream should be implemented to make it really useful. A memory stream itself is not all that useful, as all the data will already be in memory in that case, so I'm not sure it buys you much other than avoiding the second copy. A lot of the time it's likely to be a real stream like a file stream or network stream, so the buffering approach would make the most sense to me?
If the buffering approach is implemented, does that mean it would avoid the max packet size issue? Or will that still be a problem? If it can avoid that, then at least it would be possible to stream in blob data larger than the max packet size.
MemoryStream support is in 1.3.0-beta.4: https://www.nuget.org/packages/MySqlConnector/1.3.0-beta.4
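Usage then looks something like this (a minimal sketch; connection, id and bytes are assumed to exist, and the table/column names are made up):
using var command = connection.CreateCommand();
command.CommandText = "INSERT INTO attachments(id, data) VALUES(@id, @data);";
command.Parameters.AddWithValue("@id", id);
using var payload = new MemoryStream(bytes);
// With 1.3.0-beta.4 or later, a MemoryStream can be assigned directly as the parameter value.
command.Parameters.Add("@data", MySqlDbType.Blob).Value = payload;
command.ExecuteNonQuery();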
MySqlBulkCopy is the closest thing to streaming that MySqlConnector has; the right way to support Stream is probably to allow it to be part of a DataTable that can be loaded through MySqlBulkCopy.
It's crazy, but I have never had a need for the DataTable class myself (we use everything via a micro ORM), so it's not clear to me how you would stream a file off disk, for instance, into a data table that can then be streamed to MySQL via the bulk copy class. Unfortunately the SQL Server sample code just shows how to toss data from one table to another using it.
Assuming it's possible to copy the data via the MySQL wire protocol via a 64K buffer and avoid the max packet size issues, I think that's the best approach. If that is not possible, then maybe I need to learn how to use a data table and bulk copy in that way for Rebus.
DataTable is just currently the public API for MySqlBulkCopy; it wouldn't be a hard requirement.
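For what it's worth, loading a blob through a DataTable and MySqlBulkCopy would look roughly like this (a sketch only; it still reads the whole file into memory, AllowLoadLocalInfile=true is needed in the connection string, and connection, path and the table/column names are assumptions):
// Build a one-row DataTable whose columns match the destination table.
var table = new DataTable();
table.Columns.Add("id", typeof(long));
table.Columns.Add("data", typeof(byte[]));
table.Rows.Add(1L, File.ReadAllBytes(path));

var bulkCopy = new MySqlBulkCopy(connection)
{
    DestinationTableName = "attachments",
};
bulkCopy.WriteToServer(table);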
Assuming it’s possible to copy the data via the MySQL wire protocol via a 64K buffer and avoid the max packet size issues, I think that’s the best approach.
AFAIK that's only possible by first executing a LOAD DATA command, then streaming the data to the server. MySqlConnector exposes that through MySqlBulkLoader (which would be too low-level for this need) and MySqlBulkCopy (which uses it under the hood).
But an INSERT statement can never be streamed that way, so supporting an arbitrary Stream for MySqlParameter.Value would run into hard max_allowed_packet limitations quickly.
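(For completeness, the lower-level MySqlBulkLoader path looks roughly like this; a sketch only, assuming AllowLoadLocalInfile=true, an open connection, a path variable, and a simple two-column table:)
var loader = new MySqlBulkLoader(connection)
{
    TableName = "attachments",
    SourceStream = File.OpenRead(path), // sent to the server after a LOAD DATA LOCAL command
    Local = true,
    FieldTerminator = ",",
    LineTerminator = "\n",
};
loader.Columns.AddRange(new[] { "id", "data" });
var rowsInserted = loader.Load();
// Note: raw binary blob data would still need to be hex- or base64-encoded here
// so it can't collide with the field/line terminators, which is part of why this
// API is too low-level for the Stream-parameter use case.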
Looks like there has to be a way to support it, as it’s officially supported for the C API?
https://dev.mysql.com/doc/c-api/8.0/en/mysql-stmt-send-long-data.html
https://stackoverflow.com/questions/29600738/optimal-way-to-store-blobs-larger-than-max-allowed-packet-in-mysql-innodb
I don't think so, since from that page:
The max_allowed_packet system variable controls the maximum size of parameter values that can be sent with mysql_stmt_send_long_data().
EDIT: This may just mean the maximum size of each chunk, not the total length?
people say it works for Java and PHP, so it must be possible?
https://stackoverflow.com/questions/3754238/is-there-any-way-to-insert-a-large-value-in-a-mysql-db-without-changing-max-allo
A good point from that SO question:
Still, even if I manage to INSERT, say, 20MB BLOB with max_allowed_packet=4MB how do I SELECT it back? I don't see how I can do it.
It seems pointless to invent a way of inserting arbitrarily large data that can't be retrieved. Additionally, using MySqlBulkCopy to insert a single row would be a rather awkward API.
So perhaps INSERT should just support arbitrary Stream parameters, and it's not going to be that big a problem in practice to buffer them in memory before sending them to the server.
You can already receive large data, since the connector already lets you read a column's data back into a stream.
You can already receive large data, since the connector already lets you read a column's data back into a stream.
No, because there's no way for the server to send it without exceeding max_allowed_packet. The Stream is just an abstraction. (And in MySqlConnector, it just creates a read-only MemoryStream on the data that's already in memory.)
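(For reference, reading a blob column back as a Stream looks like this; connection, id, destination and the table/column names are made up, and the returned stream wraps row data that has already been buffered in memory:)
using var command = new MySqlCommand("SELECT body FROM messages WHERE id = @id;", connection);
command.Parameters.AddWithValue("@id", id);
using var reader = command.ExecuteReader();
if (reader.Read())
{
    using var blob = reader.GetStream(0); // read-only MemoryStream over the buffered column data
    blob.CopyTo(destination);             // e.g., a FileStream
}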
people say it works for Java and PHP, so it must be possible?
I can look into implementing mysql_stmt_send_long_data in MySqlConnector. It only works for the "binary" protocol, so you would have to call MySqlCommand.Prepare in order to use it.
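If that were implemented, usage might look something like this (hypothetical; Stream parameter values combined with Prepare are not supported at the time of writing, and connection/path/table names are assumptions):
using var command = new MySqlCommand("INSERT INTO attachments(data) VALUES(@data);", connection);
using var file = File.OpenRead(path);
command.Parameters.AddWithValue("@data", file); // hypothetical: a Stream as the parameter value
command.Prepare();                              // binary protocol is required for send_long_data
command.ExecuteNonQuery();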
So this example is pretty pointless then? Never tried it, since I can't get data into the DB. Lol
https://dev.mysql.com/doc/connector-net/en/connector-net-programming-blob-reading.html
people say it works for Java and PHP, so it must be possible?
I can look into implementing mysql_stmt_send_long_data in MySqlConnector. It only works for the "binary" protocol, so you would have to call MySqlCommand.Prepare in order to use it.
Sure, if a prepared statement is needed, at least it solves the problem and I can make all this work nicely in Rebus.
Also, as much as I think storing large data in MySQL is silly, in the case of Rebus as a message transport it is quite useful to support data larger than the max packet size, because the point of a message queue is that it's all transient data. It goes in and out really fast, and it makes things so much simpler if it's all just handled in the message structures and APIs, rather than having to resort to something like a shared storage bucket to support it.
The next thing I need to figure out is why message reading via MySQL is so much slower than the SQL Server version. Alas, we still use MySQL 5.7, and it does not support skipping locked rows like SQL Server does, so their simple approach of deleting the row in the transaction and having other readers simply ignore it can't work; it just results in deadlocks.
The documentation for COM_STMT_SEND_LONG_DATA: https://dev.mysql.com/doc/internals/en/com-stmt-send-long-data.html
MySqlConnector could determine if a prepared command had a Stream supplied for any parameter. If so, it could send the data from the stream in repeated chunks using COM_STMT_SEND_LONG_DATA. Then it could execute the statement, binding the rest of the parameters.
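Very roughly, the idea would be something like the following (a sketch only, not real MySqlConnector internals; sendLongData stands in for a hypothetical helper that writes a COM_STMT_SEND_LONG_DATA packet):
using System;
using System.IO;

// Stream one parameter's data to the server in 64 KiB chunks before
// COM_STMT_EXECUTE binds the remaining parameters and runs the statement.
static void SendStreamParameter(int statementId, int parameterIndex, Stream source,
    Action<int, int, ReadOnlyMemory<byte>> sendLongData)
{
    var chunk = new byte[64 * 1024];
    int bytesRead;
    while ((bytesRead = source.Read(chunk, 0, chunk.Length)) > 0)
        sendLongData(statementId, parameterIndex, chunk.AsMemory(0, bytesRead));
}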
The implementation problem is that currently, multiple commands are batched together into one packet. (MySQL doesn't allow multiple statements (e.g., INSERT INTO tbl1 VALUES(...); INSERT INTO tbl2 VALUES(...);) to be prepared, but MySqlConnector permits this by breaking the statement into multiple parts, which are buffered in memory and then all sent together.) To enable data to be streamed, this buffering would have to be undone. (This isn't impossible; it just makes the implementation slightly more difficult than would be ideal.)
Interesting. So the solution would be to not do the statement batching when a stream is involved.
Does MySqlConnector only do the splitting if the statements are prepared? I just found something interesting. As I mentioned above, I was looking into how to speed up the performance of Rebus.MySqlConnector compared to SQL Server. The MySqlConnector version was able to insert data into the message queue way faster than SQL Server does when using the lease transport, but SQL Server blew the doors off MySQL when it came to pulling the data back out. Part of the problem is the lack of row-level lock skipping for SELECT statements in MySQL 5.7, but that's not all of it. I tried changing things somewhat to speed it up, but did not succeed. Here are some results:
SQL Server:
*** Using NORMAL SQL transport ***
Inserted 1000 messages in 0.1 s - that's 6792.9 msg/s
1000 messages received in 0.1 s - that's 10363.0 msg/s
*** Using LEASE-BASED SQL transport ***
Inserted 1000 messages in 3.3 s - that's 305.7 msg/s
1000 messages received in 0.3 s - that's 2944.6 msg/s
MySqlConnector:
*** Using NORMAL SQL transport ***
Inserted 1000 messages in 0.2 s - that's 6537.2 msg/s
1000 messages received in 5.3 s - that's 188.4 msg/s
*** Using LEASE-BASED SQL transport ***
Inserted 1000 messages in 0.2 s - that's 4059.2 msg/s
1000 messages received in 5.3 s - that's 188.7 msg/s
Now what is super interesting is that I just back-ported the same code to run on the Oracle connector, and it was WAY faster than MySqlConnector for the read operations (not as fast as SQL Server, but significantly faster):
Oracle Connector:
*** Using NORMAL SQL transport ***
Inserted 1000 messages in 0.5 s - that's 2061.7 msg/s
1000 messages received in 0.8 s - that's 1295.1 msg/s
*** Using LEASE-BASED SQL transport ***
Inserted 1000 messages in 0.6 s - that's 1757.0 msg/s
1000 messages received in 0.9 s - that's 1090.4 msg/s
So clearly from the above, the MySqlConnector version is quite a bit faster than the Oracle version for inserting, but for receiving it is not. The big difference is that this library is written to be async all the way through, hence all of the SQL operations in the library are async calls, whereas as we know the Oracle connector is not really async at all; it just does sync calls with async semantics. I suspect that is why the inserts are so much faster, since it can send a lot more over due to using async, but for the receives I wonder if either the async stuff is making it slower, or if it's something to do with the statement batching you mentioned above?
The reason I ask is that the receive operation is implemented as multiple SQL statements that all get sent together in the same transaction, designed so that we pull out the next message, then update it to mark it as being processed (to avoid the row-level locking stuff I mentioned), and then select the data out. Very similar to how the SQL Server version is done, except that SQL Server does not mark it as processing; it deletes it and keeps the transaction open until the message is successfully consumed (which will not work without row-level lock skipping for SELECT, which is not in 5.7).
https://github.com/kendallb/Rebus.MySqlConnector/blob/master/Rebus.MySql/MySql/Transport/MySqlTransport.cs#L295
SELECT id INTO @id
FROM {tableName}
WHERE visible < now(6) AND
expiration > now(6) AND
processing = 0
ORDER BY priority DESC,
visible ASC,
id ASC
LIMIT 1
FOR UPDATE;
SELECT id,
headers,
body
FROM {tableName}
WHERE id = @id
LIMIT 1;
UPDATE {tableName}
SET processing = 1
WHERE id = @id;
SET @id = NULL;
So either this statement splitting stuff you mentioned is what is slowing it down, or it's that this particular operation does not do so well with async?
No, I don't think it's the async stuff. In that test I can easily disable all the async so only one operation runs at a time, and that dramatically slowed down the MySqlConnector version as well as the Oracle one, but the Oracle one was still much faster: 9.3 msg/s for MySqlConnector and 626.4 msg/s for Oracle. So quite a bit slower in both cases.
If you are interested in profiling this to see where it's so slow, the test to run is this one:
https://github.com/kendallb/Rebus.MySqlConnector/blob/master/Rebus.MySql.Tests/Transport/TestMySqlTransportReceivePerformance.cs
You need a database user called mysql with the password mysql that has full access to the rebus% databases (and create a schema called rebus2_test).
MySqlConnector could determine if a prepared command had a Stream supplied for any parameter. If so, it could send the data from the stream in repeated chunks using COM_STMT_SEND_LONG_DATA.
Well, so much for that theory. If you actually try this, you get the exception Parameter of prepared statement which is set through mysql_send_long_data() is longer than 'max_allowed_packet' bytes.
Sample code to implement this is here: https://github.com/bgrainger/MySqlConnector/tree/send-long-data
The only benefit this provides is being able to insert a row whose individual columns are each less than max_allowed_packet bytes, but whose total size would exceed max_allowed_packet bytes (by sending the individual columns separately instead of combined into one packet for the whole row). For example, with max_allowed_packet set to 16MB, a row with three 10MB blob columns (30MB in total) could be inserted, but a single 20MB column still could not.
If you are interested in profiling this to see where it's so slow
I am. This library should be faster than Oracle's MySQL Connector/NET for all use cases, except that XyzAsync methods do add some overhead when they're actually implemented as async. Thanks for the test case; I will take a look later.
Which version of MySql.Data were you using? 8.0.22/8.0.23 has a severe performance regression when reading rows, so I'd be very surprised if it were the faster library.