
Support for streaming to a MySQL parameter from System.IO.Stream?

Open kendallb opened this issue 4 years ago • 48 comments

I'm working on porting Rebus to MySqlConnector. There is support for streaming from a source directly into the database via a database parameter, but this appears not to be supported with MySqlConnector. Is this a limitation of MySqlConnector, or a limitation of the MySQL wire protocol itself? Streaming the result set works; I just cannot find a way to stream data on insert.

System.NotSupportedException : Parameter type MemoryStream (DbType: Binary) not currently supported. Value: System.IO.MemoryStream

I am using it like this (stolen directly from the SQL Server version):

command.Parameters.Add("data", MySqlDbType.VarBinary, MathUtil.GetNextPowerOfTwo((int)source.Length)).Value = source;

kendallb avatar Feb 13 '21 07:02 kendallb

I got it working by simply turning the stream into a byte array and submitting that, but that is not particularly ideal if the data is large.
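For reference, a simplified sketch of that workaround (source is the incoming Stream; the whole payload ends up buffered in memory, which is what makes it less than ideal):

// Simplified sketch of the workaround: buffer the entire stream into a byte array first.
using (var buffer = new MemoryStream())
{
    await source.CopyToAsync(buffer);
    command.Parameters.Add("data", MySqlDbType.VarBinary).Value = buffer.ToArray();
}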

kendallb avatar Feb 13 '21 07:02 kendallb

Code in question is here:

https://github.com/kendallb/Rebus.MySqlConnector/blob/master/Rebus.MySqlConnector/MySqlConnector/DataBus/MySqlDataBusStorage.cs#L113

kendallb avatar Feb 13 '21 07:02 kendallb

is this a limitation of MySqlConnector, or a limitation of the MySQL wire protocol itself?

A little bit of both. 😀

MySqlConnector doesn't have support for it, hence the exception.

It could certainly be added (as a convenience method for this use case). However, the MySQL text protocol requires the entire INSERT SQL statement to be sent as a length-prefixed packet. Thus, MySqlConnector would have to convert the entire Stream to an in-memory blob; it couldn't really be streamed. (Various reserved bytes in the Stream have to be escaped, so the length can't be known ahead of time just by checking Stream.Length. MySqlConnector could read the Stream twice, once to compute the length and once to send it, but that might be inefficient, or simply not supported for forward-only Streams.)
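To illustrate why Stream.Length alone isn't enough, the length computation would have to look something like this (illustration only; the exact escape set shown here is an assumption, not MySqlConnector's actual code):

using System;

// Illustration only: the escaped length depends on the content of the data,
// not just on Stream.Length, because certain bytes expand to two bytes.
static long GetEscapedLength(ReadOnlySpan<byte> data)
{
    long length = data.Length;
    foreach (byte b in data)
    {
        // Assumed escape set for illustration: NUL, quotes, and backslash.
        if (b == 0x00 || b == (byte)'\'' || b == (byte)'"' || b == (byte)'\\')
            length++; // each of these becomes a two-byte escape sequence
    }
    return length;
}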

Still, this is probably worth adding, if only for API compatibility with other ADO.NET providers, even if it doesn't support actual streaming, at least initially.

bgrainger avatar Feb 13 '21 21:02 bgrainger

Right, it would be nice to support it just to make it more compatible with other ADO.NET connectors like SQL Server, even if under the hood it just does what I did and converts the stream into a byte array. That does mean it would crash if someone tries to insert a massive blob, but most folks don't have the max packet size for MySQL set to a massive number anyway (in our case it's 16MB for production), so unless the MySQL wire protocol supported proper streaming, it would still be limited to whatever that size is. Clearly SQL Server supports massive blobs, as the Rebus unit test for the code in question inserts a 100MB file, which naturally crapped itself on my test system with a 16MB max packet size :)

At the end of the day folks shouldn't be inserting massive blobs into MySQL anyway, as it's the wrong tool for the job. If you need to store massive blobs, they should be stored in a storage bucket somewhere (Amazon S3, Google, etc.) and a link to the resource stored in MySQL, so I don't think it's a big limitation.

But, it would be nice when porting code like this for it to just 'work', at least until you actually try to insert something massive :).

kendallb avatar Feb 13 '21 22:02 kendallb

That does mean it would crash out if someone tries to insert a massive blob, but most folks don't have the max packet size for MySQL set to a massive number anyway (in our case it's 16MB for production)

That's a good point: usually you're quite limited in the maximum blob size you can insert, due to this limit.

bgrainger avatar Feb 13 '21 22:02 bgrainger

Updated ADO.NET tests; it appears that SqlClient is the only other library that supports using a Stream as a parameter value.

Npgsql has decided not to support it (and throws a NotSupportedException): https://github.com/npgsql/npgsql/issues/550

Most other libraries "succeed" but write "System.IO.MemoryStream" to the database column, introducing silent data corruption.

bgrainger avatar Feb 14 '21 02:02 bgrainger

Yeah, it's six of one, half a dozen of the other. When you get a nice exception, at least you know stuff is not working and can fix it. In some ways I suppose having the developer specifically write the code like I did to turn it into a byte array means you are then fully aware of the consequences (memory usage). That might be better than having someone think they can stream a 1G file into a database blob only to find it runs out of memory :)

kendallb avatar Feb 14 '21 04:02 kendallb

I'm adding support for MemoryStream. If TryGetBuffer succeeds, that will be written directly to the packet (which should be efficient). Otherwise, ToArray() will be called, which will duplicate all the data before copying it (inefficient).

A different approach (e.g., one that reads repeatedly into a 64KiB buffer) could be implemented, which would work for all Stream types. It could replace the ToArray() code path in the future if that were a pressing need.

(A concern, as discussed above, with supporting any arbitrary Stream is that it might make people think they can stream a very large binary file to MySQL Server, but that's not actually possible.)
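Conceptually, the MemoryStream handling described above amounts to something like this (a sketch of the idea, not the actual MySqlConnector implementation):

using System;
using System.IO;

// Sketch of the MemoryStream fast/slow paths described above.
static ArraySegment<byte> GetPayload(MemoryStream stream)
{
    // Fast path: reuse the stream's existing buffer without copying.
    if (stream.TryGetBuffer(out ArraySegment<byte> buffer))
        return buffer;

    // Slow path: ToArray() duplicates all the data before it is written.
    return new ArraySegment<byte>(stream.ToArray());
}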

bgrainger avatar Feb 14 '21 16:02 bgrainger

OK, I think supporting any stream should be implemented to make it really useful. A memory stream by itself is not all that useful, as the data will already be in memory in that case, so I'm not sure it buys you much other than avoiding the second copy. A lot of the time it's likely to be a real stream like a file stream or network stream, so the buffering approach would make the most sense to me?

If the buffering approach is implemented, does that mean it would avoid the max packet size issue? Or will that still be a problem? If it can avoid that, then at least it would be possible to stream in blob data larger than the max packet size.

kendallb avatar Feb 14 '21 17:02 kendallb

MemoryStream is in 1.3.0-beta.4: https://www.nuget.org/packages/MySqlConnector/1.3.0-beta.4

MySqlBulkCopy is the closest thing to streaming that MySqlConnector has; the right way to support Stream is probably to allow it to be part of a DataTable that can be loaded through MySqlBulkCopy.
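For example, something along these lines (a sketch; the attachments table, column names, and payload.bin file are made up, and the blob is still fully buffered in memory here):

using System.Data;
using System.IO;
using MySqlConnector;

// Sketch: load one row containing blob data via MySqlBulkCopy.
// MySqlBulkCopy uses LOAD DATA LOCAL under the hood, so the connection
// string typically needs AllowLoadLocalInfile=true.
var table = new DataTable();
table.Columns.Add("id", typeof(int));
table.Columns.Add("data", typeof(byte[]));
table.Rows.Add(1, File.ReadAllBytes("payload.bin")); // still fully in memory

using var connection = new MySqlConnection("Server=...;AllowLoadLocalInfile=True");
connection.Open();
var bulkCopy = new MySqlBulkCopy(connection) { DestinationTableName = "attachments" };
bulkCopy.WriteToServer(table);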

bgrainger avatar Feb 14 '21 18:02 bgrainger

It's crazy, but I have never had a need for the DataTable class myself (we use everything via a micro ORM), so it's not clear to me how you would stream a file off disk, for instance, into a data table that can then be streamed to MySQL via the bulk copy class. Unfortunately the SQL Server sample code just shows how to toss data from one table to another using it.

kendallb avatar Feb 14 '21 18:02 kendallb

Assuming it's possible to copy the data over the MySQL wire protocol via a 64K buffer and avoid the max packet size issues, I think that's the best approach. If that is not possible, then maybe I need to learn how to use a data table and bulk copy in that way for Rebus.

kendallb avatar Feb 14 '21 18:02 kendallb

DataTable is just currently the public API for MySqlBulkCopy; it wouldn't be a hard requirement.

bgrainger avatar Feb 14 '21 18:02 bgrainger

Assuming it's possible to copy the data over the MySQL wire protocol via a 64K buffer and avoid the max packet size issues, I think that's the best approach.

AFAIK that's only possible by first executing a LOAD DATA command, then streaming the data to the server. MySqlConnector exposes that through MySqlBulkLoader (which would be too low-level for this need) and MySqlBulkCopy (which uses it under the hood).

But an INSERT statement can never be streamed that way, so supporting an arbitrary Stream for MySqlParameter.Value would run into hard max_allowed_packet limitations quickly.

bgrainger avatar Feb 14 '21 18:02 bgrainger

Looks like there has to be a way to support it, as it’s officially supported for the C API?

https://dev.mysql.com/doc/c-api/8.0/en/mysql-stmt-send-long-data.html

kendallb avatar Feb 14 '21 18:02 kendallb

https://stackoverflow.com/questions/29600738/optimal-way-to-store-blobs-larger-than-max-allowed-packet-in-mysql-innodb

kendallb avatar Feb 14 '21 18:02 kendallb

I don't think so, since from that page:

The max_allowed_packet system variable controls the maximum size of parameter values that can be sent with mysql_stmt_send_long_data().

EDIT: This may just mean the maximum size of each chunk, not the total length?

bgrainger avatar Feb 14 '21 18:02 bgrainger

people say it works for Java and PHP, so it must be possible?

https://stackoverflow.com/questions/3754238/is-there-any-way-to-insert-a-large-value-in-a-mysql-db-without-changing-max-allo

kendallb avatar Feb 14 '21 18:02 kendallb

A good point from that SO question:

Still, even if I manage to INSERT, say, 20MB BLOB with max_allowed_packet=4MB how do I SELECT it back? I don't see how I can do it.

It seems pointless to invent a way of inserting arbitrarily large data that can't be retrieved. Additionally, using MySqlBulkCopy to insert a single row would be a rather awkward API.

So perhaps INSERT should just support arbitrary Stream parameters and it's not going to be that big a problem in practice to buffer them in memory before sending them to the server.

bgrainger avatar Feb 14 '21 18:02 bgrainger

You can already receive large data as it’s already possible with the connector to read data into a stream.

kendallb avatar Feb 14 '21 18:02 kendallb

You can already receive large data as it’s already possible with the connector to read data into a stream.

No, because there's no way for the server to send it without exceeding max_allowed_packet. The Stream is just an abstraction. (And in MySqlConnector, it just creates a read-only MemoryStream on the data that's already in memory.)
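For what it's worth, the read side looks roughly like this today (a sketch; the messages table, body column, and output file name are just example names). The data is already fully materialized before GetStream returns:

// Reading a blob column as a Stream. This works, but the data has already been
// read into memory; GetStream just wraps the in-memory bytes.
using var command = new MySqlCommand("SELECT body FROM messages WHERE id = @id;", connection);
command.Parameters.AddWithValue("@id", id);
using var reader = command.ExecuteReader();
if (reader.Read())
{
    using var blob = reader.GetStream(0);
    using var file = File.Create("body.bin");
    blob.CopyTo(file);
}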

bgrainger avatar Feb 14 '21 18:02 bgrainger

people say it works for Java and PHP, so it must be possible?

I can look into implementing mysql_stmt_send_long_data in MySqlConnector. It only works for the "binary" protocol, so you would have to call MySqlCommand.Prepare in order to use it.
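If that were implemented, usage might look something like this (hypothetical sketch; the data_bus table is made up, and passing a Stream as the parameter value is exactly what isn't supported today):

// Hypothetical usage if COM_STMT_SEND_LONG_DATA support were added.
using var command = new MySqlCommand("INSERT INTO data_bus (id, data) VALUES (@id, @data);", connection);
command.Parameters.AddWithValue("@id", id);
command.Parameters.Add("@data", MySqlDbType.LongBlob).Value = fileStream; // any forward-only Stream
await command.PrepareAsync();        // binary protocol is required for send_long_data
await command.ExecuteNonQueryAsync();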

bgrainger avatar Feb 14 '21 18:02 bgrainger

So this example is pretty pointless then? Never tried it, since I can't get data into the DB. Lol

https://dev.mysql.com/doc/connector-net/en/connector-net-programming-blob-reading.html

kendallb avatar Feb 14 '21 18:02 kendallb

people say it works for Java and PHP, so it must be possible?

I can look into implementing mysql_stmt_send_long_data in MySqlConnector. It only works for the "binary" protocol, so you would have to call MySqlCommand.Prepare in order to use it.

Sure, if a prepared statement is needed, at least it solves the problem and I can make all this work nicely in Rebus.

kendallb avatar Feb 14 '21 18:02 kendallb

Also, as much as I think storing large data in MySQL is silly, in the case of Rebus as a message transport it's quite useful to support data larger than the max packet size, because the point of a message queue is that it's all transient data. It goes in and out really fast, and it makes things so much simpler if it's all just handled in the message structures and APIs, rather than needing to resort to something like a shared storage block.

The next thing I need to figure out is why the message reading via MySQL is so much slower than the SQL Server version. Alas, we still use MySQL 5.7, which does not support skipping locked rows in selects like SQL Server does, so their simple approach of deleting the row in the transaction and having other readers simply ignore it can't work. It just results in deadlocks.

kendallb avatar Feb 14 '21 18:02 kendallb

The documentation for COM_STMT_SEND_LONG_DATA: https://dev.mysql.com/doc/internals/en/com-stmt-send-long-data.html

MySqlConnector could determine if a prepared command had a Stream supplied for any parameter. If so, it could send the data from the stream in repeated chunks using COM_STMT_SEND_LONG_DATA. Then it could execute the statement, binding the rest of the parameters.

The implementation problem would be that currently, multiple commands are batched together into one packet. (MySQL doesn't allow multiple statements (e.g., INSERT INTO tbl1 VALUES(...); INSERT INTO tbl2 VALUES(...);) to be prepared, but MySqlConnector permits this by breaking the statement into multiple parts, which are buffered in memory and then all sent together.) To enable data to be streamed, this buffering would have to be undone. (This isn't impossible; it just makes the implementation slightly more difficult than would be ideal.)
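Conceptually, the streaming part would be a loop like this (a protocol-flow sketch; SendLongDataAsync and ExecutePreparedStatementAsync are hypothetical helpers, not MySqlConnector APIs):

// Sketch: send a Stream-valued parameter to the server in chunks with
// COM_STMT_SEND_LONG_DATA, then execute the prepared statement.
const int ChunkSize = 64 * 1024;
var chunk = new byte[ChunkSize];
int bytesRead;
while ((bytesRead = await stream.ReadAsync(chunk, 0, ChunkSize)) > 0)
{
    // Each packet carries: statement id, parameter index, and the raw chunk bytes.
    await SendLongDataAsync(statementId, parameterIndex, chunk.AsMemory(0, bytesRead));
}

// COM_STMT_EXECUTE then binds the remaining (non-streamed) parameters and runs the statement.
await ExecutePreparedStatementAsync(statementId);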

bgrainger avatar Feb 14 '21 21:02 bgrainger

Interesting. So the solution would be to not do the statement batching when a stream is involved.

Does MySqlConnector only do the splitting if the statements are prepared? I just found something interesting. As I mentioned above, I was looking into how to speed up the performance of Rebus.MySqlConnector compared to SQL Server. The MySqlConnector version was able to insert data into the message queue way faster than SQL Server does when using the lease transport, but SQL Server blew the doors off MySQL when it came to pulling the data back out. Part of the problem is the lack of locked-row skipping for select statements in MySQL 5.7, but that's not all of it. I tried changing things somewhat to speed it up, but did not succeed. Here are some results:

SQL Server:

*** Using NORMAL SQL transport ***
Inserted 1000 messages in 0.1 s - that's 6792.9 msg/s
1000 messages received in 0.1 s - that's 10363.0 msg/s

*** Using LEASE-BASED SQL transport ***
Inserted 1000 messages in 3.3 s - that's 305.7 msg/s
1000 messages received in 0.3 s - that's 2944.6 msg/s

MySqlConnector:

*** Using NORMAL SQL transport ***
Inserted 1000 messages in 0.2 s - that's 6537.2 msg/s
1000 messages received in 5.3 s - that's 188.4 msg/s

*** Using LEASE-BASED SQL transport ***
Inserted 1000 messages in 0.2 s - that's 4059.2 msg/s
1000 messages received in 5.3 s - that's 188.7 msg/s

Now what is super interesting is that I just back-ported the same code to run on the Oracle connector, and it was WAY faster than MySqlConnector for the read operations (not as fast as SQL Server, but significantly faster):

Oracle Connector:

*** Using NORMAL SQL transport ***
Inserted 1000 messages in 0.5 s - that's 2061.7 msg/s
1000 messages received in 0.8 s - that's 1295.1 msg/s

*** Using LEASE-BASED SQL transport ***
Inserted 1000 messages in 0.6 s - that's 1757.0 msg/s
1000 messages received in 0.9 s - that's 1090.4 msg/s

So clearly, from the above, the MySqlConnector version is quite a bit faster than the Oracle version for inserting, but for receiving it is not. The big difference is that this library is written to be async all the way through, hence all of the SQL operations in the library are async calls, but as we know the Oracle connector is not really async at all; it just does sync calls with async semantics. I suspect that is why the inserts are so much faster, since it can send a lot more over due to using async. But for the receives, I wonder if either the async stuff is making it slower, or if it's something to do with the statement batching you mentioned above?

The reason I ask is that the receive operation is implemented as multiple SQL statements that all get sent together in the same transaction, designed in such a way that we pull out the next message, then update it to mark it as being processed (to avoid the row level locking stuff I mentioned), and then select the data out. Very similar to how SQL Server is done, except that with SQL Server it does not mark it as processing; it deletes it and keeps the transaction open until the message is successfully consumed (which will not work without locked-row skipping for selects, which is not in 5.7).

https://github.com/kendallb/Rebus.MySqlConnector/blob/master/Rebus.MySql/MySql/Transport/MySqlTransport.cs#L295

SELECT id INTO @id
FROM {tableName} 
WHERE visible < now(6) AND 
	  expiration > now(6) AND
	  processing = 0 
ORDER BY priority DESC, 
		 visible ASC, 
		 id ASC 
LIMIT 1
FOR UPDATE;

SELECT id,
	   headers,
	   body
FROM {tableName}
WHERE id = @id
LIMIT 1;

UPDATE {tableName} 
SET processing = 1 
WHERE id = @id;

SET @id = null;

So either this statement splitting stuff you mentioned is what is slowing it down, or it's that this particular operation does not do so well with async?

kendallb avatar Feb 14 '21 23:02 kendallb

No, I don't think it's the async stuff. In that test I can easily disable all the async so only one operation runs at a time, and that dramatically slowed down the MySqlConnector version as well as the Oracle one, but the Oracle one was still much faster: 9.3 msg/s for MySqlConnector and 626.4 msg/s for Oracle. So quite a bit slower in both cases.

If you are interested in profiling this to see where it's so slow, the test to run is this one:

https://github.com/kendallb/Rebus.MySqlConnector/blob/master/Rebus.MySql.Tests/Transport/TestMySqlTransportReceivePerformance.cs

You need a database user called mysql with the password mysql that has full access to the rebus% databases (and create a schema called rebus2_test).

kendallb avatar Feb 14 '21 23:02 kendallb

MySqlConnector could determine if a prepared command had a Stream supplied for any parameter. If so, it could send the data from the stream in repeated chunks using COM_STMT_SEND_LONG_DATA.

Well, so much for that theory. If you actually try this, you get the exception Parameter of prepared statement which is set through mysql_send_long_data() is longer than 'max_allowed_packet' bytes.

Sample code to implement this is here: https://github.com/bgrainger/MySqlConnector/tree/send-long-data

The only benefit this provides is being able to insert a row whose individual columns are less than max_allowed_packet bytes, but whose total size would exceed max_allowed_packet bytes (by sending the individual columns separately instead of combined into one packet for the whole row).

bgrainger avatar Feb 15 '21 00:02 bgrainger

If you are interested in profiling this to see where it's so slow

I am. This library should be faster than Oracle's MySQL Connector/NET for all use cases, except that XyzAsync methods do add some overhead when they're actually implemented as async. Thanks for the test case; I will take a look later.

Which version of MySql.Data were you using? 8.0.22/8.0.23 has a severe performance regression when reading rows, so I'd be very surprised if it were the faster library.

bgrainger avatar Feb 15 '21 00:02 bgrainger