DBD-mysql icon indicating copy to clipboard operation
DBD-mysql copied to clipboard

4.042 improperly encoding blobs when sql_type is SQL_UNKNOWN_TYPE

Open cthulhuology opened this issue 7 years ago • 110 comments

The utf8 encoding changes have resulted in a regression that has unexpected side effects. Consider the following table:

create table foo (foo longblob);

and script:

use DBI;

my $dbh = DBI->connect('DBI:mysql:database=test','desktop','',{ mysql_enable_utf8 => 1 }) or die $DBI::errstr;

my $sth = $dbh->prepare('INSERT INTO foo (foo) values (?)'); $sth->execute("I18N Web Testing æøå");

my $rth = $dbh->prepare('SELECT foo FROM foo'); $rth->execute(); while (@row = $rth->fetchrow_array) { print $row[0],"\n"; }

Now the longblob mysql type should be considered a SQL_BLOB, but because the exec call doesn't call bind_param with the attrib set to SQL_BLOB, the default value of 0 is used SQL_UNKNOWN_TYPE and passed to bind_param. If you sv_dump at 939 in dbdimp.c you'll see:

SV = PV(0x162d950) at 0x15f8630 REFCNT = 1 FLAGS = (POK,IsCOW,READONLY,PROTECT,pPOK) PV = 0x15dc200 "I18N Web Testing \303\246\303\270\303\245"\0 CUR = 23 LEN = 25 COW_REFCNT = 0

And sql_type is the default value of 0.

the output of the scrip will print the value stored in the database:

I18N Web Testing æøå`

because it is double encoding the characters, and not preserving the blob value because the sql_type_is_binary check returns false because the SQL_UNKNOWN_TYPE value of sql_type.

One example of widely used code that exhibit this behavior is Apache::Session (Apache::Session::MySQL), which calls bind_param explicitly but never passes the SQL_BLOB at and of the call sites. Since Apache::Session is using a Storable (which is binary data stored in a blob) the additional UTF8 encoding can result in data corruption which can cause perl to crash with an OOM error when materialize is invoked. This is a potential security threat.

cthulhuology avatar Apr 06 '17 19:04 cthulhuology

The whole mysql_enable_utf8 was just experimental and totally broken prior to DBD::mysql 4.042. Experimental status was also written in documentation, see: https://metacpan.org/pod/release/CAPTTOFU/DBD-mysql-4.033/lib/DBD/mysql.pm#mysql_enable_utf8

I'm not sure what we can done here. Basically applications which trying to use experimental broken option provided by DBD::mysql < 4.042 and just broken.

Due to some fundamentals of perl scalars and also MySQL protocol itself, DBD::mysql does not type of bind parameter, so driver can use logic which you already described. When mysql_enable_utf8 is 0, then nothing is encoded/decoded and so treated as binary sequence of octets.

pali avatar Apr 07 '17 22:04 pali

In other words: DBD::mysql does not know that a parameter is being used for insert into a blob column. Since you set mysql_enable_utf8 it will encode it, unless bound with the SQL_BLOB or similar type. On retrieval it knows that the column is a binary type, so it does not decode it. The behavior you describe only worked accidentally before.

Grinnz avatar Apr 07 '17 22:04 Grinnz

The whole mysql_enable_utf8 was just experimental and totally broken prior to DBD::mysql 4.042.

I used it for 10 years without problems. I’m sure I would have had problems if I had tried things where the incongruity of its design got exposed, but I never needed to. So my code had no bugs – even though DBD::mysql was designed incorrectly.

Worse: http://grep.cpan.me/?q=mysql_enable_utf8+file%3A.pm There’s plenty of infrastructure-level code just on CPAN alone which sets this flag on behalf of the user.

In the face of these realities, the fact that the documentation declared the feature experimental back when it was introduced, 11 years ago, is simply irrelevant today.

But now mysql_enable_utf8 is fixed – I‘m not going argue about that. But users’ code that wasn’t broken before now is. And this was simply unnecessary.

ap avatar Apr 08 '17 13:04 ap

So... what do you suggest? As wrote I have no idea what can be done here. Probably the best solution would be to fix broken code.

And about mysql_enable_utf8 usage on cpan, you do not know how many of those modules are broken and how many expect that mysql_enable_utf8 is not buggy? If module is multi-dbd and set mysql_enable_utf8, pg_enable_utf8 and also sqlite_unicode then it expects that mysql_enable_utf8 is working (which means needs DBD::mysql >= 4.042).

pali avatar Apr 08 '17 13:04 pali

Of course many of these CPAN modules will produce broken results when used in any somewhat unusual configuration where the broken design actually matters. But a lot of people don’t try that, and for them it ends up not mattering that the design is not sound. For these people, there is no gain from fixing the broken design, it just creates busywork.

I’m not sure what I would suggest, specifically. The general idea is always the same though: don’t change how mysql_enable_utf8 works – for now at least –, but add the more correct encoding approach, and then give users a way to ask for it.

How exactly you do that depends on the code and what you need to achieve long term. New flag e.g. mysql_enable_encoding? New flag value e.g. mysql_enable_utf8 => 2? Some other approach maybe? Also, do you warn when mysql_enable_utf8 is used? Maybe detect problematic uses and only warn then? Depends on what’s easier to implement and cleaner as an interface. Overall I would weigh that choice based on how difficult it is to support both the old and new thing long-term. The easier it is to keep the old stuff around, the less old users should be pushed away from it.

(Of course when it comes to the docs, the old approach should be de-emphasised, and the new way should be shown as the obvious/standard way of doing things, to steer new users in that direction. Also, to get people away from the old way it is important to make sure that the new way doesn’t make too many things harder to do than the old way – esp. commonly needed things. Note however the old way must stay documented – because there is still code that uses it, and people who have to maintain that code need to be able to figure out what it does.)

Something along these lines.

ap avatar Apr 08 '17 17:04 ap

Problem with old mysql_enable_utf8 implementation is that it behave "pseudo-randomly". For frozen combination of perl at specific version, all cpan modules at specific version and user application it acts deterministic. But once either perl or any cpan module is updated old mysql_enable_utf8 (prior to 4.042) can change result of DBD::mysql. So result is basically unpredictable.

If users (and authors of other modules) are not going to fix broken code and need current behavior they should stay at specific version of perl and all cpan modules. Otherwise it is not guaranteed that also old DBD::mysql version (prior to 4.042) provide same results. It can lead to results: after updating cpan module XYZ to new version MySQL server in user application started returning wrong result, where XYZ can be module which has nothing to do with MySQL.

Supporting such pseudo-random behavior is not easy and funny.

And I really suggest to conserve such fragile code and disallow updating code around (which can fully break it).

What is probably more important: how to write mysql_enable_utf8 code which tries to be compatible with both old buggy DBD::mysql versions and new.

I created this pull request which adds "hacks" into documentation: https://github.com/perl5-dbi/DBD-mysql/pull/119

Please comment, test or provide another suggestion for it. To improve situation.

pali avatar Apr 10 '17 19:04 pali

But once either perl or any cpan module is updated old mysql_enable_utf8 (prior to 4.042) can change result of DBD::mysql.

Any CPAN module? How?

ap avatar Apr 12 '17 22:04 ap

If a CPAN module provides a string with a different internal representation, even if it's the same string, previous versions of DBD::mysql will treat it differently. We had this issue with output from Spreadsheet::ParseExcel that returned (perfectly normal) un-upgraded strings with byte values between 128 and 255, probably due to some dependency, causing errors when used as params to DBD::mysql with mysql_enable_utf8 set, as the internal representation it used was not UTF-8 encoded. Any module that provides you with a string could change what representation it gives you and break DBD::mysql while the string value is unchanged.

A more complete discussion of the issue is in https://rt.cpan.org/Ticket/Display.html?id=87428

Grinnz avatar Apr 12 '17 22:04 Grinnz

@Grinnz is right. And moreover un-upgraded strings are created by default from string constants if you do not use utf8 in your file from which those strings come from. Also some older perl versions generate un-upgraded string even if \N{U+XX} construction is used. And any perl function can upgrade or downgrade string as needed.

So please look at pull request https://github.com/perl5-dbi/DBD-mysql/pull/119 if it helps you or not...

pali avatar Apr 15 '17 12:04 pali

@Grinnz:

If a CPAN module provides a string with a different internal representation, even if it's the same string, previous versions of DBD::mysql will treat it differently.

Yeah, but that isn’t DBD::mysql changing its behaviour. That’s some other module changing its behaviour. If the user of both modules compensates for the change in the other module, DBD::mysql will get the same inputs as before, and will behave the same way as before. The behaviour of DBD::mysql is not changing, just because of an upgrade to some other module.

It’s perfectly predictable what it does, even if it’s not correct. “When the inputs change, the behaviour changes; when the inputs are the same, the behaviour is the same” is the opposite of random.

A more complete discussion of the issue is in https://rt.cpan.org/Ticket/Display.html?id=87428

Funny, there’s an almost verbatim precursor of my suggestion (but better thought out) already in there. As far as I can tell, doing it that would have worked just fine… with no breakage. Taking that approach would have given you what you wanted, without causing anyone else any pain. So why choose to cause pain when you don’t even gain anything from it?

@pali: So…

Supporting such pseudo-random behavior is not easy and funny.

… if this meant what @Grinnz was talking about, then I disagree that that behaviour is random (even just pseudo), I disagree that it’s hard to support solely on that basis, and I disagree about what’s not funny here (namely, changing it).

So please look at pull request #119 if it helps you or not...

That is indeed useful information in order to cope with the problems of the old interface. But I don’t see anything in there which would argue against adding the correct interface under a different flag – rather than changing mysql_enable_utf8 and thereby breaking already-working code.

Has there been any technical argument for not using a different flag? So far, all I have seen are arguments about the correctness of not using the UTF8 flag – which I do not disagree with at all.

ap avatar Apr 15 '17 17:04 ap

That’s some other module changing its behaviour.

I cannot agree here with you, because calling utf8::upgrade() or composing string via \x{...}\x{...} sequences, or via \N{...} is not behavior change. It is same behavior with exactly same function API.

What is changing here is low level representation, and you can compare it compiling program with different optimization flags, e.g. -O0 or -O2 in gcc. In both cases you get same program, just compiled with different assembler instructions.

And in perl usage of utf8::upgrade() or \N{...} is similar behavior. Perl just translate source file to different opcodes, but meaning of program is same...

Note that perl operators like . (dot, concatenation) or perl functions like substr calls utf8::upgrade() in some cases. And there is no guarantee that due to optimization in future it would not be changed (under which conditions is utf8::upgrade() called).

I wrote it is pseudo-random (not fully-random), which means it is deterministic. So I could agree with you that behavior can be predicted.

Originally I talk with DBD::mysql maintainers (half a year ago?) and wrote first proposed solution that introduction of new attribute for utf8 with hacking code around with supporting both of them can be possible. But IIRC we dismiss it due to introduction of big mess in code and reason that current behaviour does not make sense in perl.

Due to above fact and also that calling utf8::upgrade() is legal and already done by perl itself at lot of places and also that in documentation was explicitely written "experimental and may change" state, I introduced patches which fixes mysql_enable_utf8, rather then having two different UTF-8 supports... Note that I explicitely stated in pull request https://github.com/perl5-dbi/DBD-mysql/pull/67 that it would change behavior for more then month in review state and about 3 months there was on cpan engineering version. So I though it would be OK as nobody was against.

pali avatar Apr 15 '17 19:04 pali

Originally I talk with DBD::mysql maintainers (half a year ago?) and wrote first proposed solution that introduction of new attribute for utf8 with hacking code around with supporting both of them can be possible. But IIRC we dismiss it due to introduction of big mess in code and reason that current behaviour does not make sense in perl.

Please reconsider this stance. In as much as the new behavior is "correct", the old behavior was widely deployed and has a tremendous amount of inertia. Migrating to the new behavior will require coordination and substantial effort. We (Request Tracker) are willing to undergo that effort, but we do need to prepare for it. Given that we are in the middle of a stable series and DBD-mysql 4.042 is getting traction, we are unable to do such a migration right now. Our only recourse right now is to refuse to install under 4.042, forcing users to install 4.041 by hand.

sartak avatar May 22 '17 16:05 sartak

old behavior was widely deployed

What we learnt from this is that developers do not read documentation nor care about states like "This option is experimental and may change in future versions.".

Now I'm thinking how could be experimental features which are not expected to work correctly or which can be removed/changed at any time introduced into modules...

pali avatar Jun 16 '17 13:06 pali

What we learnt from this is that developers do not read documentation nor care about states like "This option is experimental and may change in future versions.".

How seriously can you expect them to take that after the feature has been “experimental” without changes in behaviour for 10 years?

ap avatar Jun 16 '17 14:06 ap

How seriously can you expect them to take that after the feature has been “experimental” without changes in behaviour for 10 years?

Furthermore see this commit from a year ago and especially its message: 026fa02d80afc7f3473e67b9f63fe1b74575da3e

sartak avatar Jun 16 '17 16:06 sartak

Yeah, I removed the 'this is experimental' message last year because I knew lots of software depended at that point on this switch because it was the only way to get some form of sane behaviour from DBD::mysql - which had lots of other issues, but also mostly worked for most people.

I'm really not so happy with the situation that emerged. I though we fixed lots of smaller issues but I also thought we had enough unit test coverage to not break existing software using DBD::mysql for people. We explicitly sollicited feedback after posting development releases to CPAN by making blog posts on b.p.o, sending out mails to mailing lists, and of course we only got actual feedback (like this issue) after we released the stable.

@CaptTofu please advise!

I think we need to look at doing what the original code did and do whatever worked for most users.

mbeijen avatar Jun 16 '17 18:06 mbeijen

It doesn’t seem too late to add a mysql_enable_unicode with the new behaviour and put mysql_enable_utf8 back the way it was. It’s been a couple months, but a couple months is not long: the cascade of packagers and package upgrades is much longer than that, so a lot of users are not yet affected.

ap avatar Jun 17 '17 05:06 ap

How seriously can you expect them to take that after the feature has been “experimental” without changes in behaviour for 10 years?

Because there were open bugs that it does not work for 10 years?

pali avatar Jun 17 '17 16:06 pali

What software doesn’t have “some open bugs”? I do not believe you hold even yourself to that standard. If that was the bar for what software is safe to use… what would be left?

ap avatar Jun 17 '17 19:06 ap

Hi, I'm an end user that would like to share with you the impact of this change. I work at a company that has 100GB of files stored in MySQL databases of 5+ years old. After upgrading this module all all user files that are inserted into tables with LONGBLOB columns are corrupt when they are retrieved. The manual says that BLOB files are not encoded but it does (see testing code below). Now we are challenged to find out which records are corrupt, and which ones are not. Unsetting the flag is not an option as all our text data is utf8 mixed with LONGBLOBs. The proposed solutions asks for a review of every written SQL statement and start specifying for each column its DataType. This is not user friendly as it is more time consuming to write code. This change breaks with Perl's paradigm that backwards compatibility is safeguarded by the habbit to test all changes against all exiting CPAN modules before releasing it. The arguments from this change are understandable from a technical point of view, but we prefer to maintain usability by keeping the module backwards compatible. Thank you for your understanding.

Manual text:

Input bind parameters of binary types (SQL_BIT, SQL_BLOB, SQL_BINARY, SQL_VARBINARY and SQL_LONGVARBINARY) are not touched regardless of the mysql_enable_utf8 attribute state.

However, the following code inserts corrupt binary data:

use DBI qw(:sql_types);

open FILE, "test.bin"; binmode FILE; read(FILE, my $file, 100000000); close FILE;

$dbh = DBI->connect($dsn,$user,$password,{mysql_enable_utf8 => 1}); $sth = $dbh->prepare("INSERT INTO pdo_blob (the_blob) VALUES(?)"); $sth->bind_param( 1, $file); # corruption #$sth->bind_param(1, $file, SQL_BLOB); # no corruption $sth->execute(); ($fileNew) = $dbh->selectrow_array("SELECT the_blob FROM pdo_blob ORDER BY id DESC LIMIT 1"); print ($file eq $fileNew ? "Equal" : "Not equal");

This change also make it impossible to use the following one-liner as it requires the data type of all placeholders to be specified: $dbh->do($SQL,undef,@$placeholderValues);

nightlight1 avatar Jun 28 '17 13:06 nightlight1

Input bind parameters of binary types ...

$sth->bind_param( 1, $file); # corruption

This parameter is not bound as binary type.

#$sth->bind_param(1, $file, SQL_BLOB); # no corruption

This one is.

This change also make it impossible to use the following one-liner as it requires the data type of all placeholders to be specified: $dbh->do($SQL,undef,@$placeholderValues);

It is limitation of MySQL. If mysql_server_prepare is turned off (by default), then all question marks are replaced by bind values in DBI client code. And DBI send to MySQL server one non-parametrized SQL string. If mysql_server_prepare is turned on, then DBI client send SQL statement with question marks to prepare on MySQL server and after that it send parameters. But MySQL server does not provide information if value assigned to question mark is binary or not. Therefore MySQL client has no idea if it is binary or not and user is responsible for it. It is limitation of MySQL.

pali avatar Jun 28 '17 13:06 pali

I understand the underlying motivation to make this change. You are right that is a bug that needs to be fixed. I wanted to share with you the impact of this change from a user perspective as I believe that this necessary fix comes with the side effect of corrupting data without the user/developer noticing it. I hope this will make the maintainer of this module consider to change the module in such a way that backwards compatibility is maintained.

nightlight1 avatar Jun 28 '17 13:06 nightlight1

@pali turning on mysql_enable_utf8 has been the least broken option for handling unicode with DBD::mysql for over a decade. It's been basically standard for e.g. any DBIx::Class app that uses unicode w/mysql for most of that time.

The only thing you can do here that won't cause massive corruption of people's production data is to keep it behaving the old way, even if the old way is stupid. You can add a warning that the old way is stupid and that people should consider migrating to the new way, but the old code must stay bugwards compatible or you're going to trash people's production data.

This is the sort of backcompat problem that most sucks to have as a maintainer, and I do very much feel for you, but there really ain't any alternative here. Sorry.

shadowcat-mst avatar Jun 28 '17 14:06 shadowcat-mst

Dear @pali Are you the maintainer of this module? What do you think about the below proposal together with mentioning in the documentation that a value 1 is not sufficient as it lacks encoding of the input statement+parameters but that it remains there for backward compatibility, and value 2 is preferred for correct UTF-8 handling in combination with the note that specifying the datatype of each placeholder value is mandatory for this setting?

Working of 4.041: mysql_enable_utf8 not set or 0: nothing (default) mysql_enable_utf8 set to 1:

  1. "SET NAMES utf8" at connect? Yes;
  2. Input statement and bind parameters encoded? No
  3. Retrieved columns decoded? Yes

Working of 4.042: mysql_enable_utf8 not set or 0: nothing (default) mysql_enable_utf8 set to 1:

  1. "SET NAMES utf8" at connect? Yes;
  2. Input statement and bind parameters encoded? Yes
  3. Retrieved columns decoded? Yes

Proposal for 4.043: mysql_enable_utf8 not set or 0: nothing (default) mysql_enable_utf8 set to 1:

  1. "SET NAMES utf8" at connect? Yes;
  2. Input statement and bind parameters enoded? No
  3. Retrieved columns decoded? Yes mysql_enable_utf8 set to 2:
  4. "SET NAMES utf8" at connect? Yes;
  5. Input statement and bind parameters enoded? Yes
  6. Retrieved columns decoded? Yes

nightlight1 avatar Jun 28 '17 15:06 nightlight1

@nightlight1 unfortunately it is not that simple. In the old behavior, the decision whether input parameters were encoded or not is based on the internal state of the perl string, not a clear user visible reason. This is part of the bug that was fixed. Also, retrieved columns were decoded in different circumstances as well. Regardless, you are on the right track, the backwards compatible fix to this would be to restore the behavior of mysql_enable_utf8 and mysql_enable_utf8mb4, and add a new option or option value for the correct behavior.

Grinnz avatar Jun 28 '17 15:06 Grinnz

@Grinnz And that is still not enough! Even without mysql_enable_utf8 DBD::mysql damage input strings. See this example with old DBD::mysql:

$ perl -MDBI -MData::Dumper -e '$Data::Dumper::Useqq = 1; my $dbh = DBI->connect("dbi:mysql:test", "test", undef); my $str2 = "\301" . "\N{U+10FF}"; my $str = substr($str2, 0, 1); print Dumper $dbh->selectall_arrayref("SELECT ?", {}, $str)'
$VAR1 = [
          [
            "\303\201"
          ]
        ];

You put "\301" on input and you woud get "\303\201" on output.

With new DBD::mysql you get correct "\301".

So you cannot support both old buggy behavior and also new perl-correct. I have no idea what else can be done there. Trying to support old buggy behavior would mean to introduce either new bugs or break non-UTF-8 support at all.

pali avatar Jun 28 '17 16:06 pali

Then it sounds like the bugfix with mysql_enable_utf8 disabled has not caused any problems so far, and that can remain. It is only the behavior with mysql_enable_utf8/mb4 changing that has caused the issues in this thread. Would it be possible to restore the old behavior with those switches, and add a new switch or value for the fixed UTF-8 behavior?

Grinnz avatar Jun 28 '17 16:06 Grinnz

Every day this current version is distributed it causes database corruption in production environments (every INSERT/UPDATE statement involving BLOB's causes corruption). In the event it is a challenge to find a solution it would be better for the time being to rollback the current (better) UTF8 implementation and put these changes back in a "development release" until all undesired side effects are covered.

nightlight1 avatar Jun 28 '17 17:06 nightlight1

It is not possible to avoid this "side effect" or "fix" it to recognize blob inputs, for reasons that were discussed earlier in this thread. The old way only accidentally worked for this use case and cannot be relied upon for a real fix. The only way to make it both correct for new code and bugwards compatible for old code is to put the correct behavior on a separate option.

Grinnz avatar Jun 28 '17 17:06 Grinnz

Absolutely. And this is what I want to propose at this point and actually put out there soon.

-- Michiel

Op wo 28 jun. 2017 om 19:06 schreef Dan Book [email protected]

It is not possible to avoid this "side effect" or "fix" it to recognize blob inputs, for reasons that were discussed earlier in this thread. The old way only accidentally worked for this use case and cannot be relied upon for a real fix. The only way to make it both correct for new code and bugwards compatible for old code is to put the correct behavior on a separate option.

— You are receiving this because you commented.

Reply to this email directly, view it on GitHub https://github.com/perl5-dbi/DBD-mysql/issues/117#issuecomment-311724830, or mute the thread https://github.com/notifications/unsubscribe-auth/AAoQMBFUQPh0WCRKycrFwcgzDzU7hMsYks5sIogdgaJpZM4M2DZu .

mbeijen avatar Jun 28 '17 17:06 mbeijen