sqlx icon indicating copy to clipboard operation
sqlx copied to clipboard

fix: (#3387) use collation to determine if string type is compatible

Open DrewMcArthur opened this issue 1 year ago • 3 comments

Does your PR solve an issue?

fixes #3387 fixes #3390

  • added impl TryFrom<u16> for Collation
  • added collation field to TypeInfo
  • updated str compatibility check to compare with Collation::binary instead of ColumnFlags::Binary
  • added a test that covers my case in #3387, failed on main, and passes now. (let me know if we'd like to make this test more robust)

todo

  • [ ] add test case for #3409

DrewMcArthur avatar Aug 02 '24 15:08 DrewMcArthur

@alu This fixes only for MySQL, not for SQLite one. I understand that original question is for MySQL, but SQLite and presumably PostgreSQL are also affected.

Please, tell me if i need to create a separate issue for SQLite.

eirnym avatar Oct 09 '24 12:10 eirnym

can you folks merge this?

alvico avatar Oct 17 '24 14:10 alvico

@eirnym oh good catch - my bandwidth is pretty low right now, so if you wanted to open a PR into my branch with those changes, or even just the tests for them that'd be awesome!

@alvico I'm not sure their release schedule, but you can target this branch in your Cargo.toml if you'd like to have the fix locally.

DrewMcArthur avatar Oct 17 '24 14:10 DrewMcArthur

any news on merging this?

alvico avatar Nov 11 '24 09:11 alvico

Would love to see this merged. Just hit this problem today and the Vec workaround is a mess.

daveajones avatar Mar 21 '25 22:03 daveajones

Sorry it took so long to get to this. I really wanted to get #3383 done for 0.9.0 so this had to wait.

Conflating the method of sorting stored data and the character set being used over the wire is one of several design decisions that has made me come to really loathe MySQL. They could have sent the character set ID instead, for not much extra overhead (it's just one more lookup, probably entirely in-memory), and saved the rest of us a ton of headache.

On the surface, I don't see much point in cataloguing collations. We can't treat them as a closed set because new ones are being added all the time, and they're not always UTF-8 either, so we can't assume anything about a collation we don't recognize. The Collation enum is mostly vestigial.

I was thinking there's only one collation we should care about, and that's binary. Anything else we can treat as tentatively compatible with String. Cause ultimately, even technically incompatible character sets still have some amount of overlap (the lower half of latin1 is still ASCII), so if the actual data is compatible with UTF-8, who are we to tell the user they can't try to decode it?

Ultimately, it's a pretty clear logic error if the user tries to encode or decode a string using a completely incompatible encoding.

I think we could catch this if we really wanted to, but I'm wondering about the ratio of value to effort there.

I guess one thing we should worry about is when a given sequence of bytes is valid for two different encodings, but doesn't actually represent the same character. That's tantamount to silent memory corruption, which would be pretty bad.

We could probably do a similar thing to what we do in Postgres, and query the metadata for a given collation when we see it. Then we can actually check if the character set is compatible, since that's a significantly smaller set of possibilities, and doesn't change much.

We can't do that in every situation, however; during the simple query flow (not using a prepared statement), the column data immediately precedes the data we need to decode, so there's no opportunity to stop and issue another query first.

So ultimately, we probably need to catalogue collations anyway so we don't always have to query them.

But Collation still probably shouldn't be a simple enum over u8 since it's not a closed set. It should likely be closer to the internal representation of PgTypeInfo.

Given that this is a major departure from the current PR and that your last message mentioned you didn't have the bandwidth to finish this, I'm going to close. However, I'm going to follow up with my own PR cause this really needs to be fixed.

By the way, this is not an issue for Postgres or SQLite. SQLite defines strings to always be UTF-8, and Postgres transcodes to/from the character set requested by the client. Any issue in other databases which seems to be related is almost certainly not. Please open a search for an existing issue specific to that database or open a new one.

abonander avatar Jul 06 '25 06:07 abonander