
Metadata incompatibility between single-byte charsets and UTF8 charset

Open arkinform opened this issue 6 months ago • 10 comments

In our application we have huge databases with thousands of tables in the WIN1251 charset. Recently we decided to add full Unicode support to our applications and encountered several "blocking" issues:

  1. Record size limit. We have a lot of tables with many varchar fields, and in UTF8 they "reserve" 4x more space compared with a single-byte charset, so many tables can't be created in a UTF8 database because of the record size limit. Hopefully it's going to be removed in 6.0: https://github.com/FirebirdSQL/firebird/pull/7332

  2. In UTF8 we can't declare a varchar longer than 8191 characters. We can argue a lot about how it could be replaced with blobs, refactored, etc., but the point is that in huge legacy systems such changes are very hard to implement, especially if you have hundreds of on-premise installations with thousands of custom reports, integrations, etc.

  3. Indexes on long varchars can't be created. For example, with a single-byte charset it's possible to create an index on varchar(1024), but not with the UTF8 charset. Of course, we are not going to search on the full length, but we need "starts with" searches while still storing long strings; in healthcare systems this is sometimes really needed. Splitting into several fields or creating computed indexes is not always a feasible solution. @dyemanov mentioned that there are some ideas how this could be "fixed"; it would be useful to have a link to the task on GitHub. (A rough arithmetic sketch of points 1 and 3 follows after this list.)
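
To make points 1 and 3 concrete, here is a rough arithmetic sketch in Python. The figures (a ~64KB record size limit, an index key limit of roughly a quarter of the page size, 4 bytes reserved per UTF8 character) are approximations used for illustration, not exact engine internals:

```python
# Rough arithmetic only; the real limits include per-field and per-key overhead.
RECORD_LIMIT_BYTES = 64 * 1024            # approximate record size limit before 6.0

def index_key_limit(page_size: int) -> int:
    return page_size // 4                 # roughly a quarter of the page size

def varchar_bytes(chars: int, charset: str) -> int:
    return chars * (4 if charset == "UTF8" else 1)   # UTF8 reserves 4 bytes per char

# Index on VARCHAR(1024) with an 8KiB vs. a 32KiB page size:
print(varchar_bytes(1024, "WIN1251") <= index_key_limit(8192))    # True  (1024 <= 2048)
print(varchar_bytes(1024, "UTF8")    <= index_key_limit(8192))    # False (4096 >  2048)
print(varchar_bytes(1024, "UTF8")    <= index_key_limit(32768))   # True  (4096 <= 8192)

# Ten VARCHAR(2000) fields in one record:
print(10 * varchar_bytes(2000, "WIN1251") <= RECORD_LIMIT_BYTES)  # True  (20000 bytes)
print(10 * varchar_bytes(2000, "UTF8")    <= RECORD_LIMIT_BYTES)  # False (80000 bytes)
```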

So, let's focus on the main blocking issue which is still relevant in 6.0: the 8191-character varchar limit in the UTF8 charset. The first obvious approach is to increase the internal varchar byte limit for the UTF8 charset while keeping the current 32k byte limit for single-byte charsets. It is the best solution, as it ensures metadata compatibility and equal character limits across charsets.

If increasing the internal varchar byte limit is not technically possible in the current state of the Firebird project, then a dynamic byte limit can be considered as a temporary measure until Firebird has that capability. For example, with a dynamic limit you could store up to 32k Latin characters in a UTF8 varchar, but only 16k Cyrillic characters in the same varchar, etc. In practical cases this would raise the effective varchar limit for most European languages with mixed Latin character usage from 8k to more than 16k characters. That is much more acceptable for Unicode migration in many cases, and at least at the declaration level it makes the database DDL completely interchangeable between single-byte charsets and the UTF8 charset.
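
For illustration, a minimal Python sketch of how such a dynamic limit would behave (the 32KB budget and the counting loop are assumptions for demonstration, not engine code):

```python
# Quick illustration of the "dynamic byte limit" idea: with a fixed byte
# budget, the number of characters that fit depends on the text itself.
BYTE_BUDGET = 32 * 1024   # assumed 32KB budget per varchar value

def chars_that_fit(text: str, budget: int = BYTE_BUDGET) -> int:
    used = 0
    count = 0
    for ch in text:
        size = len(ch.encode("utf-8"))
        if used + size > budget:
            break
        used += size
        count += 1
    return count

print(chars_that_fit("a" * 100_000))   # 32768 one-byte Latin characters
print(chars_that_fit("я" * 100_000))   # 16384 two-byte Cyrillic characters
```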

arkinform avatar Jun 19 '25 12:06 arkinform

Because the ODS is changed in Firebird 6 anyway, it is possible to store the character limit for fields and make the second listed option an effective solution.

aafemt avatar Jun 19 '25 13:06 aafemt

With UTF8 it's also possible to index VARCHAR(1024), provided you use a page size of 32KiB.

In any case, fudging the lengths to accept longer values is not a real solution, because a lot of drivers and other tools (and possibly internals of Firebird) rely on the ability to derive the maximum length in characters from the maximum length in bytes.
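
For example, a driver typically derives the character limit of a column roughly like this (an illustrative sketch, not any particular driver's code):

```python
# Illustrative only: deriving the character limit from the byte length reported
# in the column descriptor, using the charset's maximum bytes per character.
MAX_BYTES_PER_CHAR = {"UTF8": 4, "WIN1251": 1, "NONE": 1}

def max_chars(field_length_bytes: int, charset: str) -> int:
    return field_length_bytes // MAX_BYTES_PER_CHAR[charset]

print(max_chars(32764, "UTF8"))     # 8191 -- a UTF8 VARCHAR(8191)
print(max_chars(1024, "WIN1251"))   # 1024 -- a WIN1251 VARCHAR(1024)
# If the server accepted more bytes than "declared chars * 4", this derived
# value would no longer match what actually fits in the column.
```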

mrotteveel avatar Jun 19 '25 13:06 mrotteveel

  3. Indexes on long varchars can't be created. For example, with a single-byte charset it's possible to create an index on varchar(1024), but not with the UTF8 charset. Of course, we are not going to search on the full length, but we need "starts with" searches while still storing long strings; in healthcare systems this is sometimes really needed. Splitting into several fields or creating computed indexes is not always a feasible solution. @dyemanov mentioned that there are some ideas how this could be "fixed"; it would be useful to have a link to the task on GitHub.

https://github.com/FirebirdSQL/firebird/issues/2629

dyemanov avatar Jun 19 '25 14:06 dyemanov

So, let's focus on the main blocking issue which is still relevant in 6.0: the 8191-character varchar limit in the UTF8 charset. The first obvious approach is to increase the internal varchar byte limit for the UTF8 charset while keeping the current 32k byte limit for single-byte charsets. It is the best solution, as it ensures metadata compatibility and equal character limits across charsets.

Obviously you have never worked with the Firebird API: internally the maximum size of a VARCHAR could be increased, but on the client side it would still be limited to 32K, or at best 64K, bytes. The reason is that in the structure describing the VARCHAR the length is a two-byte integer. Changing the API would require rewriting your application and the access components you use.
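
The two-byte length field can be illustrated with a small Python sketch (the 32767/65535 figures follow directly from a 16-bit length; the descriptor details are simplified):

```python
import struct

# The legacy descriptor stores the value length in a 16-bit integer, so a
# single VARCHAR value on that API cannot exceed 32767 bytes (signed) or,
# at best, 65535 bytes (unsigned), no matter what the engine stores internally.
value_length = 40_000   # bytes of a hypothetical longer UTF8 VARCHAR value

try:
    struct.pack("<h", value_length)    # signed 16-bit length field
except struct.error as err:
    print("doesn't fit in a signed 2-byte length:", err)

try:
    struct.pack("<H", 70_000)          # even an unsigned 16-bit field caps at 65535
except struct.error as err:
    print("doesn't fit in an unsigned 2-byte length:", err)
```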

sim1984 avatar Jun 19 '25 14:06 sim1984

@sim1984

Obviously you have never worked with the Firebird API: internally the maximum size of a VARCHAR could be increased, but on the client side it would still be limited to 32K, or at best 64K, bytes.

That does not mean it should not or cannot be changed. There are always ways to implement new API features without breaking compatibility with old clients. Yes, it will require some enhancements in client libraries and access components. The point is that such an incompatibility is a bad approach, and it should be addressed sooner or later.

For example, we recently migrated our huge Delphi project to our own implementation of database access components, which for Firebird internally use the implementation from the UIB components: https://github.com/zedalaye/uib/. So it is quite feasible for us to adapt our low-level code to changes in the fbclient library and the Firebird API. And this is not a problem at all for Java or Python applications, which will simply switch to a newer version of the standard Firebird client libraries.

arkinform avatar Jun 19 '25 14:06 arkinform

And this is not a problem at all for Java or Python applications, which will simply switch to a newer version of the standard Firebird client libraries.

By default, Jaybird uses the wire protocol, not the Firebird client libraries. However, the native and embedded "protocol" that Jaybird also has still relies on the legacy API, and would definitely break with something like this.

And please, don't assume changes like this are "simple", because in practice they are really not.

mrotteveel avatar Jun 19 '25 14:06 mrotteveel

@mrotteveel

And please, don't assume changes like this are "simple", because in practice they are really not.

My previous reply was about the following point:

Changing the API would require rewriting your application and the access components you use.

I said nothing about this being simple for the core Firebird developers or the developers of the standard Firebird libraries. Yes, any such changes will affect them. But I'll repeat that it should be changed sooner or later, because it is an obvious inconsistency.

arkinform avatar Jun 19 '25 14:06 arkinform

fudging the lengths to accept longer values is not a real solution, because a lot of drivers and other tools (and possibly internals of Firebird) rely on the ability to derive the maximum length in characters from the maximum length in bytes.

Firebird internals are the simplest part. Drivers and tools would still be able to derive the limit in characters from the limit in bytes; it is just that the result would be smaller than it could be. This miscalculation is safe because those drivers and tools would refuse long strings that Firebird would otherwise accept, and that is not a Firebird problem.

aafemt avatar Jun 21 '25 11:06 aafemt

You're missing a lot of use cases for string length determination. For example, I also use it to report the maximum allowed length in truncation errors when setting parameters, and to determine the correct string to return for CHAR columns (because, for example, a CHAR(1) in UTF8 will be padded with up to 3 spaces).
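
A rough sketch of the CHAR case (illustrative only, not Jaybird's actual code; the helper and its names are made up):

```python
# A CHAR(1) column in UTF8 occupies 4 bytes on the wire; the server pads the
# stored value with spaces up to that byte length, so the driver must know the
# length in characters (bytes / 4) to return just the declared character(s).
def read_char_column(raw: bytes, declared_bytes: int, charset: str) -> str:
    bytes_per_char = 4 if charset == "UTF8" else 1
    declared_chars = declared_bytes // bytes_per_char   # CHAR(1) UTF8 -> 1
    text = raw.decode("utf-8" if charset == "UTF8" else "cp1251")
    return text[:declared_chars]        # drop padding beyond the declared length

print(repr(read_char_column(b"A   ", 4, "UTF8")))                      # 'A' (3 padding spaces)
print(repr(read_char_column("\u042f  ".encode("utf-8"), 4, "UTF8")))   # 'Я' (2 padding spaces)
```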

Part of that can be addressed by having a separate info item to report the maximum length in characters of a column or parameter. However, that doesn't address the fact that the proposed solution would mean varying maximum lengths depending on which characters you put in.

mrotteveel avatar Jun 21 '25 11:06 mrotteveel

And no, I wouldn't mind higher limits, or something like SQL Server's VARCHAR(MAX), but we shouldn't implement hacky and brittle solutions that will more than likely break something in our ecosystem.

mrotteveel avatar Jun 21 '25 11:06 mrotteveel