firebird
firebird copied to clipboard
TRIM function is wrongly working in collate UNICODE_CI_AI
Actual Result
- TRIM function is wrongly working in collate UNICODE_CI_AI: Some characters at the end of the string are wrongly being removed
Expected Result
- The TRIM function should only remove leading and/or trailing spaces from the input string: See TRIM() documentation
Steps to reproduce this issue
- Download the attached database: BOTSERVICE_CIN.zip
- In the attached database run the following query:
SELECT
*
FROM
(
SELECT
FEEDITEMS.TITLE,
TRIM(FEEDITEMS.TITLE) AS TRIM_TITLE
FROM
FEEDITEMS
)
WHERE
TITLE<>TRIM_TITLE
Important Notes
- This issue is NOT reproducible in collate UTF8:
SELECT
*
FROM
(
SELECT
FEEDITEMS.TITLE,
TRIM(FEEDITEMS.TITLE COLLATE UTF8) AS TRIM_TITLE
FROM
FEEDITEMS
)
WHERE
TITLE<>TRIM_TITLE
Other Info
- Firebird Version: 3.0.10.33587 (win-64)
- For more details please check the [firebird-support] TRIM function is removing characters instead of just spaces email
The problem is that the strings are in Unicode NFD normalization form - for example instead of have á
encoded as a single character (with two bytes), it's encoded as a
followed by two bytes related to '
.
This does not work well for all operations in Firebird.
In the case of TRIM with UNICODE_CI_AI, it converts the original string (expected to be in NFC) to NFD, remove accents and re-encode in NFC. That changes the length of the string.
Accordingly to the standard it may not be considered as a bug:
"An SQL-implementation may assume that all UCS strings are normalized in one of Normalization Form C (NFC), Normalization Form D (NFD), Normalization Form KC (NFKC), or Normalization Form KD (NFKD),
as specified by [Unicode15].
An workaround is to always send NFC strings to Firebird.
In the case of TRIM with UNICODE_CI_AI, it converts the original string (expected to be in NFC) to NFD, remove accents and re-encode in NFC. That changes the length of the string
Could you explain, why it removes non-space chars also ? If it is due to wrong string length after re-encoding - could it be re-evaluated ?
An workaround is to always send NFC strings to Firebird.
Seems as a good time to introduce some built-in functions to work with normalization.
Could you explain, why it removes non-space chars also ? If it is due to wrong string length after re-encoding - could it be re-evaluated ?
The NFD string has its length calculated counting the diacritic marks as own characters.
When we process the string (which involves re-encoding as NFC as then going to UTF-32) the processed string has its length counted differently (which the diacritic combined with the letter). Trimming is done comparing that UTF-32 string. Later substring is applied to the NFD string and them things are incorrect.
An workaround is to always send NFC strings to Firebird.
Seems as a good time to introduce some built-in functions to work with normalization.
Standard has it and I'm in favor to introduce it.
But this is surely better handled at client applications directly with the inputted strings or things read from files or internet (like it seems to happened in this case).
Hi @asfernandes
An workaround is to always send NFC strings to Firebird.
I just test this workaround using the String.Normalize Method before passing the string to the ADO.NET Provider and the issue was no longer reproducible.
Thank you very much for debugging the issue!
Best Regards, Luciano