firebird icon indicating copy to clipboard operation
firebird copied to clipboard

TRIM function is wrongly working in collate UNICODE_CI_AI

Open luronumen opened this issue 2 years ago • 4 comments

Actual Result

  • TRIM function is wrongly working in collate UNICODE_CI_AI: Some characters at the end of the string are wrongly being removed

Firebird_Trim_Issue_1

Expected Result

  • The TRIM function should only remove leading and/or trailing spaces from the input string: See TRIM() documentation

Steps to reproduce this issue

  1. Download the attached database: BOTSERVICE_CIN.zip
  2. In the attached database run the following query:
SELECT
  *
FROM
(
  SELECT
    FEEDITEMS.TITLE,
    TRIM(FEEDITEMS.TITLE) AS TRIM_TITLE
  FROM
    FEEDITEMS
)
WHERE
  TITLE<>TRIM_TITLE

Firebird_Trim_Issue_2

Important Notes

  • This issue is NOT reproducible in collate UTF8:
SELECT
  *
FROM
(
  SELECT
    FEEDITEMS.TITLE,
    TRIM(FEEDITEMS.TITLE COLLATE UTF8) AS TRIM_TITLE
  FROM
    FEEDITEMS
)
WHERE
  TITLE<>TRIM_TITLE

Firebird_Trim_Issue_3

Other Info

  • Firebird Version: 3.0.10.33587 (win-64)
  • For more details please check the [firebird-support] TRIM function is removing characters instead of just spaces email

luronumen avatar May 23 '22 00:05 luronumen

The problem is that the strings are in Unicode NFD normalization form - for example instead of have á encoded as a single character (with two bytes), it's encoded as a followed by two bytes related to '.

This does not work well for all operations in Firebird.

In the case of TRIM with UNICODE_CI_AI, it converts the original string (expected to be in NFC) to NFD, remove accents and re-encode in NFC. That changes the length of the string.

Accordingly to the standard it may not be considered as a bug:

"An SQL-implementation may assume that all UCS strings are normalized in one of Normalization Form C (NFC), Normalization Form D (NFD), Normalization Form KC (NFKC), or Normalization Form KD (NFKD), as specified by [Unicode15]. may be used to verify the normalization form to which a particular UCS string conforms. Applications may also use to enforce a particular . With the exception of and , the result of any operation on an unnormalized UCS string is implementation-defined."

An workaround is to always send NFC strings to Firebird.

asfernandes avatar May 23 '22 12:05 asfernandes

In the case of TRIM with UNICODE_CI_AI, it converts the original string (expected to be in NFC) to NFD, remove accents and re-encode in NFC. That changes the length of the string

Could you explain, why it removes non-space chars also ? If it is due to wrong string length after re-encoding - could it be re-evaluated ?

An workaround is to always send NFC strings to Firebird.

Seems as a good time to introduce some built-in functions to work with normalization.

hvlad avatar May 23 '22 14:05 hvlad

Could you explain, why it removes non-space chars also ? If it is due to wrong string length after re-encoding - could it be re-evaluated ?

The NFD string has its length calculated counting the diacritic marks as own characters.

When we process the string (which involves re-encoding as NFC as then going to UTF-32) the processed string has its length counted differently (which the diacritic combined with the letter). Trimming is done comparing that UTF-32 string. Later substring is applied to the NFD string and them things are incorrect.

An workaround is to always send NFC strings to Firebird.

Seems as a good time to introduce some built-in functions to work with normalization.

Standard has it and I'm in favor to introduce it.

But this is surely better handled at client applications directly with the inputted strings or things read from files or internet (like it seems to happened in this case).

asfernandes avatar May 23 '22 17:05 asfernandes

Hi @asfernandes

An workaround is to always send NFC strings to Firebird.

I just test this workaround using the String.Normalize Method before passing the string to the ADO.NET Provider and the issue was no longer reproducible.

Thank you very much for debugging the issue!

Best Regards, Luciano

luronumen avatar May 23 '22 17:05 luronumen