bety icon indicating copy to clipboard operation
bety copied to clipboard

PL/pgSQL function "normalize_whitespace" doesn't uniformly account for NO-BREAK SPACE characters

Open gsrohde opened this issue 6 years ago • 0 comments

The normalize_whitespace function treats NO-BREAK SPACE characters as white space on some systems in some configurations and ignores it on others.

This can lead to the following problem: Several check constraints use the is_whitespace_normalized function to ensure attribute values do not contain leading, trailing, or consecutive whitespace characters. This function uses normalize_whitespace to check whether a value changes when the function is applied; if it doesn't, the value is already normalized.

Many (all?) of the deployed BETYdb instances are on machines that don't treat NO-BREAK SPACE characters as whitespace. Thus, these characters can occur in whitespace-checked column values without violating the is_whitespace_normalized-based constraint. If a database that does treat NO-BREAK WHITESPACE characters as whitespace tries to import from one of these instances, the import will fail because values being imported don't comply with the constraint as interpreted by the importing database. (It seems that MacOS systems and maybe all BSD-based systems behave this way.)

(Note that the Rails code normalizes whitespace for many or all of these columns upon creation or update, so the problem presumably arises only if NO-BREAK SPACE characters are inserted by manipulating the database directly.)

Proposed fix

Replace the function definition

CREATE FUNCTION normalize_whitespace(string text) RETURNS text
    LANGUAGE plpgsql
    AS $$
DECLARE
  result text;
BEGIN
  SELECT TRIM(REGEXP_REPLACE(string, '\s+', ' ', 'g')) INTO result;
  RETURN result;
END;
$$;

with

CREATE FUNCTION normalize_whitespace(string text) RETURNS text
    LANGUAGE plpgsql
    AS $$
DECLARE
  result text;
BEGIN
  SELECT TRIM(REGEXP_REPLACE(string, '[\s\u00a0]+', ' ', 'g')) INTO result;
  RETURN result;
END;
$$;

Here, I have only changed the regular expression in the REGEXP_REPLACE function call.

gsrohde avatar May 23 '18 20:05 gsrohde