firebird icon indicating copy to clipboard operation
firebird copied to clipboard

Inconsistent conversion of non-TEXT blobs in BLOB_APPEND

Open mrotteveel opened this issue 3 years ago • 13 comments

The conversion of non-TEXT blobs in BLOB_APPEND is not consistent, that is the behaviour for the first argument is different compared to subsequent arguments.

As a simple example, against the employee database:

select BLOB_APPEND(rdb$view_blr, null) from rdb$relations where rdb$relation_name = 'PHONE_LIST';

Results in mojibake, something like:

♣C☻EMPLOYEE☺J
DEPARTMENT☻G/↨☺DEPT_NO↨☻DEPT_NO L

Reversing the order of arguments applies the conversion from BLR to its textual representation:

select BLOB_APPEND(null, rdb$view_blr) from rdb$relations where rdb$relation_name = 'PHONE_LIST';

Results in:

blr_version5,blr_rse, 2,    blr_relation, 8, 'E','M','P','L','O','Y','E','E', 1,    blr_relation, 10, 'D','E','P','A','R','T','M','E','N','T', 2,    blr_boolean,       blr_eql,          blr_field, 1, 7, 'D','E','P','T','_','N','O',          blr_field, 2, 7, 'D','E','P','T','_','N','O',    blr_end, blr_eoc

In other words, it seems for the second and higher arguments, blob filters to type TEXT (at least, I believe this is a built-in blob filter) are invoked, while for the first argument the binary content of the blob is used as-is.

mrotteveel avatar Aug 11 '22 15:08 mrotteveel

Based on the discussion on firebird-devel, it seems the first BLOB argument should have determined the type of the result of BLOB_APPEND, instead of using BLOB SUB_TYPE TEXT CHARACTER SET OCTETS for non-text blobs.

mrotteveel avatar Aug 11 '22 16:08 mrotteveel

Based on the discussion on firebird-devel, it seems the first BLOB argument should have determined the type of the result of BLOB_APPEND, instead of using BLOB SUB_TYPE TEXT CHARACTER SET OCTETS for non-text blobs.

Now BLOB_APPEND result is defined as:

  • if first argument is blob, then use its blob subtype and charset
  • if first argument is [var]char, then use it charset and create BLOB SUB_TYPE TEXT
  • if first argument is NULL or not a blob nor [var]char, then create BLOB SUB_TYPE TEXT with charset of connection.

PS it not uses CHARACTER SET OCTETS before, it was charset of connection.

hvlad avatar Aug 14 '22 10:08 hvlad

PS it not uses CHARACTER SET OCTETS before, it was charset of connection.

I looked at it with connection character set NONE and UTF8, and in both cases for non-text blobs, Firebird 4.0.2 uses OCTETS as the character set, not the connection character set.

mrotteveel avatar Aug 14 '22 10:08 mrotteveel

BTW: The current commit is on master, shouldn't it also go into v4.0-release?

mrotteveel avatar Aug 14 '22 10:08 mrotteveel

PS it not uses CHARACTER SET OCTETS before, it was charset of connection.

I looked at it with connection character set NONE and UTF8, and in both cases for non-text blobs, Firebird 4.0.2 uses OCTETS as the character set, not the connection character set.

Yes, indeed. So, what is better for NULL in 1st arg - create text blob with charset NONE or with current connection charset ?

hvlad avatar Aug 14 '22 11:08 hvlad

Using the connection character set for the NULL case sounds fine.

mrotteveel avatar Aug 14 '22 11:08 mrotteveel

I would suggest to derive the type from the second argument if the first is NULL. Otherwise it will be impossible to gather a binary BLOB without creating of an empty BLOB stub.

aafemt avatar Aug 14 '22 11:08 aafemt

@aafemt The appropriate way would be to use BLOB_APPEND(bin_blob, NULL) instead of BLOB_APPEND(NULL, bin_blob).

mrotteveel avatar Aug 14 '22 11:08 mrotteveel

Are maybe make the second parameter optional (instead of currently, the third and beyond), and allow BLOB_APPEND(bin_blob).

mrotteveel avatar Aug 14 '22 11:08 mrotteveel

Except it will append data to the bin_blob, no.

aafemt avatar Aug 14 '22 11:08 aafemt

@aafemt Only if bin_blob is already a temporary blob marked as BLB_close_on_read.

mrotteveel avatar Aug 14 '22 11:08 mrotteveel

And this returns us to the original discussion when I complained about inconsistent unpredictable behavior regarding modification of the first argument.

aafemt avatar Aug 14 '22 11:08 aafemt

I would suggest to derive the type from the second argument if the first is NULL. Otherwise it will be impossible to gather a binary BLOB without creating of an empty BLOB stub.

I tend to agree.

The connection charset has no relation with things happening inside the engine. It's for conversion of things to clients (possibly each one using a different charset).

asfernandes avatar Aug 14 '22 12:08 asfernandes

On 8/14/22 15:53, Adriano dos Santos Fernandes wrote:

I would suggest to derive the type from the second argument if the
first is NULL. Otherwise it will be impossible to gather a binary
BLOB without creating of an empty BLOB stub.

I tend to agree.

+1 Using type of second argument is more predictable and reliable behavior.

The connection charset has no relation with things happening inside the engine. It's for conversion of things to clients (possibly each one using a different charset).

AlexPeshkoff avatar Aug 15 '22 10:08 AlexPeshkoff

Looks like there is agreement to derive result blob type and charset from first non-NULL argument, correct ?

I.e. BLOB_APPEND result should be defined as:

  • if first non-NULL argument is blob, then use its blob subtype and charset
  • if first non-NULL argument is [var]char, then use its charset and create BLOB SUB_TYPE TEXT
  • if first non-NULL argument is not a blob nor [var]char, then create BLOB SUB_TYPE TEXT CHARACTER SET NONE
  • if all arguments is NULL, then return NULL

hvlad avatar Aug 15 '22 11:08 hvlad

For varchar character set OCTETS a binary BLOB should be created IMHO in the second case. Perhaps, in the third case charset UTF-8 should be considered.

aafemt avatar Aug 15 '22 11:08 aafemt

Looks like there is agreement to derive result blob type and charset from first non-NULL argument, correct ?

What about same rules of COALESCE?

asfernandes avatar Aug 15 '22 11:08 asfernandes

I second that (VAR)BINARY (a.k.a. (VAR)CHAR CHARACTER SET OCTETS should resolve to BLOB SUB_TYPE BINARY, not to BLOB SUB_TYPE TEXT CHARACTER SET OCTETS.

mrotteveel avatar Aug 15 '22 11:08 mrotteveel

Looks like there is agreement to derive result blob type and charset from first non-NULL argument, correct ?

What about same rules of COALESCE?

If you speak about DataTypeUtilBase::makeFromList - it can't be used as is for BLOB_APPEND. Also, it tries to find "common" datatype for all items in list while for BLOB_APPEND it is not known in advance - as BLOB_APPEND could be used many times with the same blob.

So, I still think the best we can do is to derive result type from first non-null argument. Taking into account rules used for COALESE, it looks like:

  • if first non-NULL argument is blob, then use its blob subtype and charset
  • if first non-NULL argument is [var]char with charset NONE or OCTETS, then create BLOB SUB_TYPE BINARY
  • if first non-NULL argument is [var]char with charset other than NONE or OCTETS, then use its charset and create BLOB SUB_TYPE TEXT
  • if first non-NULL argument is not a blob nor [var]char, then create BLOB SUB_TYPE TEXT CHARACTER SET ASCII
  • if all arguments is NULL, then return NULL

hvlad avatar Aug 16 '22 15:08 hvlad

I think (VAR)CHAR CHARACTER SET NONE should result in a BLOB SUB_TYPE TEXT CHARACTER SET NONE, not in BLOB SUB_TYPE BINARY. Only CHARACTER SET OCTETS should result in SUB_TYPE BINARY.

mrotteveel avatar Aug 16 '22 16:08 mrotteveel

If, for example, I use connection character set NONE, and have a string literal as the first argument of BLOB_APPEND, I expect a text blob result, not a binary blob.

mrotteveel avatar Aug 16 '22 16:08 mrotteveel

Ok, lets not convert NONE to the BINARY. So, the rules are:

  • if first non-NULL argument is blob, then use its blob subtype and charset
  • if first non-NULL argument is [var]char with charset OCTETS, then create BLOB SUB_TYPE BINARY
  • if first non-NULL argument is [var]char with charset other than OCTETS, then use its charset and create BLOB SUB_TYPE TEXT
  • if first non-NULL argument is not a blob nor [var]char, then create BLOB SUB_TYPE TEXT CHARACTER SET ASCII
  • if all arguments is NULL, then return NULL

Correct ?

hvlad avatar Aug 16 '22 16:08 hvlad

I have my reservations about using ASCII for non-blob/non-text, it might be a bit restrictive. On the other hand, it is easy to workaround.

mrotteveel avatar Aug 17 '22 08:08 mrotteveel

ASCII is to follow our COALESCE rules (that claims to follow standard). Please, check next snapshot build.

hvlad avatar Aug 17 '22 13:08 hvlad