Inconsistent conversion of non-TEXT blobs in BLOB_APPEND
The conversion of non-TEXT blobs in BLOB_APPEND is not consistent, that is the behaviour for the first argument is different compared to subsequent arguments.
As a simple example, against the employee database:
select BLOB_APPEND(rdb$view_blr, null) from rdb$relations where rdb$relation_name = 'PHONE_LIST';
Results in mojibake, something like:
♣C☻EMPLOYEE☺J
DEPARTMENT☻G/↨☺DEPT_NO↨☻DEPT_NO L
Reversing the order of arguments applies the conversion from BLR to its textual representation:
select BLOB_APPEND(null, rdb$view_blr) from rdb$relations where rdb$relation_name = 'PHONE_LIST';
Results in:
blr_version5,blr_rse, 2, blr_relation, 8, 'E','M','P','L','O','Y','E','E', 1, blr_relation, 10, 'D','E','P','A','R','T','M','E','N','T', 2, blr_boolean, blr_eql, blr_field, 1, 7, 'D','E','P','T','_','N','O', blr_field, 2, 7, 'D','E','P','T','_','N','O', blr_end, blr_eoc
In other words, it seems for the second and higher arguments, blob filters to type TEXT (at least, I believe this is a built-in blob filter) are invoked, while for the first argument the binary content of the blob is used as-is.
Based on the discussion on firebird-devel, it seems the first BLOB argument should have determined the type of the result of BLOB_APPEND, instead of using BLOB SUB_TYPE TEXT CHARACTER SET OCTETS for non-text blobs.
Based on the discussion on firebird-devel, it seems the first BLOB argument should have determined the type of the result of
BLOB_APPEND, instead of usingBLOB SUB_TYPE TEXT CHARACTER SET OCTETSfor non-text blobs.
Now BLOB_APPEND result is defined as:
- if first argument is blob, then use its blob subtype and charset
- if first argument is [var]char, then use it charset and create BLOB SUB_TYPE TEXT
- if first argument is NULL or not a blob nor [var]char, then create BLOB SUB_TYPE TEXT with charset of connection.
PS it not uses CHARACTER SET OCTETS before, it was charset of connection.
PS it not uses CHARACTER SET OCTETS before, it was charset of connection.
I looked at it with connection character set NONE and UTF8, and in both cases for non-text blobs, Firebird 4.0.2 uses OCTETS as the character set, not the connection character set.
BTW: The current commit is on master, shouldn't it also go into v4.0-release?
PS it not uses CHARACTER SET OCTETS before, it was charset of connection.
I looked at it with connection character set
NONEandUTF8, and in both cases for non-text blobs, Firebird 4.0.2 usesOCTETSas the character set, not the connection character set.
Yes, indeed. So, what is better for NULL in 1st arg - create text blob with charset NONE or with current connection charset ?
Using the connection character set for the NULL case sounds fine.
I would suggest to derive the type from the second argument if the first is NULL. Otherwise it will be impossible to gather a binary BLOB without creating of an empty BLOB stub.
@aafemt The appropriate way would be to use BLOB_APPEND(bin_blob, NULL) instead of BLOB_APPEND(NULL, bin_blob).
Are maybe make the second parameter optional (instead of currently, the third and beyond), and allow BLOB_APPEND(bin_blob).
Except it will append data to the bin_blob, no.
@aafemt Only if bin_blob is already a temporary blob marked as BLB_close_on_read.
And this returns us to the original discussion when I complained about inconsistent unpredictable behavior regarding modification of the first argument.
I would suggest to derive the type from the second argument if the first is NULL. Otherwise it will be impossible to gather a binary BLOB without creating of an empty BLOB stub.
I tend to agree.
The connection charset has no relation with things happening inside the engine. It's for conversion of things to clients (possibly each one using a different charset).
On 8/14/22 15:53, Adriano dos Santos Fernandes wrote:
I would suggest to derive the type from the second argument if the first is NULL. Otherwise it will be impossible to gather a binary BLOB without creating of an empty BLOB stub.I tend to agree.
+1 Using type of second argument is more predictable and reliable behavior.
The connection charset has no relation with things happening inside the engine. It's for conversion of things to clients (possibly each one using a different charset).
Looks like there is agreement to derive result blob type and charset from first non-NULL argument, correct ?
I.e. BLOB_APPEND result should be defined as:
- if first non-NULL argument is blob, then use its blob subtype and charset
- if first non-NULL argument is [var]char, then use its charset and create BLOB SUB_TYPE TEXT
- if first non-NULL argument is not a blob nor [var]char, then create BLOB SUB_TYPE TEXT CHARACTER SET NONE
- if all arguments is NULL, then return NULL
For varchar character set OCTETS a binary BLOB should be created IMHO in the second case. Perhaps, in the third case charset UTF-8 should be considered.
Looks like there is agreement to derive result blob type and charset from first non-NULL argument, correct ?
What about same rules of COALESCE?
I second that (VAR)BINARY (a.k.a. (VAR)CHAR CHARACTER SET OCTETS should resolve to BLOB SUB_TYPE BINARY, not to BLOB SUB_TYPE TEXT CHARACTER SET OCTETS.
Looks like there is agreement to derive result blob type and charset from first non-NULL argument, correct ?
What about same rules of
COALESCE?
If you speak about DataTypeUtilBase::makeFromList - it can't be used as is for BLOB_APPEND.
Also, it tries to find "common" datatype for all items in list while for BLOB_APPEND it is not known in advance - as BLOB_APPEND could be used many times with the same blob.
So, I still think the best we can do is to derive result type from first non-null argument. Taking into account rules used for COALESE, it looks like:
- if first non-NULL argument is blob, then use its blob subtype and charset
- if first non-NULL argument is [var]char with charset NONE or OCTETS, then create BLOB SUB_TYPE BINARY
- if first non-NULL argument is [var]char with charset other than NONE or OCTETS, then use its charset and create BLOB SUB_TYPE TEXT
- if first non-NULL argument is not a blob nor [var]char, then create BLOB SUB_TYPE TEXT CHARACTER SET ASCII
- if all arguments is NULL, then return NULL
I think (VAR)CHAR CHARACTER SET NONE should result in a BLOB SUB_TYPE TEXT CHARACTER SET NONE, not in BLOB SUB_TYPE BINARY. Only CHARACTER SET OCTETS should result in SUB_TYPE BINARY.
If, for example, I use connection character set NONE, and have a string literal as the first argument of BLOB_APPEND, I expect a text blob result, not a binary blob.
Ok, lets not convert NONE to the BINARY. So, the rules are:
- if first non-NULL argument is blob, then use its blob subtype and charset
- if first non-NULL argument is [var]char with charset OCTETS, then create BLOB SUB_TYPE BINARY
- if first non-NULL argument is [var]char with charset other than OCTETS, then use its charset and create BLOB SUB_TYPE TEXT
- if first non-NULL argument is not a blob nor [var]char, then create BLOB SUB_TYPE TEXT CHARACTER SET ASCII
- if all arguments is NULL, then return NULL
Correct ?
I have my reservations about using ASCII for non-blob/non-text, it might be a bit restrictive. On the other hand, it is easy to workaround.
ASCII is to follow our COALESCE rules (that claims to follow standard). Please, check next snapshot build.