singer-python

Add UTF-8 validity checking to schema

Open KBorders01 opened this issue 3 years ago • 0 comments

For the "string" data type, the _transform function just attempts str(data) and treats the value as valid unless that raises an exception. Binary strings containing null bytes or invalid UTF-8 byte sequences therefore pass through this function as valid strings. However, targets may expect strings to be valid encoded text, such as UTF-8.
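To illustrate the behavior described above, here is a minimal sketch in plain Python 3. It does not call singer-python; `passes_str_check` is a hypothetical stand-in for the "try str(data), catch an exception" check, written only to show why arbitrary bytes are never rejected by it.

```python
def passes_str_check(value):
    """Hypothetical stand-in for a 'str(data) or fail' validity check."""
    try:
        str(value)
        return True
    except Exception:
        return False


def is_valid_utf8(raw):
    """Stricter check: do these bytes actually decode as UTF-8?"""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


raw = b"\x00\xff\xfe"  # a null byte followed by bytes that are not valid UTF-8

print(passes_str_check(raw))  # str() on bytes returns its repr, so no exception
print(is_valid_utf8(raw))     # decoding is what actually detects the problem
```

In Python 3, `str()` applied to a bytes object returns its printable representation rather than decoding it, so the first check succeeds for any byte sequence.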

UTF-8 encoding validation can be enforced with a pre_hook when calling transform, but that doesn't tell the target what kind of string it is receiving. It'd be helpful to somehow include character encoding as part of the schema so that downstream targets know what to expect and can choose the appropriate data type. For example, MySQL has TEXT and BLOB types to handle text and binary strings separately. One natural place to put this could be the "format" parameter, though it'd be tedious to have to specify UTF-8 explicitly for every string when that is the default. It'd be convenient to have a way to make UTF-8 the default for all strings in a schema and to override it with binary (the current behavior) explicitly for binary fields.
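A rough sketch of what the pre_hook workaround combined with the proposed default could look like. Everything here is an assumption for illustration: the hook is assumed to be called as `(data, typ, schema)`, the `"format": "binary"` value is hypothetical and not part of any existing schema convention, and raising `ValueError` on bad input is just one possible choice.

```python
def utf8_pre_hook(data, typ, schema):
    """Hypothetical pre_hook: treat strings as UTF-8 unless marked binary.

    Assumes the hook receives (data, typ, schema); check the transform
    code in your singer-python version before relying on this.
    """
    if typ != "string":
        return data  # only string fields are checked

    # Hypothetical opt-out: "format": "binary" keeps the current behavior.
    if schema.get("format") == "binary":
        return data

    if isinstance(data, (bytes, bytearray)):
        try:
            return bytes(data).decode("utf-8")
        except UnicodeDecodeError as exc:
            raise ValueError("string field is not valid UTF-8") from exc

    if isinstance(data, str):
        try:
            data.encode("utf-8")  # fails on lone surrogates
        except UnicodeEncodeError as exc:
            raise ValueError("string field cannot be encoded as UTF-8") from exc

    return data


# Example schema using the hypothetical "binary" override for one field.
schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "payload": {"type": "string", "format": "binary"},
    },
}
```

With this shape, text fields need no extra annotation, and only binary fields carry an explicit marker that a target could map to something like BLOB instead of TEXT.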

KBorders01 • Sep 08 '21 13:09