iglu
iglu copied to clipboard
Provide special treatment for human-written text
In JSON Schema specification maxLength
keyword specifies the amount of characters in field, not bytes. However, in Redshift (as in most databases, I believe) in VARCHAR
we specify amount of bytes, which may introduce mismatch in text-fields, usually written by humans, not computers.
This is generally not a problem as analytical data typically contains ASCII-text (written by computers), where amount of bytes precisely match amount of characters.
But at the same time, I can imagine an issue:
- User supposed to enter his/her city name in native language (likely with non-ASCII characters)
- Web-developer constrains input-field to 32 characters
- Analysts makes a wrong assumption that
maxLength: 32
is correct constrain - Redshift truncates all non-ASCII city names to 16 characters
This could be done as part of https://github.com/snowplow/iglu/issues/170 (format: "unicode"
, which specify that string has absolutely no structure) or similar custom JSON-schema extension.
I would like to add to point 3 that it is not necessarily a mistake or assumption.
In our case we use the schema as source of truth and generate DataTransferObjects from the schema, in order to comply with the schema.
In other words, if the schema defines a specific maxLength
expressed in characters, then that is what the application (and developer) honors and should be able to trust to be stored as the schema describes.
Maybe igluctl
can "detect" if specific storage engines need over provisioning based on the schema requirements and thus generate "larger" fields in the generated DDL (Redschift sql schema) files?
Hey @kasper-agg,
Yes I agree that using JSON Schemas as source of truth is a great idea and this issue prevents engineers from doing that. I also agree that it would be a right thing to do to over-provision Redshift, but problem here is that we need to come up with way to "mark" these fields as it is relatively narrow subset of cases.
Regardless of the solution, I must agree it sounds like a challenge. Even more to fix it in clean way...
I'm quite keen to add custom extensions into JSON Schema (example https://github.com/snowplow/iglu/issues/170, but new properties as well - it is fully-compatible with standard) and I think this is the way to go here.
I have a weird feeling we may end up having to drive a Draft 5 (?) of the JSON Schema spec, given there don't seem to be many other vendors taking a major interest in it...
That would be lovely! I've seen efforts on Draft 5 specification, but almost no implementations out there.