iglu icon indicating copy to clipboard operation
iglu copied to clipboard

Provide special treatment for human-written text

Open chuwy opened this issue 7 years ago • 6 comments

In JSON Schema specification maxLength keyword specifies the amount of characters in field, not bytes. However, in Redshift (as in most databases, I believe) in VARCHAR we specify amount of bytes, which may introduce mismatch in text-fields, usually written by humans, not computers.

This is generally not a problem as analytical data typically contains ASCII-text (written by computers), where amount of bytes precisely match amount of characters.

But at the same time, I can imagine an issue:

  1. User supposed to enter his/her city name in native language (likely with non-ASCII characters)
  2. Web-developer constrains input-field to 32 characters
  3. Analysts makes a wrong assumption that maxLength: 32 is correct constrain
  4. Redshift truncates all non-ASCII city names to 16 characters

This could be done as part of https://github.com/snowplow/iglu/issues/170 (format: "unicode", which specify that string has absolutely no structure) or similar custom JSON-schema extension.

chuwy avatar Aug 24 '17 16:08 chuwy

I would like to add to point 3 that it is not necessarily a mistake or assumption.

In our case we use the schema as source of truth and generate DataTransferObjects from the schema, in order to comply with the schema. In other words, if the schema defines a specific maxLength expressed in characters, then that is what the application (and developer) honors and should be able to trust to be stored as the schema describes.

Maybe igluctl can "detect" if specific storage engines need over provisioning based on the schema requirements and thus generate "larger" fields in the generated DDL (Redschift sql schema) files?

kasper-agg avatar Aug 25 '17 09:08 kasper-agg

Hey @kasper-agg,

Yes I agree that using JSON Schemas as source of truth is a great idea and this issue prevents engineers from doing that. I also agree that it would be a right thing to do to over-provision Redshift, but problem here is that we need to come up with way to "mark" these fields as it is relatively narrow subset of cases.

chuwy avatar Aug 25 '17 09:08 chuwy

Regardless of the solution, I must agree it sounds like a challenge. Even more to fix it in clean way...

kasper-agg avatar Aug 25 '17 09:08 kasper-agg

I'm quite keen to add custom extensions into JSON Schema (example https://github.com/snowplow/iglu/issues/170, but new properties as well - it is fully-compatible with standard) and I think this is the way to go here.

chuwy avatar Aug 25 '17 09:08 chuwy

I have a weird feeling we may end up having to drive a Draft 5 (?) of the JSON Schema spec, given there don't seem to be many other vendors taking a major interest in it...

alexanderdean avatar Aug 25 '17 09:08 alexanderdean

That would be lovely! I've seen efforts on Draft 5 specification, but almost no implementations out there.

chuwy avatar Aug 25 '17 09:08 chuwy