bottledwater-pg
Better representation of timestamps
Postgres has the following date and time datatypes, with the internal representation for each:

- `date` (DATEOID = 1082): 32-bit integer, 1 day resolution
- `time without time zone` (TIMEOID = 1083): 64-bit integer (microseconds since start of day)
- `time with time zone` (TIMETZOID = 1266): 64-bit integer (microseconds since start of day) plus a 32-bit integer (seconds offset from UTC)
- `timestamp without time zone` (TIMESTAMPOID = 1114): 64-bit integer (microseconds since epoch)
- `timestamp with time zone` (TIMESTAMPTZOID = 1184): 64-bit integer (microseconds since epoch)
- `interval` (INTERVALOID = 1186): a 64-bit integer for microseconds, a 32-bit integer for days, and a 32-bit integer for months
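For concreteness, these internal layouts can be sketched as C structs, loosely following the definitions in Postgres's `utils/date.h` and `datatype/timestamp.h` (the typedef and field names here are approximations, not the exact Postgres declarations):

```c
#include <stdint.h>

typedef int32_t DateADT;    /* date: days since 2000-01-01 */
typedef int64_t TimeADT;    /* time: microseconds since start of day */
typedef int64_t Timestamp;  /* timestamp: microseconds since 2000-01-01 */

typedef struct {
    TimeADT time;           /* microseconds since start of day */
    int32_t zone;           /* timezone offset from UTC, in seconds */
} TimeTzADT;                /* time with time zone */

typedef struct {
    int64_t time;           /* sub-day part, in microseconds */
    int32_t day;            /* days ("1 day" may not equal 24 hours) */
    int32_t month;          /* months and years (1 year = 12 months) */
} Interval;
```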
There are a few curious nitty-gritty details about this:

- The epoch for the `timestamp` types is 1 January 2000, not 1 January 1970 as in POSIX timestamps.
- `time with time zone` has an explicit timezone offset field, but `timestamp with time zone` does not! In fact, `timestamp {with,without} time zone` have the same internal representation; the difference is only whether the value gets converted into the client's local timezone on output.
- `interval` breaks the days and months into separate fields, which makes sense, since "1 month" may be anything between 28 and 31 days, and "1 day" may not equal 24 hours (due to DST changes). Years are represented as multiples of 12 months.
- Leap seconds do not feature in any of the above. Like POSIX, Postgres pretends that leap seconds do not exist.
- Postgres uses the Gregorian calendar (even though the comments in the source falsely state that it uses the Julian calendar). This means that leap years are consistent between Postgres and POSIX: years that are divisible by 100 but not divisible by 400 are counted as non-leap years.
- Postgres supports the special values `infinity` and `-infinity` (which are stored as special timestamp values).
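As an aside, the gap between the two epochs works out to exactly 10957 days (30 years plus 7 leap days: 1972, 1976, 1980, 1984, 1988, 1992, 1996), i.e. 946684800 seconds. A small self-check, using the non-standard but widely available `timegm()`:

```c
#include <assert.h>
#include <time.h>

int main(void) {
    struct tm pg_epoch = {0};            /* 1 January 2000, 00:00:00 UTC */
    pg_epoch.tm_year = 2000 - 1900;      /* struct tm years count from 1900 */
    pg_epoch.tm_mon  = 0;                /* January (months are 0-based) */
    pg_epoch.tm_mday = 1;
    assert(timegm(&pg_epoch) == 10957LL * 86400);  /* 946684800 seconds */
    return 0;
}
```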
Bottled Water's Avro encoding currently uses a slightly funky representation of date/time datatypes. For example, a column of type `timestamp with time zone` is currently represented as follows:
```json
[
  "null",
  {
    "type": "record",
    "name": "DateTime",
    "namespace": "com.martinkl.bottledwater.datatypes",
    "fields": [
      {"name": "year",       "type": "int"},
      {"name": "month",      "type": "int"},
      {"name": "day",        "type": "int"},
      {"name": "hour",       "type": "int"},
      {"name": "minute",     "type": "int"},
      {"name": "second",     "type": "int"},
      {"name": "micro",      "type": "int"},
      {"name": "zoneOffset", "type": "int"}
    ]
  },
  {
    "type": "enum",
    "name": "SpecialTime",
    "symbols": ["POS_INFINITY", "NEG_INFINITY"]
  }
]
```
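For a sense of the friction this causes, a consumer expecting a Unix timestamp would have to do something like the following (a hypothetical helper, not part of Bottled Water, assuming `zoneOffset` is in seconds east of UTC and using the non-standard `timegm()`):

```c
#include <stdint.h>
#include <time.h>

/* Collapse a DateTime record back into microseconds since the Unix epoch. */
int64_t datetime_record_to_unix_micros(int year, int month, int day,
                                       int hour, int minute, int second,
                                       int micro, int zoneOffset) {
    struct tm tm = {0};
    tm.tm_year = year - 1900;   /* struct tm years count from 1900 */
    tm.tm_mon  = month - 1;     /* struct tm months are 0-based */
    tm.tm_mday = day;
    tm.tm_hour = hour;
    tm.tm_min  = minute;
    tm.tm_sec  = second;

    time_t secs = timegm(&tm) - zoneOffset;  /* shift back to UTC */
    return (int64_t)secs * 1000000 + micro;
}
```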
That representation is expressive, and in some sense "correct" (it preserves all the information in the datatype), but it's perhaps a bit confusing to work with if you were simply expecting a Unix timestamp. So perhaps we should change the representation. The most obvious alternatives to the above record structure would be:
- ISO 8601 string: also breaks the value down into year-month-day-etc., but in an ASCII encoding rather than an Avro record (see the sketch after this list). This is the form in which data usually gets in and out of the database, since it is used by SQL. It's human-readable, but the downside is that it's fairly verbose (~26 bytes, compared to 8 bytes for a binary timestamp).
- Unix timestamp: a 1970-01-01 epoch would be the least surprising. But what should the resolution be? Unix uses seconds, Java uses milliseconds, Postgres uses microseconds. If we use microseconds with a 1970 epoch, the Avro encoding would overflow at a slightly different point than Postgres' internal representation. One option would be to put the microseconds fraction in a separate field, but then we have a more complex datatype again.
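To illustrate the size trade-off, here is a minimal sketch of the ISO 8601 option (UTC only; pre-1970 values and the infinity cases are ignored for brevity):

```c
#include <inttypes.h>
#include <stdio.h>
#include <time.h>

/* Format epoch microseconds as e.g. "2015-07-01T12:34:56.789012Z"
 * (~26 bytes, versus 8 bytes for a raw int64). */
static void format_iso8601(int64_t unix_micros, char *buf, size_t len) {
    time_t secs = (time_t)(unix_micros / 1000000);
    int micros  = (int)(unix_micros % 1000000);
    struct tm tm;
    gmtime_r(&secs, &tm);
    size_t n = strftime(buf, len, "%Y-%m-%dT%H:%M:%S", &tm);
    snprintf(buf + n, len - n, ".%06dZ", micros);
}

int main(void) {
    char buf[32];
    format_iso8601(INT64_C(1435754096789012), buf, sizeof buf);
    puts(buf);  /* prints 2015-07-01T12:34:56.789012Z */
    return 0;
}
```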
Does anybody have opinions on what the format should be? In principle we could make this configurable, but it would be better to just decide on one good encoding and stick with it.
My proposal for a new encoding is as follows:
- `timestamp {with,without} time zone` are represented as 64-bit microseconds since the 1970 epoch. `-infinity` and `infinity` are represented as -2^63 and 2^63-1, respectively. We accept that there are a few timestamps on which this conversion overflows, because Postgres' epoch is 30 years later. We always report the timestamp as UTC, regardless of the client timezone setting (see the sketch after this list).
- `date`, `time {with,without} time zone` and `interval` are represented as they are internally in Postgres (see above).
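A minimal sketch of the timestamp case of this proposal, assuming (as Postgres does internally) that `-infinity` and `infinity` are stored as INT64_MIN and INT64_MAX, and clamping as one way of "accepting" the rare overflow:

```c
#include <stdint.h>

/* Microseconds between 1970-01-01 and 2000-01-01 (10957 days). */
#define EPOCH_SHIFT_MICROS INT64_C(946684800000000)

/* Convert a Postgres timestamp (microseconds since 2000-01-01) into the
 * proposed Avro long (microseconds since 1970-01-01). */
int64_t pg_timestamp_to_avro_micros(int64_t pg_micros) {
    if (pg_micros == INT64_MIN) return INT64_MIN;  /* -infinity -> -2^63  */
    if (pg_micros == INT64_MAX) return INT64_MAX;  /*  infinity -> 2^63-1 */
    if (pg_micros > INT64_MAX - EPOCH_SHIFT_MICROS)
        return INT64_MAX;                          /* clamp rare overflow */
    return pg_micros + EPOCH_SHIFT_MICROS;
}
```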
Alternative arguments welcome.
Another consideration would be integration with Kafka Connect, for which we'll want to map the Avro data model into the Kafka Connect data model. However, Kafka Connect doesn't define any timestamp datatypes either, only 32-bit or 64-bit integers and structs, so compatibility with Kafka Connect doesn't really change anything.
@ept Kafka Connect does have some logical types for this already: Date, Time, and Timestamp. Note that Kafka Connect's Timestamp is sans timezone, and the underlying representation is just Unix time. I think it would probably also be good to define a similar logical type for an ISO 8601/RFC 3339 string encoding, which covers the timezone and is very widely parseable, although even that has limitations: I think it only uses numeric offsets for timezones, which won't always capture complete information about the timestamp.
More generally, my view is that as much as programmers would like dates and times to be as simple as Unix time, they aren't, and you can't get away with something that simple. It's unfortunate that it requires so much effort, but probably supporting, at a minimum, something like ISO 8601 (which, btw, is what a lot of REST APIs do, since they just encode dates as strings) will give reasonable coverage for use cases, and supporting a few more variants will make people's lives a lot easier.
In this case, I'm a bit less sure of the best option, because I don't think Bottled Water has much in the way of configs yet? Without being able to control this you'd have to choose a good default. I'd err on the side of preserving as much info as possible while remaining standard, which I guess means ISO 8601.
@ewencp Thanks for your comments, and for pointing out the logical types! I had missed those. Bottled Water does have support for passing in options, so it would be possible to offer several different representations. I've just tried to keep the number of options to a minimum so far.
@ept Agreed on keeping them to a minimum. Every one we add to Kafka Connect is yet another type that every connector either has to handle or, by not handling the logical type, results in loss of information or a less-than-ideal representation in the downstream system (datetimes aren't bad for this, but Decimals just end up as a `byte[]`...).
If you really want the minimum, I'd say ISO 8601. I hate having to give people strings to parse, but it's probably the most widely usable solution that preserves reasonable accuracy wrt the source data.