
TextIO.read() does not assign timestamps

Open jklukas opened this issue 6 years ago • 0 comments

The filenames we assign for file-based output include the start and end time of the window being written, and by default we use 10-minute windows.
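To make the naming scheme concrete, here is a minimal sketch of how an event time maps to a 10-minute window and a filename. It uses only `java.time`, not Beam's windowing classes. The class and method names are hypothetical, and the filename pattern (prefix, window start, window end, shard suffix) is inferred from the example filename in this issue, so the actual implementation may differ.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical illustration only; not the gcp-ingestion implementation.
class WindowedFilenameSketch {
  // ISO-style UTC timestamp with millisecond precision, as seen in the filenames.
  private static final DateTimeFormatter FMT =
      DateTimeFormatter.ofPattern("uuuu-MM-dd'T'HH:mm:ss.SSS'Z'").withZone(ZoneOffset.UTC);

  // Align the event time down to the start of its fixed window, then build the name.
  static String windowedFilename(Instant eventTime, Duration size) {
    long sizeMs = size.toMillis();
    long ms = eventTime.toEpochMilli();
    long startMs = ms - Math.floorMod(ms, sizeMs);
    Instant start = Instant.ofEpochMilli(startMs);
    Instant end = start.plus(size);
    // Shard suffix is hardcoded here (assumption: a single output shard).
    return "out-" + FMT.format(start) + "-" + FMT.format(end) + "-00000-of-00001.ndjson";
  }

  public static void main(String[] args) {
    Instant eventTime = Instant.parse("2018-11-30T20:11:00Z");
    System.out.println(windowedFilename(eventTime, Duration.ofMinutes(10)));
  }
}
```

The point of the sketch is that the name is only as meaningful as the event time fed into it: an element with no sensible timestamp produces a window, and a filename, with no sensible date.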

The windows, however, are based on event time. Beam assigns event timestamps for many inputs such as PubsubIO, but TextIO does not, and users are encouraged to apply a WithTimestamps transform after reading files to assign a relevant timestamp.

Currently, we don't assign timestamps, and we end up with files named like:

out--290308-12-21T20:00:00.000Z--290308-12-21T20:10:00.000Z-00000-of-00001.ndjson

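The start of that window is -290308-12-21T20:00:00.000Z, which probably comes from Beam's minimum timestamp (BoundedWindow.TIMESTAMP_MIN_VALUE, a Joda-Time Instant built from Long.MIN_VALUE microseconds converted to milliseconds) rather than from any real event time.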
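The year in that window start can be checked with a quick calculation. The sketch below assumes Beam's minimum timestamp is Long.MIN_VALUE microseconds truncated to milliseconds, and it uses `java.time` in place of Joda-Time. The class and method names are hypothetical. For comparison, it also prints an instant built directly from Long.MIN_VALUE milliseconds, which lands in a much earlier year.

```java
import java.time.Instant;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only; uses java.time rather than Joda-Time.
class MinTimestampSketch {
  // Assumption: the minimum timestamp is Long.MIN_VALUE microseconds, in milliseconds.
  static Instant minTimestampFromMicros() {
    long millis = TimeUnit.MICROSECONDS.toMillis(Long.MIN_VALUE);
    return Instant.ofEpochMilli(millis);
  }

  // For comparison: Long.MIN_VALUE interpreted directly as milliseconds.
  static Instant minTimestampFromMillis() {
    return Instant.ofEpochMilli(Long.MIN_VALUE);
  }

  public static void main(String[] args) {
    // Expected to fall in year -290308, matching the filename above.
    System.out.println(minTimestampFromMicros());
    // Expected to fall in a far earlier year (around -292 million).
    System.out.println(minTimestampFromMillis());
  }
}
```

If the first value is the right one, the 20:00:00 window start in the filename is just that instant snapped to a 10-minute boundary.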

This is going to cause a problem for batch processing where we read from files: all of our output will fall into a single window with that confusing date.

jklukas avatar Nov 30 '18 20:11 jklukas