bigquery-schema-generator
[FEATURE] Starting with an existing schema, epoch time causes mismatched type TIMESTAMP --> INTEGER
Current Behavior
When starting with an existing BigQuery schema that has a TIMESTAMP field in it, loading logs that contain an epoch time fails: the value is detected as an INTEGER, and we get the following error:
Error: [{'line_number': 1, 'msg': 'Ignoring field with mismatched type: old=(hard,event_time,NULLABLE,TIMESTAMP); new=(hard,event_time,NULLABLE,INTEGER)'}]
Expected Behavior
If the integer has the correct number of digits for an epoch timestamp supported by BigQuery, we should be able to assume that this INTEGER is in fact a TIMESTAMP and allow it to be maintained as such.
Suggested solution
Add a new if block to the convert_type function which allows btype = TIMESTAMP and atype = INTEGER to return TIMESTAMP if the integer matches the correct number of digits for an epoch time. There is added complexity because we do not currently pass the actual data for the record into this function; we may need to start doing so.
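The digit-count heuristic described above could be sketched as follows. This is a minimal illustration, not code from the repository; the function name and thresholds (10 digits for epoch seconds, 13 for epoch milliseconds) are assumptions for the sketch.

```python
def looks_like_epoch_timestamp(value):
    """Return True if an integer's digit count matches epoch seconds
    (10 digits, roughly years 2001-2286) or epoch milliseconds (13 digits).
    Hypothetical helper illustrating the proposed check."""
    if not isinstance(value, int) or value < 0:
        return False
    return len(str(value)) in (10, 13)
```

For example, `looks_like_epoch_timestamp(1607464095)` (the value from the transcript below) would pass, while a small counter like `42` would not.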
Thinking about this more: since we don't have the full record, I think it would be better to have the infer_value_type function infer a value that looks like an epoch time in seconds or milliseconds as a TIMESTAMP type with a mode of soft, then allow TIMESTAMP to be relaxed to INTEGER if the mode is soft; if the mode is hard, error like we are doing now.
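The soft/hard behavior proposed above might look something like this. This is a simplified sketch: the real convert_type in the generator has a different signature and does not take a mode argument, so the parameter and return convention here are illustrative only.

```python
def relax_timestamp(atype, btype, mode):
    """Illustrative type-merge rule: a TIMESTAMP inferred only from an
    epoch-looking integer ('soft' mode) may relax back to INTEGER, while a
    'hard' TIMESTAMP vs INTEGER conflict stays a mismatch (None = error)."""
    if atype == btype:
        return atype
    if {atype, btype} == {'TIMESTAMP', 'INTEGER'}:
        return 'INTEGER' if mode == 'soft' else None
    return None  # any other mismatch is an error, as today
```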
I understand it might be useful in your situation, but I'm not sure that an INTEGER -> TIMESTAMP conversion is the correct behavior, because bq load --autodetect does not support it:
$ cat timestamp.data.json
{ "ts": "2017-05-22T17:10:00-07:00" }
{ "ts": 1607464095 }
$ bq load --source_format NEWLINE_DELIMITED_JSON --replace --autodetect tmp.timestamp timestamp.data.json
Upload complete.
Waiting on bqjob_r4b69e8c388027c20_000001764455f168_1 ... (0s) Current status: DONE
BigQuery error in load operation: Error processing job 'vital-future-582:bqjob_r4b69e8c388027c20_000001764455f168_1': Error while reading data, error message:
JSON table encountered too many errors, giving up. Rows: 2; errors: 1. Please look into the errors[] collection for more details.
Failure details:
- Error while reading data, error message: JSON processing
encountered too many errors, giving up. Rows: 2; errors: 1; max
bad: 0; error percent: 0
- Error while reading data, error message: JSON parsing error in row
starting at position 81: Could not convert value to string. Field:
ts; Value: 1607464095
@bxparks I agree autodetection doesn't do this, but BigQuery load does support loading epoch data into a TIMESTAMP field if you provide the schema during load. With an added flag for this, would you be in favor of me fixing it? Something like --allow_epoch_timestamps?
Oh, I see. Although bq load --autodetect does not support heterogeneous TIMESTAMP | INTEGER values, if the table already exists and the field is already a TIMESTAMP, then bq load will happily accept an integer into that field. So yes, I think a flag to make generate-schema.py infer INTEGER timestamps would be good. I would call it --allow_integer_timestamps to be more descriptive. I don't like that it will make the code more complex and difficult to maintain, but I can see the value in it, especially since bq load accepts those integers.
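Wiring the proposed flag into an argparse-based CLI could look like the sketch below. Only the flag name --allow_integer_timestamps comes from the discussion; the parser setup around it is hypothetical.

```python
import argparse

# Hypothetical CLI wiring for the proposed flag; only the flag name
# --allow_integer_timestamps comes from the discussion above.
parser = argparse.ArgumentParser(prog='generate-schema')
parser.add_argument(
    '--allow_integer_timestamps',
    action='store_true',
    help='Infer integers whose digit counts match epoch seconds or '
         'epoch milliseconds as TIMESTAMP instead of INTEGER.')

args = parser.parse_args(['--allow_integer_timestamps'])
```

Keeping it opt-in preserves the current default behavior, so existing users see no change unless they pass the flag.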
(And you explained that same thing in your comment, but for some reason, the email version of that truncated the 1st paragraph, so I saw only your last sentence.)
Re: the email, that was my bad; I edited the comment afterward to make it more descriptive. I'll work on implementing this additional flag, and I'll make sure to have some tests for it, of course.