bigquery-schema-generator

[FEATURE] Starting with an existing schema, epoch times cause a mismatched type error: TIMESTAMP --> INTEGER

Open abroglesc opened this issue 3 years ago • 6 comments

Current Behavior

When starting with an existing BigQuery schema that has a TIMESTAMP field, we get an error when trying to load logs that contain an epoch time. The value is detected as an INTEGER, which produces the following error:

Error: [{'line_number': 1, 'msg': 'Ignoring field with mismatched type: old=(hard,event_time,NULLABLE,TIMESTAMP); new=(hard,event_time,NULLABLE,INTEGER)'}]

Expected Behavior

If the integer has the correct number of digits for an epoch timestamp supported by BigQuery (e.g. 10 digits for epoch seconds or 13 for epoch milliseconds on contemporary dates), we should be able to assume that this INTEGER is in fact a TIMESTAMP and keep it as such.

Suggested solution

Add a new if block to the convert_type function that allows btype = TIMESTAMP and atype = INTEGER to return TIMESTAMP when the integer has the correct number of digits for an epoch time.

There is added complexity because we do not pass the actual record data into this function; we may need to start doing that.
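
For illustration, here is a minimal sketch of that branch. The looks_like_epoch helper and its digit thresholds are my own, and, as noted above, convert_type would also need the raw value threaded through:

def looks_like_epoch(value):
    # Illustrative heuristic: 10 digits ~ epoch seconds for contemporary
    # dates (e.g. 1607464095), 13 digits ~ epoch milliseconds.
    return isinstance(value, int) and len(str(abs(value))) in (10, 13)

# Hypothetical extra branch inside convert_type(atype, btype), assuming
# the raw value is available there:
#
#     if btype == 'TIMESTAMP' and atype == 'INTEGER' and looks_like_epoch(value):
#         return 'TIMESTAMP'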

abroglesc avatar Dec 08 '20 19:12 abroglesc

Thinking about this more, since we don't have the full record, I think a better approach would be to have the infer_value_type function infer a value that looks like an epoch time in seconds or milliseconds as a TIMESTAMP type with a mode of soft, then allow TIMESTAMP to be relaxed to INTEGER when the mode is soft, but error when the mode is hard, like we are doing now.
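
A rough sketch of what that could look like; the function shapes and the (status, type) tuples are my simplification, not the actual implementation:

def infer_value_type(value):
    # Sketch only: map a JSON value to a (status, type) pair, where
    # 'soft' marks a guessed type that later values may relax.
    if isinstance(value, bool):  # check bool before int: bool subclasses int
        return ('hard', 'BOOLEAN')
    if isinstance(value, int):
        # Digit-count heuristic for epoch seconds/milliseconds.
        if len(str(abs(value))) in (10, 13):
            return ('soft', 'TIMESTAMP')
        return ('hard', 'INTEGER')
    if isinstance(value, float):
        return ('hard', 'FLOAT')
    return ('hard', 'STRING')

def convert_type(a, b):
    # Sketch only: merge two (status, type) pairs seen for one field.
    (astatus, atype), (bstatus, btype) = a, b
    if atype == btype:
        status = 'hard' if 'hard' in (astatus, bstatus) else 'soft'
        return (status, atype)
    if {atype, btype} == {'TIMESTAMP', 'INTEGER'} and 'soft' in (astatus, bstatus):
        # A guessed TIMESTAMP relaxes to INTEGER instead of erroring.
        return ('hard', 'INTEGER')
    return None  # hard mismatch: caller reports the error, as today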

abroglesc avatar Dec 08 '20 21:12 abroglesc

I understand it might be useful in your situation, but I'm not sure that an INTEGER -> TIMESTAMP conversion is the correct behavior, because bq load --autodetect does not support it:

$ cat timestamp.data.json 
{ "ts": "2017-05-22T17:10:00-07:00" }                                           
{ "ts": 1607464095 }

$ bq load --source_format NEWLINE_DELIMITED_JSON --replace  --autodetect tmp.timestamp timestamp.data.json
Upload complete.
Waiting on bqjob_r4b69e8c388027c20_000001764455f168_1 ... (0s) Current status: DONE   
BigQuery error in load operation: Error processing job 'vital-future-582:bqjob_r4b69e8c388027c20_000001764455f168_1': Error while reading data, error message:
JSON table encountered too many errors, giving up. Rows: 2; errors: 1. Please look into the errors[] collection for more details.
Failure details:
- Error while reading data, error message: JSON processing
encountered too many errors, giving up. Rows: 2; errors: 1; max
bad: 0; error percent: 0
- Error while reading data, error message: JSON parsing error in row
starting at position 81: Could not convert value to string. Field:
ts; Value: 1607464095

bxparks avatar Dec 08 '20 21:12 bxparks

@bxparks I agree autodetection doesn't do this, but BigQuery load does support loading epoch data into a TIMESTAMP field if you provide the schema during the load.
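
For example, something along these lines should work, with the schema given inline (reusing the table and file from the session above):

$ bq load --source_format NEWLINE_DELIMITED_JSON --replace tmp.timestamp timestamp.data.json ts:TIMESTAMP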

With an added flag gating this behavior, would you be in favor of me fixing this? Something like --allow_epoch_timestamps?

abroglesc avatar Dec 08 '20 22:12 abroglesc

Oh, I see. Although bq load --autodetect does not support heterogeneous TIMESTAMP | INTEGER values, if the table already exists and the field is already a TIMESTAMP, then bq load will happily accept an integer into that field. So yes, I think a flag to make generate-schema.py infer INTEGER timestamps would be good; I would call it --allow_integer_timestamps to be more descriptive. I don't like that it will make the code more complex and harder to maintain, but I can see the value in it, especially since bq load accepts those integers.
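
Usage would then be something like this (hypothetical, since the flag doesn't exist yet):

$ generate-schema --allow_integer_timestamps < timestamp.data.json > timestamp.schema.json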

bxparks avatar Dec 08 '20 22:12 bxparks

(And you explained that same thing in your comment, but for some reason, the email version of that truncated the 1st paragraph, so I saw only your last sentence.)

bxparks avatar Dec 08 '20 22:12 bxparks

Re: the email, that was my bad; I made a more descriptive edit to the comment after posting. I'll work on implementing this additional flag, and I'll make sure to have some tests for it, of course.

abroglesc avatar Dec 08 '20 22:12 abroglesc