monosi icon indicating copy to clipboard operation
monosi copied to clipboard

Timestamp columns inferred incorrectly

Open sreeladas opened this issue 2 years ago • 0 comments

Timestamp columns are incorrectly inferred on nested columns

On (at least) a bigquery dataset, when using tables that have nested/repeated columns that contain a timestamp element (i.e. column type array<timestamp, timestamp>), the field is incorrectly inferred as being a timestamp field due to containing timestamp in the type name. The first encountered timestamp field is used to calculate freshness here which errors out and causes the monitor for the table to fail.

Based on the source code, similar behaviour can be expected for columns of type e.g. struct<int, timestamp>, or or type struct/array with nested date/datetime-like fields

Expected behavior

I would expect one of two things to happen:

  1. The nested field is ignored for freshness calculation and if no other timestamp-like column is found, no freshness monitor is created for the table
  2. The nested field is unnested and the first field within that is used to calculate freshness. This is more complex and so more error-prone.

Steps to reproduce

  1. Create table with a nested timestamp/date-like field
  2. [Optional] create a control table without the nested structure
  3. Add the data source that includes the table with the nested field a. Navigate to /sources, b. Click "Create Data Source" c. Fill in credentials for the data source
  4. Wait for a few minutes to allow the monitors to run
  5. Run docker logs <CONTAINER_NAME> and check for progress/errors
  6. See error -- This should look something like:
ERROR:root:(google.cloud.bigquery.dbapi.exceptions.DatabaseError) 400 Invalid cast from ARRAY<STRUCT<start TIMESTAMP, end TIMESTAMP>> to TIMESTAMP at [37:25]
...

[SQL:
            SELECT
                CURRENT_TIMESTAMP() as `WINDOW_START`,
                CURRENT_TIMESTAMP() as `WINDOW_END`,
                COUNT(*) as `ROW_COUNT`,
                'xx' as `TABLE_NAME`,
                'xx' as `DATABASE_NAME`,
                'xx' as `SCHEMA_NAME`,
...
TIMESTAMP_DIFF(MAX(CAST(windows AS TIMESTAMP)), CURRENT_TIMESTAMP(), MINUTE) AS windows___freshness
            FROM xx.xx.xx;
        ]
(Background on this error at: https://sqlalche.me/e/14/4xp6)

sreeladas avatar May 13 '22 02:05 sreeladas