target-bigquery icon indicating copy to clipboard operation
target-bigquery copied to clipboard

Unable to use the generate_views when using fixed format data schema

Open jcbmllgn opened this issue 1 year ago • 0 comments

I'm running into an issue where this BigQuery target is unable to load data which contains nested JSON.

I have the following config:

  loaders:
  - name: target-bigquery
    variant: z3z1ma
    pip_url: git+https://github.com/z3z1ma/target-bigquery.git
    config:
      project: analytics-384214
      upsert: true
      schema_resolver_version: 2
      method: batch_job
      timeout: 1200
      generate_view: true

      column_name_transforms:
        quote: true
        add_underscore_when_invalid: true

Here's a small sample of the data coming from tap-postgres:

{"type": "SCHEMA", "stream": "public-channel_participants", "schema": {"type": "object", "properties": {"id": {"type": ["string"]}, "participant_id": {"type": ["null", "string"]}, "participant_type": {"type": ["null", "string"]}, "message_channel_id": {"type": ["null", "string"]}, "created_at": {"type": ["null", "string"], "format": "date-time"}, "updated_at": {"type": ["null", "string"], "format": "date-time"}, "discarded_at": {"type": ["null", "string"], "format": "date-time"}, "sources": {"type": ["null", "array"], "items": {"$ref": "#/definitions/sdc_recursive_string_array"}}, "discarded_reason": {"type": ["null", "string"]}, "created_by_id": {"type": ["null", "string"]}}, "definitions": {"sdc_recursive_integer_array": {"type": ["null", "integer", "array"], "items": {"$ref": "#/definitions/sdc_recursive_integer_array"}}, "sdc_recursive_number_array": {"type": ["null", "number", "array"], "items": {"$ref": "#/definitions/sdc_recursive_number_array"}}, "sdc_recursive_string_array": {"type": ["null", "string", "array"], "items": {"$ref": "#/definitions/sdc_recursive_string_array"}}, "sdc_recursive_boolean_array": {"type": ["null", "boolean", "array"], "items": {"$ref": "#/definitions/sdc_recursive_boolean_array"}}, "sdc_recursive_timestamp_array": {"type": ["null", "string", "array"], "format": "date-time", "items": {"$ref": "#/definitions/sdc_recursive_timestamp_array"}}, "sdc_recursive_object_array": {"type": ["null", "object", "array"], "items": {"$ref": "#/definitions/sdc_recursive_object_array"}}}}, "key_properties": ["id"], "bookmark_properties": []}
{"type": "RECORD", "stream": "public-channel_participants", "record": {"created_at": "2022-08-26T07:02:47.150141+00:00", "created_by_id": null, "discarded_at": null, "discarded_reason": null, "id": "a6e343be-9044-4c88-88c5-53e6101c7ccd", "message_channel_id": "ae44d92a-a18f-4553-ae24-67e8a5751e76", "participant_id": "717b15f5-164f-4747-a6b3-23221e64f2b7", "participant_type": "Chart", "sources": ["iehp"], "updated_at": "2022-10-19T07:04:09.323666+00:00"}, "version": 1682583863391, "time_extracted": "2023-04-27T08:24:23.391786Z"}
{"type": "RECORD", "stream": "public-channel_participants", "record": {"created_at": "2022-05-27T19:02:53.032560+00:00", "created_by_id": null, "discarded_at": null, "discarded_reason": null, "id": "834abe65-87d9-403d-b793-ae6510a091df", "message_channel_id": "02a8b9f2-274c-4706-b1ce-6fd8f81b1415", "participant_id": "d4446cf2-9ff9-4b5c-8b3a-fc303cfe9905", "participant_type": "Chart", "sources": [], "updated_at": "2022-05-27T19:02:53.032560+00:00"}, "version": 1682583863391, "time_extracted": "2023-04-27T08:24:23.391786Z"}

Here are the errors from the logs when i run meltano run tap-postgres target-bigquery:

2023-04-27T10:19:19.277544Z [info     ] Traceback (most recent call last): cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.278220Z [info     ]   File "/Users/jacob/code/pairteam/analytics-etl/.meltano/loaders/target-bigquery/venv/bin/target-bigquery", line 8, in <module> cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.278453Z [info     ]     sys.exit(TargetBigQuery.cli()) cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.278631Z [info     ]   File "/Users/jacob/code/pairteam/analytics-etl/.meltano/loaders/target-bigquery/venv/lib/python3.9/site-packages/click/core.py", line 1130, in __call__ cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.279208Z [info     ]     return self.main(*args, **kwargs) cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.279378Z [info     ]   File "/Users/jacob/code/pairteam/analytics-etl/.meltano/loaders/target-bigquery/venv/lib/python3.9/site-packages/click/core.py", line 1055, in main cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.279461Z [info     ]     rv = self.invoke(ctx)      cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.279524Z [info     ]   File "/Users/jacob/code/pairteam/analytics-etl/.meltano/loaders/target-bigquery/venv/lib/python3.9/site-packages/click/core.py", line 1404, in invoke cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.279588Z [info     ]     return ctx.invoke(self.callback, **ctx.params) cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.279698Z [info     ]   File "/Users/jacob/code/pairteam/analytics-etl/.meltano/loaders/target-bigquery/venv/lib/python3.9/site-packages/click/core.py", line 760, in invoke cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.279757Z [info     ]     return __callback(*args, **kwargs) cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.279837Z [info     ]   File "/Users/jacob/code/pairteam/analytics-etl/.meltano/loaders/target-bigquery/venv/lib/python3.9/site-packages/singer_sdk/target_base.py", line 578, in cli cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.279965Z [info     ]     target.listen(file_input)  cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.280024Z [info     ]   File "/Users/jacob/code/pairteam/analytics-etl/.meltano/loaders/target-bigquery/venv/lib/python3.9/site-packages/singer_sdk/io_base.py", line 34, in listen cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.280078Z [info     ]     self._process_lines(file_input) cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.280128Z [info     ]   File "/Users/jacob/code/pairteam/analytics-etl/.meltano/loaders/target-bigquery/venv/lib/python3.9/site-packages/singer_sdk/target_base.py", line 278, in _process_lines cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.280177Z [info     ]     counter = super()._process_lines(file_input) cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.280226Z [info     ]   File "/Users/jacob/code/pairteam/analytics-etl/.meltano/loaders/target-bigquery/venv/lib/python3.9/site-packages/singer_sdk/io_base.py", line 78, in _process_lines cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.281028Z [info     ]     self._process_schema_message(line_dict) cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.281206Z [info     ]   File "/Users/jacob/code/pairteam/analytics-etl/.meltano/loaders/target-bigquery/venv/lib/python3.9/site-packages/singer_sdk/target_base.py", line 378, in _process_schema_message cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.281264Z [info     ]     _ = self.get_sink(         cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.281718Z [info     ]   File "/Users/jacob/code/pairteam/analytics-etl/.meltano/loaders/target-bigquery/venv/lib/python3.9/site-packages/target_bigquery/target.py", line 482, in get_sink cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.281804Z [info     ]     return self.add_sink(stream_name, schema, key_properties) cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.281869Z [info     ]   File "/Users/jacob/code/pairteam/analytics-etl/.meltano/loaders/target-bigquery/venv/lib/python3.9/site-packages/singer_sdk/target_base.py", line 240, in add_sink cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.281924Z [info     ]     sink = sink_class(         cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.281977Z [info     ]   File "/Users/jacob/code/pairteam/analytics-etl/.meltano/loaders/target-bigquery/venv/lib/python3.9/site-packages/target_bigquery/batch_job.py", line 101, in __init__ cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.282535Z [info     ]     super().__init__(*args, **kwargs) cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.282589Z [info     ]   File "/Users/jacob/code/pairteam/analytics-etl/.meltano/loaders/target-bigquery/venv/lib/python3.9/site-packages/target_bigquery/core.py", line 303, in __init__ cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.282779Z [info     ]     self.create_target(key_properties=key_properties) cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.282839Z [info     ]   File "/Users/jacob/code/pairteam/analytics-etl/.meltano/loaders/target-bigquery/venv/lib/python3.9/site-packages/tenacity/__init__.py", line 289, in wrapped_f cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.283045Z [info     ]     return self(f, *args, **kw) cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.283336Z [info     ]   File "/Users/jacob/code/pairteam/analytics-etl/.meltano/loaders/target-bigquery/venv/lib/python3.9/site-packages/tenacity/__init__.py", line 379, in __call__ cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.283413Z [info     ]     do = self.iter(retry_state=retry_state) cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.284951Z [info     ]   File "/Users/jacob/code/pairteam/analytics-etl/.meltano/loaders/target-bigquery/venv/lib/python3.9/site-packages/tenacity/__init__.py", line 314, in iter cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.285092Z [info     ]     return fut.result()        cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.285153Z [info     ]   File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/concurrent/futures/_base.py", line 438, in result cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.293900Z [info     ]     return self.__get_result() cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.293978Z [info     ]   File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/concurrent/futures/_base.py", line 390, in __get_result cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.294997Z [info     ]     raise self._exception      cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.295084Z [info     ]   File "/Users/jacob/code/pairteam/analytics-etl/.meltano/loaders/target-bigquery/venv/lib/python3.9/site-packages/tenacity/__init__.py", line 382, in __call__ cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.295137Z [info     ]     result = fn(*args, **kwargs) cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.298823Z [info     ]   File "/Users/jacob/code/pairteam/analytics-etl/.meltano/loaders/target-bigquery/venv/lib/python3.9/site-packages/target_bigquery/core.py", line 471, in create_target cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.298893Z [info     ]     self.client.query(         cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.299758Z [info     ]   File "/Users/jacob/code/pairteam/analytics-etl/.meltano/loaders/target-bigquery/venv/lib/python3.9/site-packages/google/cloud/bigquery/job/query.py", line 1520, in result cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.299844Z [info     ]     do_get_result()            cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.299923Z [info     ]   File "/Users/jacob/code/pairteam/analytics-etl/.meltano/loaders/target-bigquery/venv/lib/python3.9/site-packages/google/api_core/retry.py", line 349, in retry_wrapped_func cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.299974Z [info     ]     return retry_target(       cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.300024Z [info     ]   File "/Users/jacob/code/pairteam/analytics-etl/.meltano/loaders/target-bigquery/venv/lib/python3.9/site-packages/google/api_core/retry.py", line 191, in retry_target cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.300473Z [info     ]     return target()            cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.300579Z [info     ]   File "/Users/jacob/code/pairteam/analytics-etl/.meltano/loaders/target-bigquery/venv/lib/python3.9/site-packages/google/cloud/bigquery/job/query.py", line 1510, in do_get_result cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.300708Z [info     ]     super(QueryJob, self).result(retry=retry, timeout=timeout) cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.301373Z [info     ]   File "/Users/jacob/code/pairteam/analytics-etl/.meltano/loaders/target-bigquery/venv/lib/python3.9/site-packages/google/cloud/bigquery/job/base.py", line 911, in result cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.302222Z [info     ]     return super(_AsyncJob, self).result(timeout=timeout, **kwargs) cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.302303Z [info     ]   File "/Users/jacob/code/pairteam/analytics-etl/.meltano/loaders/target-bigquery/venv/lib/python3.9/site-packages/google/api_core/future/polling.py", line 261, in result cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.302659Z [info     ]     raise self._exception      cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres
2023-04-27T10:19:19.302774Z [info     ] google.api_core.exceptions.BadRequest: 400 Invalid cast from STRING to JSON at [12:24] cmd_type=elb consumer=True name=target-bigquery-arc-postgres producer=False stdio=stderr string_id=target-bigquery-arc-postgres

I'm believe there's a bug somewhere in the generate_view_statement function in core.py, more specifically in how the sources column in the above records is parse and turned into a view. Here's the view SQL generated by generate_view_statement, this SQL does not compile in BigQuery:

SELECT
    JSON_VALUE(data, '$.id') as `id`,
    JSON_VALUE(data, '$.participant_id') as `participant_id`,
    JSON_VALUE(data, '$.participant_type') as `participant_type`,
    JSON_VALUE(data, '$.message_channel_id') as `message_channel_id`,
    CAST(JSON_VALUE(data, '$.created_at') as TIMESTAMP) as `created_at`,
    CAST(JSON_VALUE(data, '$.updated_at') as TIMESTAMP) as `updated_at`,
    CAST(JSON_VALUE(data, '$.discarded_at') as TIMESTAMP) as `discarded_at`,
        ARRAY(
          SELECT   CAST(JSON_VALUE(sources__rows, '$.sources') as JSON) as `sources`
          FROM UNNEST(
              JSON_QUERY_ARRAY(data, '$.sources')
          ) AS sources__rows
          WHERE   CAST(JSON_VALUE(sources__rows, '$.sources') as JSON) IS NOT NULL
        ) AS sources,
    JSON_VALUE(data, '$.discarded_reason') as `discarded_reason`,
    JSON_VALUE(data, '$.created_by_id') as `created_by_id`,
    _sdc_batched_at as _sdc_batched_at,
    _sdc_extracted_at as _sdc_extracted_at,
    _sdc_deleted_at as _sdc_deleted_at,
    _sdc_received_at as _sdc_received_at,
    _sdc_table_version as _sdc_table_version,
    _sdc_sequence as _sdc_sequence,
 FROM `analytics-384214`.`source_arc_postgres_dev`.`public_channel_participants`

Here's the error I get when i copy/paste the generated SQL into BigQuery:

image

The error goes away when I set generate_view to false. However, this isn't ideal because the data is of course much easier to work with if denormalized.

Let me know if there's any other context I can provide! I would like to be able to use the package with generate_view: true.

jcbmllgn avatar Apr 27 '23 10:04 jcbmllgn