Bigquery Stats Ingestion
Describe the bug
Using a BigQuery connection, Stats aren't collected by default.
To Reproduce
With the default YAML connection through the UI, I am unable to collect Stats from BigQuery tables.
```yaml
type: bigquery
config:
    env: qa
    credential:
        private_key_id: hidden
        project_id: hidden
        client_email: hidden
        private_key: "hidden"
        client_id: 'hidden'
    profiling:
        enabled: true
        profile_table_level_only: true
    include_table_lineage: true
    include_usage_statistics: true
    stateful_ingestion:
        enabled: false
```
No errors occur in the log; the job finishes with success.
Expected behavior
Collect Stats from BigQuery tables and display them in the interface. PostgreSQL collects Stats normally with the default YAML; I expected the same behavior for BigQuery.
Desktop (please complete the following information):
- OS: Windows 10
- Browser: Chrome
- DataHub version: 0.9.1
Additional context
No errors occur in the log.
@FelipeArruda, you have to tweak these properties; I think it fails to collect the stats because none of the tables are eligible for profiling. (With the values below it will profile every table.)

```yaml
profile_if_updated_since_days: null
profile_table_size_limit: null
profile_table_row_limit: null
```

I have an open PR to collect table-level stats even if a table is not eligible for profiling, which should land soon.
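For context, a minimal sketch of where those three properties sit in a recipe, assuming the same BigQuery source type as above; the keys match the configs later in this thread, but the surrounding values are illustrative only:

```yaml
source:
    type: bigquery
    config:
        profiling:
            enabled: true
            # Setting the eligibility cut-offs to null means no table is skipped
            # from profiling because of its age, size, or row count.
            profile_if_updated_since_days: null
            profile_table_size_limit: null
            profile_table_row_limit: null
```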
@treff7es, thank you for your reply. I changed the connection but it didn't work; it is still not collecting stats with this one.
Here is my tweak:
```yaml
source:
    type: bigquery
    config:
        include_tables: true
        capture_table_label_as_tag: true
        credential:
            private_key_id: hidden
            project_id: hidden
            client_email: hidden
            private_key: "hidden"
            client_id: 'hidden'
        include_usage_statistics: true
        profile_pattern:
            allow:
                - '.*'
            ignoreCase: true
        profiling:
            enabled: true
            max_workers: 1
            profile_table_level_only: true
            include_field_sample_values: true
            profile_if_updated_since_days: null
            profile_table_size_limit: null
            profile_table_row_limit: null
        stateful_ingestion:
            enabled: false
        include_views: true
        column_limit: 10000
        env: qa
        include_table_lineage: true
        dataset_pattern:
            deny:
                - temp
```
@FelipeArruda: profile_table_level_only: true should be set to false to collect column-level stats, but row counts should now be visible at the top of the Stats page.
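To make that concrete, a small sketch of just the profiling block with that change applied (the rest of the recipe stays the same as above):

```yaml
profiling:
    enabled: true
    # false => collect column-level stats as well, not only table-level counts
    profile_table_level_only: false
```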
@treff7es, I changed it to false and it is still not collecting stats and not showing row counts. I also had to change profile_table_row_limit: null to profile_table_row_limit: 50000 because I got an error.

Here is the error I got with profile_table_row_limit: null:
"[2022-11-28 20:42:57,662] ERROR {datahub.entrypoints:206} - Command failed: Can't load plugin: sqlalchemy.dialects:bigquery\n"
'Traceback (most recent call last):\n'
' File "/tmp/datahub/ingest/venv-bigquery-0.9.2/lib/python3.10/site-packages/datahub/entrypoints.py", line 164, in main\n'
' sys.exit(datahub(standalone_mode=False, **kwargs))\n'
' File "/tmp/datahub/ingest/venv-bigquery-0.9.2/lib/python3.10/site-packages/click/core.py", line 1130, in __call__\n'
' return self.main(*args, **kwargs)\n'
' File "/tmp/datahub/ingest/venv-bigquery-0.9.2/lib/python3.10/site-packages/click/core.py", line 1055, in main\n'
' rv = self.invoke(ctx)\n'
' File "/tmp/datahub/ingest/venv-bigquery-0.9.2/lib/python3.10/site-packages/click/core.py", line 1657, in invoke\n'
' return _process_result(sub_ctx.command.invoke(sub_ctx))\n'
' File "/tmp/datahub/ingest/venv-bigquery-0.9.2/lib/python3.10/site-packages/click/core.py", line 1657, in invoke\n'
' return _process_result(sub_ctx.command.invoke(sub_ctx))\n'
' File "/tmp/datahub/ingest/venv-bigquery-0.9.2/lib/python3.10/site-packages/click/core.py", line 1404, in invoke\n'
' return ctx.invoke(self.callback, **ctx.params)\n'
' File "/tmp/datahub/ingest/venv-bigquery-0.9.2/lib/python3.10/site-packages/click/core.py", line 760, in invoke\n'
' return __callback(*args, **kwargs)\n'
' File "/tmp/datahub/ingest/venv-bigquery-0.9.2/lib/python3.10/site-packages/click/decorators.py", line 26, in new_func\n'
' return f(get_current_context(), *args, **kwargs)\n'
' File "/tmp/datahub/ingest/venv-bigquery-0.9.2/lib/python3.10/site-packages/datahub/telemetry/telemetry.py", line 347, in wrapper\n'
' raise e\n'
' File "/tmp/datahub/ingest/venv-bigquery-0.9.2/lib/python3.10/site-packages/datahub/telemetry/telemetry.py", line 299, in wrapper\n'
' res = func(*args, **kwargs)\n'
' File "/tmp/datahub/ingest/venv-bigquery-0.9.2/lib/python3.10/site-packages/datahub/utilities/memory_leak_detector.py", line 95, in '
'wrapper\n'
' return func(ctx, *args, **kwargs)\n'
' File "/tmp/datahub/ingest/venv-bigquery-0.9.2/lib/python3.10/site-packages/datahub/cli/ingest_cli.py", line 192, in run\n'
' loop.run_until_complete(run_func_check_upgrade(pipeline))\n'
' File "/usr/local/lib/python3.10/asyncio/base_events.py", line 646, in run_until_complete\n'
' return future.result()\n'
' File "/tmp/datahub/ingest/venv-bigquery-0.9.2/lib/python3.10/site-packages/datahub/cli/ingest_cli.py", line 151, in '
'run_func_check_upgrade\n'
' ret = await the_one_future\n'
' File "/tmp/datahub/ingest/venv-bigquery-0.9.2/lib/python3.10/site-packages/datahub/cli/ingest_cli.py", line 142, in '
'run_pipeline_async\n'
' return await loop.run_in_executor(\n'
' File "/usr/local/lib/python3.10/concurrent/futures/thread.py", line 58, in run\n'
' result = self.fn(*self.args, **self.kwargs)\n'
' File "/tmp/datahub/ingest/venv-bigquery-0.9.2/lib/python3.10/site-packages/datahub/cli/ingest_cli.py", line 133, in '
'run_pipeline_to_completion\n'
' raise e\n'
' File "/tmp/datahub/ingest/venv-bigquery-0.9.2/lib/python3.10/site-packages/datahub/cli/ingest_cli.py", line 125, in '
'run_pipeline_to_completion\n'
' pipeline.run()\n'
' File "/tmp/datahub/ingest/venv-bigquery-0.9.2/lib/python3.10/site-packages/datahub/ingestion/run/pipeline.py", line 344, in run\n'
' for wu in itertools.islice(\n'
' File "/tmp/datahub/ingest/venv-bigquery-0.9.2/lib/python3.10/site-packages/datahub/ingestion/source/bigquery_v2/bigquery.py", line '
'511, in get_workunits\n'
' yield from self.profiler.get_workunits(self.db_tables)\n'
' File "/tmp/datahub/ingest/venv-bigquery-0.9.2/lib/python3.10/site-packages/datahub/ingestion/source/bigquery_v2/profiler.py", line '
'181, in get_workunits\n'
' for request, profile in self.generate_profiles(\n'
' File "/tmp/datahub/ingest/venv-bigquery-0.9.2/lib/python3.10/site-packages/datahub/ingestion/source/bigquery_v2/profiler.py", line '
'366, in generate_profiles\n'
' ge_profiler = self.get_profiler_instance()\n'
' File "/tmp/datahub/ingest/venv-bigquery-0.9.2/lib/python3.10/site-packages/datahub/ingestion/source/bigquery_v2/profiler.py", line '
'323, in get_profiler_instance\n'
' engine = create_engine(url, **self.config.options)\n'
' File "/tmp/datahub/ingest/venv-bigquery-0.9.2/lib/python3.10/site-packages/sqlalchemy/engine/__init__.py", line 525, in create_engine\n'
' return strategy.create(*args, **kwargs)\n'
' File "/tmp/datahub/ingest/venv-bigquery-0.9.2/lib/python3.10/site-packages/sqlalchemy/engine/strategies.py", line 61, in create\n'
' entrypoint = u._get_entrypoint()\n'
' File "/tmp/datahub/ingest/venv-bigquery-0.9.2/lib/python3.10/site-packages/sqlalchemy/engine/url.py", line 172, in _get_entrypoint\n'
' cls = registry.load(name)\n'
' File "/tmp/datahub/ingest/venv-bigquery-0.9.2/lib/python3.10/site-packages/sqlalchemy/util/langhelpers.py", line 277, in load\n'
' raise exc.NoSuchModuleError(\n'
Updating my source:
```yaml
source:
    type: bigquery
    config:
        include_tables: true
        capture_table_label_as_tag: true
        credential:
            private_key_id: hidden
            project_id: hidden
            client_email: hidden
            private_key: "hidden"
            client_id: 'hidden'
        include_usage_statistics: true
        profile_pattern:
            allow:
                - '.*'
            ignoreCase: true
        profiling:
            enabled: true
            max_workers: 10
            profile_table_level_only: false
            include_field_sample_values: true
            profile_if_updated_since_days: null
            profile_table_size_limit: null
            profile_table_row_limit: 50000
        stateful_ingestion:
            enabled: false
        include_views: true
        column_limit: 10000
        env: qa
        include_table_lineage: true
        dataset_pattern:
            deny:
                - temp
```
@FelipeArruda, can you please check with our latest rc release (0.9.2.5rc5) of the client? I recently fixed a couple of issues with it, and the fixes will go out in the next release.
Fixed with new release 0.9.3