
Bigquery Stats Ingestion

Open FelipeArruda opened this issue 2 years ago • 6 comments

Describe the bug: With a BigQuery connection, stats aren't collected by default.

To Reproduce: With the default YAML recipe created through the UI, stats are not collected from BigQuery tables.

    type: bigquery
    config:
        env: qa
        credential:
            private_key_id: hidden
            project_id: hidden
            client_email: hidden
            private_key: "hidden"
            client_id: 'hidden'
        profiling:
            enabled: true
            profile_table_level_only: true
        include_table_lineage: true
        include_usage_statistics: true
        stateful_ingestion:
            enabled: false

No errors appear in the log; the job finishes successfully.

Expected behavior: Stats are collected from BigQuery tables and displayed in the interface. PostgreSQL collects stats normally with the default YAML, so I expected the same behavior for BigQuery.

Desktop (please complete the following information):

  • OS: Windows 10
  • Browser: Chrome
  • DataHub Version: 0.9.1

Additional context: No errors appear in the log.

FelipeArruda avatar Nov 24 '22 13:11 FelipeArruda

@FelipeArruda, you have to tweak these properties. I think it fails to collect these stats because none of the tables are eligible for profiling (with the example values below, it will profile every table):

      profile_if_updated_since_days: null
      profile_table_size_limit: null
      profile_table_row_limit: null

I have an open PR to collect table-level stats even if a table is not eligible for profiling, which will land soon.
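
To make the eligibility idea concrete, here is a minimal sketch of how the three limits above could gate profiling. It is not DataHub's actual implementation: the `ProfilingLimits` and `TableInfo` classes, the `is_eligible_for_profiling` function, and the choice of GB as the size unit are all assumptions made for illustration. The only thing taken from the comment above is that setting a limit to null disables that check.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class ProfilingLimits:
    """Hypothetical mirror of the three recipe keys; None means 'no limit'."""
    profile_if_updated_since_days: Optional[float] = None
    profile_table_size_limit: Optional[int] = None  # assumed unit: GB
    profile_table_row_limit: Optional[int] = None


@dataclass
class TableInfo:
    """Hypothetical table metadata used only for this sketch."""
    last_altered: datetime
    size_in_bytes: int
    rows_count: int


def is_eligible_for_profiling(
    table: TableInfo, limits: ProfilingLimits, now: Optional[datetime] = None
) -> bool:
    """Return True only if the table passes every limit that is set."""
    now = now or datetime.now(timezone.utc)

    # Skip tables that have not been updated recently enough.
    if limits.profile_if_updated_since_days is not None:
        threshold = now - timedelta(days=limits.profile_if_updated_since_days)
        if table.last_altered < threshold:
            return False

    # Skip tables that are larger than the size limit.
    if limits.profile_table_size_limit is not None:
        if table.size_in_bytes / 2**30 > limits.profile_table_size_limit:
            return False

    # Skip tables that have more rows than the row limit.
    if limits.profile_table_row_limit is not None:
        if table.rows_count > limits.profile_table_row_limit:
            return False

    return True
```

Under this sketch, a large, rarely updated table is skipped when any limit is set too tightly, and the same table becomes eligible once all three limits are null.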

treff7es avatar Nov 28 '22 15:11 treff7es

@treff7es, thank you for your reply. I changed the connection but it didn't work; it is still not collecting stats with this one.

Here is my tweak:

source:
    type: bigquery
    config:
        include_tables: true
        capture_table_label_as_tag: true
        credential:
            private_key_id: hidden
            project_id: hidden
            client_email: hidden
            private_key: "hidden"
            client_id: 'hidden'
        include_usage_statistics: true
        profile_pattern:
            allow:
                - '.*'
            ignoreCase: true
        profiling:
            enabled: true
            max_workers: 1
            profile_table_level_only: true
            include_field_sample_values: true
            profile_if_updated_since_days: null
            profile_table_size_limit: null
            profile_table_row_limit: null
        stateful_ingestion:
            enabled: false
        include_views: true
        column_limit: 10000
        env: qa
        include_table_lineage: true
        dataset_pattern:
            deny:
                - temp

FelipeArruda avatar Nov 28 '22 18:11 FelipeArruda

@FelipeArruda: profile_table_level_only: true should be set to false to collect column-level stats, but row counts should now be visible at the top of the stats page.

treff7es avatar Nov 28 '22 18:11 treff7es

@treff7es, I changed it to false and it is still not collecting stats, and row counts are not showing up. I needed to change profile_table_row_limit: null to profile_table_row_limit: 50000 because I got an error.

Here is the error I got with profile_table_row_limit: null:

"[2022-11-28 20:42:57,662] ERROR    {datahub.entrypoints:206} - Command failed: Can't load plugin: sqlalchemy.dialects:bigquery\n"
           'Traceback (most recent call last):\n'
           '  File "/tmp/datahub/ingest/venv-bigquery-0.9.2/lib/python3.10/site-packages/datahub/entrypoints.py", line 164, in main\n'
           '    sys.exit(datahub(standalone_mode=False, **kwargs))\n'
           '  File "/tmp/datahub/ingest/venv-bigquery-0.9.2/lib/python3.10/site-packages/click/core.py", line 1130, in __call__\n'
           '    return self.main(*args, **kwargs)\n'
           '  File "/tmp/datahub/ingest/venv-bigquery-0.9.2/lib/python3.10/site-packages/click/core.py", line 1055, in main\n'
           '    rv = self.invoke(ctx)\n'
           '  File "/tmp/datahub/ingest/venv-bigquery-0.9.2/lib/python3.10/site-packages/click/core.py", line 1657, in invoke\n'
           '    return _process_result(sub_ctx.command.invoke(sub_ctx))\n'
           '  File "/tmp/datahub/ingest/venv-bigquery-0.9.2/lib/python3.10/site-packages/click/core.py", line 1657, in invoke\n'
           '    return _process_result(sub_ctx.command.invoke(sub_ctx))\n'
           '  File "/tmp/datahub/ingest/venv-bigquery-0.9.2/lib/python3.10/site-packages/click/core.py", line 1404, in invoke\n'
           '    return ctx.invoke(self.callback, **ctx.params)\n'
           '  File "/tmp/datahub/ingest/venv-bigquery-0.9.2/lib/python3.10/site-packages/click/core.py", line 760, in invoke\n'
           '    return __callback(*args, **kwargs)\n'
           '  File "/tmp/datahub/ingest/venv-bigquery-0.9.2/lib/python3.10/site-packages/click/decorators.py", line 26, in new_func\n'
           '    return f(get_current_context(), *args, **kwargs)\n'
           '  File "/tmp/datahub/ingest/venv-bigquery-0.9.2/lib/python3.10/site-packages/datahub/telemetry/telemetry.py", line 347, in wrapper\n'
           '    raise e\n'
           '  File "/tmp/datahub/ingest/venv-bigquery-0.9.2/lib/python3.10/site-packages/datahub/telemetry/telemetry.py", line 299, in wrapper\n'
           '    res = func(*args, **kwargs)\n'
           '  File "/tmp/datahub/ingest/venv-bigquery-0.9.2/lib/python3.10/site-packages/datahub/utilities/memory_leak_detector.py", line 95, in '
           'wrapper\n'
           '    return func(ctx, *args, **kwargs)\n'
           '  File "/tmp/datahub/ingest/venv-bigquery-0.9.2/lib/python3.10/site-packages/datahub/cli/ingest_cli.py", line 192, in run\n'
           '    loop.run_until_complete(run_func_check_upgrade(pipeline))\n'
           '  File "/usr/local/lib/python3.10/asyncio/base_events.py", line 646, in run_until_complete\n'
           '    return future.result()\n'
           '  File "/tmp/datahub/ingest/venv-bigquery-0.9.2/lib/python3.10/site-packages/datahub/cli/ingest_cli.py", line 151, in '
           'run_func_check_upgrade\n'
           '    ret = await the_one_future\n'
           '  File "/tmp/datahub/ingest/venv-bigquery-0.9.2/lib/python3.10/site-packages/datahub/cli/ingest_cli.py", line 142, in '
           'run_pipeline_async\n'
           '    return await loop.run_in_executor(\n'
           '  File "/usr/local/lib/python3.10/concurrent/futures/thread.py", line 58, in run\n'
           '    result = self.fn(*self.args, **self.kwargs)\n'
           '  File "/tmp/datahub/ingest/venv-bigquery-0.9.2/lib/python3.10/site-packages/datahub/cli/ingest_cli.py", line 133, in '
           'run_pipeline_to_completion\n'
           '    raise e\n'
           '  File "/tmp/datahub/ingest/venv-bigquery-0.9.2/lib/python3.10/site-packages/datahub/cli/ingest_cli.py", line 125, in '
           'run_pipeline_to_completion\n'
           '    pipeline.run()\n'
           '  File "/tmp/datahub/ingest/venv-bigquery-0.9.2/lib/python3.10/site-packages/datahub/ingestion/run/pipeline.py", line 344, in run\n'
           '    for wu in itertools.islice(\n'
           '  File "/tmp/datahub/ingest/venv-bigquery-0.9.2/lib/python3.10/site-packages/datahub/ingestion/source/bigquery_v2/bigquery.py", line '
           '511, in get_workunits\n'
           '    yield from self.profiler.get_workunits(self.db_tables)\n'
           '  File "/tmp/datahub/ingest/venv-bigquery-0.9.2/lib/python3.10/site-packages/datahub/ingestion/source/bigquery_v2/profiler.py", line '
           '181, in get_workunits\n'
           '    for request, profile in self.generate_profiles(\n'
           '  File "/tmp/datahub/ingest/venv-bigquery-0.9.2/lib/python3.10/site-packages/datahub/ingestion/source/bigquery_v2/profiler.py", line '
           '366, in generate_profiles\n'
           '    ge_profiler = self.get_profiler_instance()\n'
           '  File "/tmp/datahub/ingest/venv-bigquery-0.9.2/lib/python3.10/site-packages/datahub/ingestion/source/bigquery_v2/profiler.py", line '
           '323, in get_profiler_instance\n'
           '    engine = create_engine(url, **self.config.options)\n'
           '  File "/tmp/datahub/ingest/venv-bigquery-0.9.2/lib/python3.10/site-packages/sqlalchemy/engine/__init__.py", line 525, in create_engine\n'
           '    return strategy.create(*args, **kwargs)\n'
           '  File "/tmp/datahub/ingest/venv-bigquery-0.9.2/lib/python3.10/site-packages/sqlalchemy/engine/strategies.py", line 61, in create\n'
           '    entrypoint = u._get_entrypoint()\n'
           '  File "/tmp/datahub/ingest/venv-bigquery-0.9.2/lib/python3.10/site-packages/sqlalchemy/engine/url.py", line 172, in _get_entrypoint\n'
           '    cls = registry.load(name)\n'
           '  File "/tmp/datahub/ingest/venv-bigquery-0.9.2/lib/python3.10/site-packages/sqlalchemy/util/langhelpers.py", line 277, in load\n'
           '    raise exc.NoSuchModuleError(\n'
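
The failing call in the traceback above is SQLAlchemy's create_engine, and the message names what it could not resolve: the bigquery plugin in the sqlalchemy.dialects group. A rough way to check whether such a dialect is visible in a given virtual environment, assuming it is registered as a packaging entry point under that group (typically by the sqlalchemy-bigquery package, which is also an assumption here), is sketched below. The helper names `registered_dialects` and `has_dialect` are made up for this example.

```python
from importlib.metadata import entry_points
from typing import Set


def registered_dialects(group: str = "sqlalchemy.dialects") -> Set[str]:
    """Return the names of entry points installed under the given group."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+: entry points support selection by group.
        selected = eps.select(group=group)
    else:
        # Python 3.8/3.9: entry_points() returns a dict keyed by group.
        selected = eps.get(group, [])
    return {ep.name for ep in selected}


def has_dialect(name: str) -> bool:
    """True if a third-party dialect with this name is registered."""
    return name in registered_dialects()


if __name__ == "__main__":
    # Run this with the same interpreter that the ingestion uses.
    print("bigquery dialect registered:", has_dialect("bigquery"))
```

If this prints False inside the ingestion venv, the dialect package is missing from that environment, which would be consistent with the NoSuchModuleError above. Dialects bundled with SQLAlchemy itself are not listed this way, so the check is only meaningful for third-party dialects.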

FelipeArruda avatar Nov 28 '22 20:11 FelipeArruda

Updating my source:

source:
    type: bigquery
    config:
        include_tables: true
        capture_table_label_as_tag: true
        credential:
            private_key_id: hidden
            project_id: hidden
            client_email: hidden
            private_key: "hidden"
            client_id: 'hidden'
        include_usage_statistics: true
        profile_pattern:
            allow:
                - '.*'
            ignoreCase: true
        profiling:
            enabled: true
            max_workers: 10
            profile_table_level_only: false
            include_field_sample_values: true
            profile_if_updated_since_days: null
            profile_table_size_limit: null
            profile_table_row_limit: 50000
        stateful_ingestion:
            enabled: false
        include_views: true
        column_limit: 10000
        env: qa
        include_table_lineage: true
        dataset_pattern:
            deny:
                - temp

FelipeArruda avatar Nov 28 '22 20:11 FelipeArruda

@FelipeArruda, can you please check with our latest RC release (0.9.2.5rc5) client? We recently fixed a couple of issues with it, and the fixes will go out in the next release.

treff7es avatar Nov 28 '22 21:11 treff7es

Fixed with the new release, 0.9.3.

FelipeArruda avatar Dec 07 '22 12:12 FelipeArruda