OOM (out-of-memory) errors in DataHub 0.13+ when ingesting BigQuery metadata
Describe the bug
We have an ingestion job running periodically in Kubernetes; it runs fine with DataHub 0.12.x versions.
You can see that memory stays stable under 1 GiB during the execution of the job.
However, with DataHub 0.13.x versions it always fails with error 137, out of memory.
We have tried increasing the memory to 20 GiB, but there must be a memory leak, because it always runs out of memory.
To Reproduce
Steps to reproduce the behavior:
- Kubernetes CronJob set up
- 2Gi memory
- 2 CPU
- Our deployment:
- name: bigquery
  image: 'acryldata/datahub-ingestion:v0.13.3'
  imagePullPolicy: Always
  args: ["ingest", "-c", "/recipes/bq-recipe.yml"]
Note: we have also tried the latest release built from the latest commit on master, and the issue is still present.
- Our BigQuery recipe:
source:
  type: bigquery
  config:
    project_on_behalf: {{ bq_slots_project }}
    project_id_pattern:
      allow:
        - .*{{ gcp_project }}
    dataset_pattern:
      allow:
        - {{ profile_dataset }}
      deny:
        - ^temp_.*
        - .*_temp$
        - .*_temp_.*
        - .*-temp.*
        - .*temporary.*
    use_exported_bigquery_audit_metadata: true
    bigquery_audit_metadata_datasets:
      - {{ gcp_project }}.bigquery_audit_log
    use_date_sharded_audit_log_tables: true
    upstream_lineage_in_report: true
    include_usage_statistics: true
    capture_table_label_as_tag: true
    capture_dataset_label_as_tag: true
    extract_column_lineage: true
    convert_urns_to_lowercase: true
    profiling:
      enabled: "true"
      profile_table_size_limit: null
      profile_table_row_limit: null
      use_sampling: false
      partition_profiling_enabled: false
      include_field_mean_value: false
      include_field_median_value: false
      include_field_sample_values: false
      include_field_stddev_value: false
    stateful_ingestion:
      enabled: true
      state_provider:
        type: "datahub"
        config:
          datahub_api:
            server: "http://our-gms:8080"
pipeline_name: {{ pipeline }}
sink:
  type: "datahub-rest"
  config:
    server: "http://our-gms:8080"
Expected behavior
Not having an out-of-memory error in DataHub 0.13.3.
Desktop (please complete the following information):
- OS: Google Cloud GKE/Kubernetes
- Version: DataHub 0.13.x
Additional context
Logs with a summary from a successful execution with DataHub 0.12.x:
'num_total_lineage_entries': {'host': 108696},
'num_skipped_lineage_entries_missing_data': {'host': 14177},
'num_skipped_lineage_entries_not_allowed': {'host': 90366},
'num_lineage_entries_sql_parser_failure': {'host': 147},
'num_skipped_lineage_entries_other': {},
'num_lineage_total_log_entries': {'*': 108696},
'num_lineage_parsed_log_entries': {'*': 108696},
'lineage_metadata_entries': {'*': 99},
'num_usage_total_log_entries': {'*': 175145},
'num_usage_parsed_log_entries': {'*': 122613},
'sampled': '10 sampled of at most 2880 entries.'},
'total_query_log_entries': 0,
'audit_log_api_perf': {'get_exported_log_entries': '397.101 seconds', 'list_log_entries': None},
'num_total_lineage_entries': {'host': 108696},
'num_skipped_lineage_entries_missing_data': {'host': 14177},
'num_skipped_lineage_entries_not_allowed': {'host': 90366},
'num_lineage_entries_sql_parser_failure': {'host': 147},
'num_skipped_lineage_entries_other': {},
'num_lineage_total_log_entries': {'*': 108696},
'num_lineage_parsed_log_entries': {'*': 108696},
'lineage_metadata_entries': {'*': 99},
'num_usage_total_log_entries': {'*': 175145},
'num_usage_parsed_log_entries': {'*': 122613},
-----
'sampled': '10 sampled of at most 3059 entries.'},
'partition_info': {},
----
'lineage_start_time': '2024-08-10 00:00:00+00:00 (1 day, 3 hours and 26 minutes ago)',
'lineage_end_time': '2024-08-11 00:00:08.942642+00:00 (3 hours, 26 minutes and 30.95 seconds ago)',
'stateful_lineage_ingestion_enabled': "default=True description='Enable stateful lineage ingestion. This will store lineage window timestamps after "
"successful lineage ingestion. and will not run lineage ingestion for same timestamps in subsequent run. '"
'extra={}',
'usage_start_time': '2024-08-10 00:00:00+00:00 (1 day, 3 hours and 26 minutes ago)',
'usage_end_time': '2024-08-11 00:00:08.942642+00:00 (3 hours, 26 minutes and 30.95 seconds ago)',
'stateful_usage_ingestion_enabled': True,
'start_time': '2024-08-11 00:00:08.984329 (3 hours, 26 minutes and 30.91 seconds ago)',
'running_time': '3 hours, 26 minutes and 30.91 seconds'}
Sink (datahub-rest) report:
{'total_records_written': 33546,
'records_written_per_second': 2,
'warnings': [],
'failures': [],
'start_time': '2024-08-11 00:00:03.217296 (3 hours, 26 minutes and 36.68 seconds ago)',
'current_time': '2024-08-11 03:26:39.897462 (now)',
'total_duration_in_seconds': 12396.68,
'gms_version': 'v0.13.0',
'pending_requests': 0}
{}
Pipeline finished with at least 5 warnings; produced 33546 events in 3 hours, 26 minutes and 30.91 seconds.
Same behaviour in 0.14.x (tested with 0.14.0.1 and 0.14.0.2).
@vejeta could you generate a memory profile - see https://datahubproject.io/docs/metadata-ingestion/docs/dev_guides/profiling_ingestions/
Thanks, @hsheth2 for the suggestion! I will do that with the latest version (0.14.1)
Just out of curiosity, to see if this is easier: is there a published Docker image with the memray dependency included? After enabling memory profiling in my recipe, I am getting:
ERROR {datahub.entrypoints:218} - Command failed: No module named 'memray'
If that doesn't exist, no worries; I will test it with our own Docker image with memray included.
memray is installed when you install the datahub debug sub-package: https://github.com/datahub-project/datahub/blob/cf1d2966da418535ab53826fa51c73ac51290f09/metadata-ingestion/setup.py#L895
Check these docs on how to use it: https://datahubproject.io/docs/metadata-ingestion/docs/dev_guides/profiling_ingestions/#how-to-use (opt for the CLI option).
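In practice, inside the ingestion container this roughly amounts to installing the source plugin together with the debug extra (a sketch; the exact set of extras you need beyond bigquery may vary):
pip install 'acryl-datahub[bigquery,debug]'
# sanity check that memray is now importable in the same environment
python -c "import memray; print(memray.__version__)"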
@hsheth2, @pedro93 I have generated the file with the memory profiling. It is 801 MB in size.
Shall I provide the output of running memray over it somehow? What output can be more useful?
For starters, if you could run memray flamegraph file.bin to generate an html file, that'd be helpful. Hopefully that gives us enough info, but we might need the full file if there isn't enough in the flamegraph.
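For example, assuming the capture file produced by the profiler is called bigquery-run.bin (the filename here is just a placeholder):
# generates an HTML flamegraph next to the capture file
memray flamegraph bigquery-run.bin
# optional: a quick textual summary of the allocation statistics
memray stats bigquery-run.bin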
Also, we have separately been working on a BigQuery lineage/usage v2 implementation (enabled with use_queries_v2). If the bottleneck is from that, we'll probably ask you to enable the new version, since it is a bit more optimized with regard to runtime and memory usage.
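For reference, a sketch of how that flag would presumably be switched on in the recipe shared above (only the use_queries_v2 name comes from this thread; its placement under the source config is an assumption):
source:
  type: bigquery
  config:
    use_queries_v2: true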
@hsheth2 I see. Thanks a lot. This has been generated with 0.13.3.
I tried with 0.14.1, but our GMS is not on 0.14.x yet, so there are some incompatibilities when sending the metadata.
Hopefully it is enough to locate the bottleneck.
If use_queries_v2 requires 0.14.x, we will need to perform an upgrade first, which will take a bit of time.
Attached the flamegraph:
@vejeta looking at the flamegraph, it looks like the OOM is being caused by our SQL parser.
If that's the case, use_queries_v2 probably won't help. We have made a few tweaks to the SQL parser since 0.13.3 that might help, but I'm not too confident about that.
If you could run with datahub --debug ingest ..., the last ~1000 lines or so of the log should have some additional details about the exact SQL query that we're having issues with.
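One way to capture that is to tee the debug output to a file and keep only the tail (the log paths here are illustrative; the recipe path matches the one used earlier in this thread):
datahub --debug ingest -c /recipes/bq-recipe.yml 2>&1 | tee /tmp/bq-ingest-debug.log
tail -n 1000 /tmp/bq-ingest-debug.log > /tmp/bq-ingest-debug-tail.log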
Thanks a lot @hsheth2, here are the last 4000 lines with --debug.
I had to "anonymize" the field names; hopefully it can still give a clue about what introduced the memory leak between 0.12.x and 0.13+.
Hi, we are facing the same issue. All the info is in ticket https://github.com/datahub-project/datahub/issues/11597, and I also attach the anonymized logs captured in debug mode: acryl.log
Now that https://github.com/datahub-project/datahub/pull/11432 and #11645 are both part of 0.14.1.3 - would you guys mind giving this another try?
Thanks a lot @hsheth2 for the quick feedback on this issue. I am going to work on migrating from 0.13.x to the latest 0.14.x so I can provide feedback on this.
Will the tag 0.14.1.3 be published soon?
That's the CLI version, which is published here https://pypi.org/project/acryl-datahub/0.14.1.3/
We have upgraded to 0.14.1.
I have tried the datahub CLI version that you recommended, using this in our BigQuery ingestion job:
containers:
  - name: bigquery
    image: 'acryldata/datahub-ingestion:{{ DATAHUB_INGESTION_VERSION | default('v0.14.1') }}'
    imagePullPolicy: Always
    args: ["ingest", "-c", "/recipes/bq-profile-recipe.yml"]
    command: [ "sh", "-c", "pip install acryl-datahub[bigquery,cloud,debug]==0.14.1.3 && python -m pip install memray && datahub --debug ingest -c /recipes/bq-profile-recipe.yml" ]
That pulls the required version and starts the process as usual, but we see a new issue, the same one described here: https://github.com/datahub-project/datahub/issues/11679
failed to write record with workunit urn:li:container:631fc658010d79ac8c2b5161ec400174-browsePathsV2 with
('Unable to emit metadata to DataHub GMS, likely because the server version is too old relative to the client: Failed to validate record with class com.linkedin.container.ContainerProperties:
ERROR :: /env :: unrecognized field found but not allowed\n', {'exceptionClass':
Is there any workaround?
I ran it with 0.14.1.12 and it ran without the metadata error, but the process was killed due to OOM again. I will send the memory dump generated with memray next.
@vejeta Sorry I missed your earlier message here
Re workaround - you can set the DATAHUB_INCLUDE_ENV_IN_CONTAINER_PROPERTIES=false env variable - it's a bit of a hack we put in to work around that exact issue.
https://github.com/datahub-project/datahub/blob/07034caf09a930ad39fc2a594d1503bdf6c34ed8/metadata-ingestion/src/datahub/emitter/mcp_builder.py#L39-L47
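For example, in the sh -c command pattern used above, that could look like this (a sketch; only the variable name and value come from the comment above):
export DATAHUB_INCLUDE_ENV_IN_CONTAINER_PROPERTIES=false
datahub --debug ingest -c /recipes/bq-profile-recipe.yml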
I was optimistic about those fixes improving this situation, but I guess not. Both the logs and the memory dump from the latest run would be super useful - thanks
Thanks @hsheth2, no worries, all the hints are really appreciated.
I attach the latest flamegraph, with 0.14.1 on the backend (GMS) and 0.14.1.12 on the Python datahub CLI:
memray-flamegraph-bigquery-2024_12_13-01_16_09-mweoqb.html.gz
Weirdly, I don't really see any issues in this flamegraph. The peak memory usage is reported as 1.3 GB, and most of it is consumed by network IO.
I see a lot of network issues in this last run; they also appear in the logs, so we could have experienced a network outage during this run. I will repeat the process. I might need some time, as I am on vacation. Thanks!
Sorry for the delay @hsheth2,
I had the chance to perform another run. As a recap: our GMS is on 0.14.1, and the cronjob runs just fine with 0.12.0, without OOM issues.
I performed another run with memray enabled, using 0.15.0.4:
command: [ "sh", "-c", "mkdir -p /opt/datahub && python -m pip install memray && pip install acryl-datahub[bigquery,cloud,debug]==0.15.0.4 && datahub --debug ingest -c /recipes/bq-profile-recipe.yml" ]
I couldn't use 0.15.0.5 as the client because it complained about:
ImportError: cannot import name 'IcebergCatalogInfoClass' from 'acryl_datahub_cloud.metadata.schema_classes' (/datahub-ingestion/.venv/lib/python3.10/site-packages/acryl_datahub_cloud/metadata/schema_classes.py)
I attach the latest flamegraph and logs:
memray-flamegraph-bigquery-2025_02_18-13_22_36-5jmajx.html.gz bigquery_18000.log.gz
Edit:
Adding a couple of graphs and screenshots taken during the execution. The memory behaviour shows a different curve than previously, although the result is the same: the pod is killed with error code 137.
We recently migrated to DataHub 1.0.0, both on the GMS and the datahub CLI; the error is still present, and the OOM appears on this specific job.
The issue keeps happening with DataHub versions 1.2.0 and 1.2.0.1.