
OOM (Out of Memory) errors in DataHub 0.13+ when ingesting BigQuery metadata

Open · vejeta opened this issue 1 year ago · 12 comments

Describe the bug

We have an ingestion job running periodically in Kubernetes; it runs fine with DataHub 0.12.x versions (screenshot: 0_12_0_Datahub).

You can see the memory stays stable under 1GiB during the execution of the job.

However, with DataHub 0.13.x versions it always fails with error 137, out of memory (screenshot: OOM_137).

We have tried increasing the memory to 20GiB, but there must be a memory leak, because it always runs out of memory.

To Reproduce

Steps to reproduce the behavior:

  • Kubernetes CronJob set up
    • 2Gi memory
    • 2 CPU
  • Our deployment:
            - name: bigquery
              image: 'acryldata/datahub-ingestion:v0.13.3'
              imagePullPolicy: Always
              args: ["ingest", "-c", "/recipes/bq-recipe.yml"]    

Note: we have also tried the latest build from the latest commit on master, and the issue is still present.

  • Our BigQuery recipe
            source:
              type: bigquery
              config:
                project_on_behalf: {{ bq_slots_project }}
                project_id_pattern:
                  allow:
                    - .*{{ gcp_project }}
                dataset_pattern:
                  allow:
                    - {{ profile_dataset }}
                  deny:
                    - ^temp_.*
                    - .*_temp$
                    - .*_temp_.*
                    - .*-temp.*
                    - .*temporary.*
        
                use_exported_bigquery_audit_metadata: true
                bigquery_audit_metadata_datasets:
                  - {{ gcp_project }}.bigquery_audit_log
                use_date_sharded_audit_log_tables: true
                upstream_lineage_in_report: true
        
                include_usage_statistics: true
        
                capture_table_label_as_tag: true
                capture_dataset_label_as_tag: true
                extract_column_lineage: true
                convert_urns_to_lowercase: true
        
                profiling:
                  enabled: "true"
                  profile_table_size_limit: null
                  profile_table_row_limit: null

                  use_sampling: false
                  partition_profiling_enabled: false
                  include_field_mean_value: false
                  include_field_median_value: false
                  include_field_sample_values: false
                  include_field_stddev_value: false

        
                stateful_ingestion:
                  enabled: true
                  state_provider:
                    type: "datahub"
                    config:
                      datahub_api:
                        server: "http://our-gms:8080"
        
            pipeline_name: {{ pipeline }}
            sink:
              type: "datahub-rest"
              config:
                server: "http://our-gms:8080"

Expected behavior

No out-of-memory error in DataHub 0.13.3.


Desktop (please complete the following information):

  • OS: Google Cloud GKE/Kubernetes
  • Version: DataHub 0.13.x

Additional context

Log excerpt with the summary report after a successful execution with DataHub 0.12.x:

 'num_total_lineage_entries': {'host': 108696},
 'num_skipped_lineage_entries_missing_data': {'host': 14177},
 'num_skipped_lineage_entries_not_allowed': {'host': 90366},
 'num_lineage_entries_sql_parser_failure': {'host': 147},
 'num_skipped_lineage_entries_other': {},
 'num_lineage_total_log_entries': {'*': 108696},
 'num_lineage_parsed_log_entries': {'*': 108696},
 'lineage_metadata_entries': {'*': 99},
 'num_usage_total_log_entries': {'*': 175145},
 'num_usage_parsed_log_entries': {'*': 122613},
 'sampled': '10 sampled of at most 2880 entries.'},
 'total_query_log_entries': 0,
 'audit_log_api_perf': {'get_exported_log_entries': '397.101 seconds', 'list_log_entries': None},
 'num_total_lineage_entries': {'host': 108696},
 'num_skipped_lineage_entries_missing_data': {'host': 14177},
 'num_skipped_lineage_entries_not_allowed': {'host': 90366},
 'num_lineage_entries_sql_parser_failure': {'host': 147},
 'num_skipped_lineage_entries_other': {},
 'num_lineage_total_log_entries': {'*': 108696},
 'num_lineage_parsed_log_entries': {'*': 108696},
 'lineage_metadata_entries': {'*': 99},
 'num_usage_total_log_entries': {'*': 175145},
 'num_usage_parsed_log_entries': {'*': 122613},
-----
                      'sampled': '10 sampled of at most 3059 entries.'},
 'partition_info': {},
----
 'lineage_start_time': '2024-08-10 00:00:00+00:00 (1 day, 3 hours and 26 minutes ago)',
 'lineage_end_time': '2024-08-11 00:00:08.942642+00:00 (3 hours, 26 minutes and 30.95 seconds ago)',
 'stateful_lineage_ingestion_enabled': "default=True description='Enable stateful lineage ingestion. This will store lineage window timestamps after "
                                       "successful lineage ingestion. and will not run lineage ingestion for same timestamps in subsequent run. '"
                                       'extra={}',
 'usage_start_time': '2024-08-10 00:00:00+00:00 (1 day, 3 hours and 26 minutes ago)',
 'usage_end_time': '2024-08-11 00:00:08.942642+00:00 (3 hours, 26 minutes and 30.95 seconds ago)',
 'stateful_usage_ingestion_enabled': True,
 'start_time': '2024-08-11 00:00:08.984329 (3 hours, 26 minutes and 30.91 seconds ago)',
 'running_time': '3 hours, 26 minutes and 30.91 seconds'}
Sink (datahub-rest) report:
{'total_records_written': 33546,
 'records_written_per_second': 2,
 'warnings': [],
 'failures': [],
 'start_time': '2024-08-11 00:00:03.217296 (3 hours, 26 minutes and 36.68 seconds ago)',
 'current_time': '2024-08-11 03:26:39.897462 (now)',
 'total_duration_in_seconds': 12396.68,
 'gms_version': 'v0.13.0',
 'pending_requests': 0}
{}
Pipeline finished with at least 5 warnings; produced 33546 events in 3 hours, 26 minutes and 30.91 seconds.

vejeta (Aug 12 '24)

Same behaviour in 0.14.x (tested in 0.14.0.1, 0.14.0.2)

vejeta (Aug 22 '24)

@vejeta could you generate a memory profile - see https://datahubproject.io/docs/metadata-ingestion/docs/dev_guides/profiling_ingestions/

hsheth2 (Sep 16 '24)
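For reference, the linked profiling guide works by adding a flags section to the ingestion recipe; a minimal sketch (the flags key is taken from that guide and the output path is illustrative; double-check both against the docs for your CLI version):

    source:
      type: bigquery
      config:
        # ... existing BigQuery config from the recipe above ...
    flags:
      # Folder where memray dumps (.bin files) will be written; mount a volume
      # with enough space, since the dump can be large.
      generate_memory_profiles: "/tmp/datahub-memray"
    sink:
      type: "datahub-rest"
      config:
        server: "http://our-gms:8080"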

Thanks, @hsheth2 for the suggestion! I will do that with the latest version (0.14.1)

vejeta (Sep 18 '24)

Just out of curiosity, to see if this is easier, is there a published docker image with the memray dependency included? After enabling the memory profiling in my recipe, I am getting:

ERROR    {datahub.entrypoints:218} - Command failed: No module named 'memray' 

If that doesn't exist, no worries; I will test it with our own Docker image with memray included.

vejeta (Sep 25 '24)

memray is installed when you install the datahub debug sub-package: https://github.com/datahub-project/datahub/blob/cf1d2966da418535ab53826fa51c73ac51290f09/metadata-ingestion/setup.py#L895

Check these docs on how to use it: https://datahubproject.io/docs/metadata-ingestion/docs/dev_guides/profiling_ingestions/#how-to-use (opt for the cli option)

pedro93 (Sep 25 '24)
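As a concrete example of the above, the debug extra can be added to the pip install that already runs inside the ingestion container; a sketch assuming the image tag and recipe path used earlier in this thread (extra names per the setup.py linked above):

    containers:
      - name: bigquery
        image: 'acryldata/datahub-ingestion:v0.13.3'
        imagePullPolicy: Always
        # Install the BigQuery plugin plus the debug extra (which pulls in memray),
        # then run the ingestion as before.
        command: [ "sh", "-c", "pip install 'acryl-datahub[bigquery,debug]' && datahub ingest -c /recipes/bq-recipe.yml" ]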

@hsheth2, @pedro93 I have generated the file with the memory profiling. It is 801 MB in size.

Shall I provide the output of running memray over it somehow? What output would be most useful?

vejeta (Oct 02 '24)

For starters, if you could run memray flamegraph file.bin to generate an html file, that'd be helpful. Hopefully that gives us enough info, but we might need the full file if there isn't enough in the flamegraph.

Also - we separately have been working on a bigquery lineage/usage v2 implementation (enabled with use_queries_v2). If the bottleneck is from that, we'll probably ask you to enable the new version since it is a bit more optimized with regards to runtime and memory usage

hsheth2 (Oct 02 '24)
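If it comes to trying the v2 path, enabling it should just be a recipe flag; a sketch based on the flag name mentioned above (availability depends on the CLI version, so verify it is supported before relying on it):

    source:
      type: bigquery
      config:
        # Opt into the reworked lineage/usage extraction mentioned in the comment above.
        use_queries_v2: true
        # ... rest of the existing BigQuery config ...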

@hsheth2 I see. Thanks a lot. This has been generated with 0.13.3.

I tried with 0.14.1 but our GMS is not in 0.14.x yet, so there are some incompatibilities when sending the metadata.

Hopefully, it is enough to locate the bottleneck. If use_queries_v2 is only available in 0.14.x, we will need to perform an upgrade first, which will take a bit of time.

Attached the flamegraph:

memray-flamegraph-bigquery-2024_10_01-20_23_13.html.gz

vejeta (Oct 02 '24)

@vejeta looking at the flamegraph, it looks like the OOM is being caused by our SQL parser.

If that's the case, use_queries_v2 probably won't help. We made a few tweaks to the SQL parser since 0.13.3 which might help, but I'm not too confident on that.

If you could run with datahub --debug ingest ..., the last ~1000 lines or so of the log should have some additional details about the exact SQL query that we're having issues with.

hsheth2 (Oct 02 '24)
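One way to capture that tail from the Kubernetes job is to tee the debug output to a file and keep only the last portion; a sketch (file paths are illustrative):

    containers:
      - name: bigquery
        image: 'acryldata/datahub-ingestion:v0.13.3'
        # Run with debug logging, keep the full log, then extract the last ~1000 lines
        # where the problematic SQL query should be reported.
        command: [ "sh", "-c", "datahub --debug ingest -c /recipes/bq-recipe.yml 2>&1 | tee /tmp/bq-ingest-debug.log; tail -n 1000 /tmp/bq-ingest-debug.log > /tmp/bq-ingest-debug-tail.log" ]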

Thanks a lot @hsheth2, here are the last 4000 lines with --debug.

I had to "anonymize" the field names; hopefully they still give a clue about what introduced the memory leak between 0.12.x and 0.13+.

bigquery_job_4000lines.txt.gz

vejeta (Oct 09 '24)

Hi, we are facing the same issue; all the info is in ticket https://github.com/datahub-project/datahub/issues/11597, and I also attach here the anonymized logs captured in debug mode: acryl.log

elopez-lodgify (Oct 14 '24)

Now that https://github.com/datahub-project/datahub/pull/11432 and #11645 are both part of 0.14.1.3 - would you guys mind giving this another try?

hsheth2 (Oct 18 '24)

Thanks a lot @hsheth2 for the quick feedback on this issue. I am going to work on migrating from 0.13.x to the latest 0.14.x so I can provide feedback on this.

Will the tag 0.14.1.3 be published soon?

vejeta (Oct 20 '24)

That's the CLI version, which is published here https://pypi.org/project/acryl-datahub/0.14.1.3/

hsheth2 (Oct 22 '24)

We have upgraded to 0.14.1.

I have tried the datahub CLI version that you recommended, using this in our BigQuery ingestion job spec:

containers:
  - name: bigquery
    image: 'acryldata/datahub-ingestion:{{ DATAHUB_INGESTION_VERSION | default('v0.14.1') }}'
    imagePullPolicy: Always
    args: ["ingest", "-c", "/recipes/bq-profile-recipe.yml"]
    command: [ "sh", "-c", "pip install acryl-datahub[bigquery,cloud,debug]==0.14.1.3 && python -m pip install memray && datahub --debug ingest -c /recipes/bq-profile-recipe.yml" ]

That pulls the required version and starts the process as usual, but we see a new issue, the same one described here: https://github.com/datahub-project/datahub/issues/11679

failed to write record with workunit urn:li:container:631fc658010d79ac8c2b5161ec400174-browsePathsV2 with 
('Unable to emit metadata to DataHub GMS, likely because the server version is too old relative to the client: Failed to validate record with class com.linkedin.container.ContainerProperties: 
ERROR :: /env :: unrecognized field found but not allowed\n', {'exceptionClass': 

Is there any workaround?

vejeta (Nov 27 '24)

I ran it with 0.14.1.12 and it ran without the metadata error, but the process was killed due to OOM again. I will send the memray memory dump next.

vejeta (Dec 13 '24)

@vejeta Sorry I missed your earlier message here

Re workaround - you can set the DATAHUB_INCLUDE_ENV_IN_CONTAINER_PROPERTIES=false env variable - it's a bit of a hack we put in to work around that exact issue. https://github.com/datahub-project/datahub/blob/07034caf09a930ad39fc2a594d1503bdf6c34ed8/metadata-ingestion/src/datahub/emitter/mcp_builder.py#L39-L47

I was optimistic on those fixes improving this situation, but I guess not. Both logs and the memory dump from the latest run would be super useful - thanks

hsheth2 (Dec 13 '24)
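Applied to the container spec used earlier in this thread, the workaround is just an environment variable on the ingestion container; a sketch (image tag and recipe path as used above):

    containers:
      - name: bigquery
        image: 'acryldata/datahub-ingestion:v0.14.1'
        imagePullPolicy: Always
        env:
          # Workaround from the comment above: stop emitting the `env` field in
          # container properties so an older GMS does not reject the record.
          - name: DATAHUB_INCLUDE_ENV_IN_CONTAINER_PROPERTIES
            value: "false"
        args: ["ingest", "-c", "/recipes/bq-recipe.yml"]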

Thanks @hsheth2, no worries, all the hints are really appreciated.

I attach the latest flamegraph, with 0.14.1 on the backend (GMS) and 0.14.1.12 on the Python datahub CLI:

memray-flamegraph-bigquery-2024_12_13-01_16_09-mweoqb.html.gz

vejeta (Dec 13 '24)

Weirdly, I don't really see any issues in this flamegraph. The peak memory usage is reported as 1.3 GB, and most of it is consumed by network IO.

[flamegraph screenshot]

hsheth2 (Dec 13 '24)

I see a lot of network issues in this last run; they also appear in the logs, so we could have experienced a network outage during this run. I will repeat the process. I might need some time, as I am on vacation. Thanks!

vejeta (Dec 15 '24)

Sorry for the delay @hsheth2,

I had the chance to perform another run. As a recap: our GMS is on 0.14.1, and the cronjob works just fine with 0.12.0, without OOM issues.

I performed another run with memray enabled, using CLI 0.15.0.4:

command: [ "sh", "-c", "mkdir -p /opt/datahub && python -m pip install memray && pip install acryl-datahub[bigquery,cloud,debug]==0.15.0.4 && datahub --debug ingest -c /recipes/bq-profile-recipe.yml" ]

I couldn't use 0.15.0.5 as the client because it complained about:

ImportError: cannot import name 'IcebergCatalogInfoClass' from 'acryl_datahub_cloud.metadata.schema_classes' (/datahub-ingestion/.venv/lib/python3.10/site-packages/acryl_datahub_cloud/metadata/schema_classes.py)

I attach the latest flamegraph and logs:

memray-flamegraph-bigquery-2025_02_18-13_22_36-5jmajx.html.gz bigquery_18000.log.gz

Edit:

Adding a couple of graphs and screenshots taken during the execution. The memory behaviour seems to follow a different curve than previously, although the result is the same: the pod is killed with error code 137.

[memory usage graphs and screenshots from the run]

vejeta (Feb 19 '25)

We recently migrated to DataHub 1.0.0, both the GMS and the datahub CLI; the error is still present, and the OOM appears on this specific job.

vejeta (Jun 19 '25)

The issue keeps happening in DataHub versions 1.2.0 and 1.2.0.1.

vejeta (Sep 16 '25)