OOM (out-of-memory) errors in DataHub 0.13+ when ingesting BigQuery metadata
Describe the bug
We have an ingestion job running periodically in Kubernetes; it runs fine with DataHub 0.12.x versions.
You can see that memory stays stable under 1 GiB during the execution of the job.
However, with DataHub 0.13.x versions it always fails with error 137, out of memory.
We have tried increasing the memory to 20 GiB, but there must be a memory leak, because it always runs out of memory.
To Reproduce
Steps to reproduce the behavior:
- Kubernetes CronJob set up
- 2Gi memory
- 2 CPU
- Our deployment:
- name: bigquery
  image: 'acryldata/datahub-ingestion:v0.13.3'
  imagePullPolicy: Always
  args: ["ingest", "-c", "/recipes/bq-recipe.yml"]
Note: we have also tried the latest release built from the latest commit on master, and the issue is still present.
- Our BigQuery recipe:
source:
  type: bigquery
  config:
    project_on_behalf: {{ bq_slots_project }}
    project_id_pattern:
      allow:
        - .*{{ gcp_project }}
    dataset_pattern:
      allow:
        - {{ profile_dataset }}
      deny:
        - ^temp_.*
        - .*_temp$
        - .*_temp_.*
        - .*-temp.*
        - .*temporary.*
    use_exported_bigquery_audit_metadata: true
    bigquery_audit_metadata_datasets:
      - {{ gcp_project }}.bigquery_audit_log
    use_date_sharded_audit_log_tables: true
    upstream_lineage_in_report: true
    include_usage_statistics: true
    capture_table_label_as_tag: true
    capture_dataset_label_as_tag: true
    extract_column_lineage: true
    convert_urns_to_lowercase: true
    profiling:
      enabled: "true"
      profile_table_size_limit: null
      profile_table_row_limit: null
      use_sampling: false
      partition_profiling_enabled: false
      include_field_mean_value: false
      include_field_median_value: false
      include_field_sample_values: false
      include_field_stddev_value: false
    stateful_ingestion:
      enabled: true
      state_provider:
        type: "datahub"
        config:
          datahub_api:
            server: "http://our-gms:8080"
pipeline_name: {{ pipeline }}
sink:
  type: "datahub-rest"
  config:
    server: "http://our-gms:8080"
Expected behavior
Not having an out-of-memory error in DataHub 0.13.3.
Desktop (please complete the following information):
- OS: Google Cloud GKE/Kubernetes
- Version: DataHub 0.13.x
Additional context
Logs with a summary from a successful execution with DataHub 0.12.x:
'num_total_lineage_entries': {'host': 108696},
'num_skipped_lineage_entries_missing_data': {'host': 14177},
'num_skipped_lineage_entries_not_allowed': {'host': 90366},
'num_lineage_entries_sql_parser_failure': {'host': 147},
'num_skipped_lineage_entries_other': {},
'num_lineage_total_log_entries': {'*': 108696},
'num_lineage_parsed_log_entries': {'*': 108696},
'lineage_metadata_entries': {'*': 99},
'num_usage_total_log_entries': {'*': 175145},
'num_usage_parsed_log_entries': {'*': 122613},
'sampled': '10 sampled of at most 2880 entries.'},
'total_query_log_entries': 0,
'audit_log_api_perf': {'get_exported_log_entries': '397.101 seconds', 'list_log_entries': None},
'num_total_lineage_entries': {'host': 108696},
'num_skipped_lineage_entries_missing_data': {'host': 14177},
'num_skipped_lineage_entries_not_allowed': {'host': 90366},
'num_lineage_entries_sql_parser_failure': {'host': 147},
'num_skipped_lineage_entries_other': {},
'num_lineage_total_log_entries': {'*': 108696},
'num_lineage_parsed_log_entries': {'*': 108696},
'lineage_metadata_entries': {'*': 99},
'num_usage_total_log_entries': {'*': 175145},
'num_usage_parsed_log_entries': {'*': 122613},
-----
'sampled': '10 sampled of at most 3059 entries.'},
'partition_info': {},
----
'lineage_start_time': '2024-08-10 00:00:00+00:00 (1 day, 3 hours and 26 minutes ago)',
'lineage_end_time': '2024-08-11 00:00:08.942642+00:00 (3 hours, 26 minutes and 30.95 seconds ago)',
'stateful_lineage_ingestion_enabled': "default=True description='Enable stateful lineage ingestion. This will store lineage window timestamps after "
"successful lineage ingestion. and will not run lineage ingestion for same timestamps in subsequent run. '"
'extra={}',
'usage_start_time': '2024-08-10 00:00:00+00:00 (1 day, 3 hours and 26 minutes ago)',
'usage_end_time': '2024-08-11 00:00:08.942642+00:00 (3 hours, 26 minutes and 30.95 seconds ago)',
'stateful_usage_ingestion_enabled': True,
'start_time': '2024-08-11 00:00:08.984329 (3 hours, 26 minutes and 30.91 seconds ago)',
'running_time': '3 hours, 26 minutes and 30.91 seconds'}
Sink (datahub-rest) report:
{'total_records_written': 33546,
'records_written_per_second': 2,
'warnings': [],
'failures': [],
'start_time': '2024-08-11 00:00:03.217296 (3 hours, 26 minutes and 36.68 seconds ago)',
'current_time': '2024-08-11 03:26:39.897462 (now)',
'total_duration_in_seconds': 12396.68,
'gms_version': 'v0.13.0',
'pending_requests': 0}
{}
Pipeline finished with at least 5 warnings; produced 33546 events in 3 hours, 26 minutes and 30.91 seconds.
Same behaviour in 0.14.x (tested with 0.14.0.1 and 0.14.0.2).
@vejeta could you generate a memory profile - see https://datahubproject.io/docs/metadata-ingestion/docs/dev_guides/profiling_ingestions/
Thanks, @hsheth2 for the suggestion! I will do that with the latest version (0.14.1)
Just out of curiosity, to see if this is easier: is there a published Docker image with the memray dependency included? After enabling memory profiling in my recipe, I am getting:
ERROR {datahub.entrypoints:218} - Command failed: No module named 'memray'
If that doesn't exist, no worries; I will test it with our own Docker image with memray included.
memray is installed when you install the datahub debug sub-package: https://github.com/datahub-project/datahub/blob/cf1d2966da418535ab53826fa51c73ac51290f09/metadata-ingestion/setup.py#L895
Check these docs on how to use it: https://datahubproject.io/docs/metadata-ingestion/docs/dev_guides/profiling_ingestions/#how-to-use (opt for the CLI option).
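In practice, inside the ingestion container this roughly amounts to installing the source plugin together with the debug extra (a sketch; the exact set of extras you need beyond bigquery may vary):
pip install 'acryl-datahub[bigquery,debug]'
# sanity check that memray is now importable in the same environment
python -c "import memray; print(memray.__version__)"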
@hsheth2, @pedro93 I have generated the file with the memory profiling. It is 801 MB in size.
Shall I provide the output of running memray over it somehow? What output can be more useful?
For starters, if you could run memray flamegraph file.bin to generate an html file, that'd be helpful. Hopefully that gives us enough info, but we might need the full file if there isn't enough in the flamegraph.
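For example, assuming the capture file produced by the profiler is called bigquery-run.bin (the filename here is just a placeholder):
# generates an HTML flamegraph next to the capture file
memray flamegraph bigquery-run.bin
# optional: a quick textual summary of the allocation statistics
memray stats bigquery-run.bin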
Also, we have separately been working on a BigQuery lineage/usage v2 implementation (enabled with use_queries_v2). If the bottleneck is from that, we'll probably ask you to enable the new version, since it is a bit more optimized with regard to runtime and memory usage.
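For reference, a sketch of how that flag would presumably be switched on in the recipe shared above (only the use_queries_v2 name comes from this thread; its placement under the source config is an assumption):
source:
  type: bigquery
  config:
    use_queries_v2: true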
@hsheth2 I see. Thanks a lot. This has been generated with 0.13.3.
I tried with 0.14.1, but our GMS is not on 0.14.x yet, so there are some incompatibilities when sending the metadata.
Hopefully it is enough to locate the bottleneck.
If use_queries_v2 requires 0.14.x, we will need to perform an upgrade first, which will take a bit of time.
Attached the flamegraph:
@vejeta looking at the flamegraph, it looks like the OOM is being caused by our SQL parser.
If that's the case, use_queries_v2 probably won't help. We have made a few tweaks to the SQL parser since 0.13.3 that might help, but I'm not too confident about that.
If you could run with datahub --debug ingest ..., the last ~1000 lines or so of the log should have some additional details about the exact SQL query that we're having issues with.
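One way to capture that is to tee the debug output to a file and keep only the tail (the log paths here are illustrative; the recipe path matches the one used earlier in this thread):
datahub --debug ingest -c /recipes/bq-recipe.yml 2>&1 | tee /tmp/bq-ingest-debug.log
tail -n 1000 /tmp/bq-ingest-debug.log > /tmp/bq-ingest-debug-tail.log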
Thanks a lot @hsheth2, here are the last 4000 lines with --debug.
I had to "anonymize" the field names; hopefully it can still give a clue about what introduced the memory leak between 0.12.x and 0.13+.
Hi, we are facing the same issue. All the info is in ticket https://github.com/datahub-project/datahub/issues/11597, and I also attach the anonymized logs captured in debug mode: acryl.log
Now that https://github.com/datahub-project/datahub/pull/11432 and #11645 are both part of 0.14.1.3 - would you guys mind giving this another try?
Thanks a lot @hsheth2 for the quick feedback on this issue. I am going to work on migrating from 0.13.x to the latest 0.14.x so I can provide feedback on this.
Will the tag 0.14.1.3 be published soon?
That's the CLI version, which is published here https://pypi.org/project/acryl-datahub/0.14.1.3/
We have upgraded to 0.14.1.
I have tried the datahub CLI version that you recommended, using this in our BigQuery ingestion job:
containers:
  - name: bigquery
    image: 'acryldata/datahub-ingestion:{{ DATAHUB_INGESTION_VERSION | default('v0.14.1') }}'
    imagePullPolicy: Always
    args: ["ingest", "-c", "/recipes/bq-profile-recipe.yml"]
    command: [ "sh", "-c", "pip install acryl-datahub[bigquery,cloud,debug]==0.14.1.3 && python -m pip install memray && datahub --debug ingest -c /recipes/bq-profile-recipe.yml" ]
That pulls the required version and starts the process as usual, but we see a new issue, the same one described here: https://github.com/datahub-project/datahub/issues/11679
failed to write record with workunit urn:li:container:631fc658010d79ac8c2b5161ec400174-browsePathsV2 with
('Unable to emit metadata to DataHub GMS, likely because the server version is too old relative to the client: Failed to validate record with class com.linkedin.container.ContainerProperties:
ERROR :: /env :: unrecognized field found but not allowed\n', {'exceptionClass':
Is there any workaround?
I ran it with 0.14.1.12 and it ran without the metadata error, but the process was killed due to OOM again. I will send the memory dump generated with memray next.
@vejeta Sorry I missed your earlier message here
Re workaround - you can set the DATAHUB_INCLUDE_ENV_IN_CONTAINER_PROPERTIES=false env variable - it's a bit of a hack we put in to work around that exact issue.
https://github.com/datahub-project/datahub/blob/07034caf09a930ad39fc2a594d1503bdf6c34ed8/metadata-ingestion/src/datahub/emitter/mcp_builder.py#L39-L47
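For example, in the sh -c command pattern used above, that could look like this (a sketch; only the variable name and value come from the comment above):
export DATAHUB_INCLUDE_ENV_IN_CONTAINER_PROPERTIES=false
datahub --debug ingest -c /recipes/bq-profile-recipe.yml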
I was optimistic about those fixes improving this situation, but I guess not. Both the logs and the memory dump from the latest run would be super useful - thanks
Thanks @hsheth2, no worries, all the hints are really appreciated.
I attach the latest flamegraph, with 0.14.1 on the backend (GMS) and 0.14.1.12 on the Python datahub CLI:
memray-flamegraph-bigquery-2024_12_13-01_16_09-mweoqb.html.gz
Weirdly, I don't really see any issues in this flamegraph. The peak memory usage is reported as 1.3 GB, and most of it is consumed by network IO.
I see a lot of network issues in this last run; they also appear in the logs, so we could have experienced a network outage during this run. I will repeat the process. I might need some time, as I am on vacation. Thanks!
Sorry for the delay @hsheth2,
I had the chance to perform another run. As a recap: our GMS is on 0.14.1, and the cronjob runs just fine with 0.12.0, without OOM issues.
I performed another run with memray enabled, using 0.15.0.4:
command: [ "sh", "-c", "mkdir -p /opt/datahub && python -m pip install memray && pip install acryl-datahub[bigquery,cloud,debug]==0.15.0.4 && datahub --debug ingest -c /recipes/bq-profile-recipe.yml" ]
I couldn't use 0.15.0.5 as the client because it complained about:
ImportError: cannot import name 'IcebergCatalogInfoClass' from 'acryl_datahub_cloud.metadata.schema_classes' (/datahub-ingestion/.venv/lib/python3.10/site-packages/acryl_datahub_cloud/metadata/schema_classes.py)
I attach the latest flamegraph and logs:
memray-flamegraph-bigquery-2025_02_18-13_22_36-5jmajx.html.gz bigquery_18000.log.gz
Edit:
Adding a couple of graphs and screenshots taken during the execution. The memory behaviour shows a different curve than previously, although the result is the same: the pod is killed with error code 137.
We recently migrated to DataHub 1.0.0, both on the GMS and the datahub CLI; the error is still present, and the OOM appears on this specific job.
The issue keeps happening with DataHub versions 1.2.0 and 1.2.0.1.