Delta Lake connector inside Docker fails to access HDFS with HA enabled, and profiler ingestion fails
Affected module: Ingestion Framework
Describe the bug
I'm using OpenMetadata built from source on the 0.11.0-release branch. When I tried ingesting metadata through the Delta Lake connector from our Hadoop cluster with HDFS HA enabled, the connector could not resolve our cluster's dfs.nameservices, and there was nowhere to set the required Hadoop configurations, so the ingestion task failed.
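For context, below is a minimal sketch of the client-side Hadoop HA settings the connector's Spark session would need but currently has no way to receive. The nameservice masters matches the logs below; the NameNode host names and ports are hypothetical placeholders.
```python
# Minimal sketch (assumed values): the client-side HDFS HA settings a
# Spark session needs to resolve the logical nameservice "masters".
# NameNode hosts nn1.example.com / nn2.example.com are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("deltalake-hdfs-ha-example")
    # Logical nameservice instead of a single NameNode host
    .config("spark.hadoop.dfs.nameservices", "masters")
    .config("spark.hadoop.dfs.ha.namenodes.masters", "nn1,nn2")
    .config("spark.hadoop.dfs.namenode.rpc-address.masters.nn1", "nn1.example.com:8020")
    .config("spark.hadoop.dfs.namenode.rpc-address.masters.nn2", "nn2.example.com:8020")
    # Resolves which NameNode is currently active for the nameservice
    .config(
        "spark.hadoop.dfs.client.failover.proxy.provider.masters",
        "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider",
    )
    .getOrCreate()
)

# Without these settings, Spark treats "masters" in hdfs://masters/... as a
# literal host name and fails with java.net.UnknownHostException: masters.
```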
To Reproduce
- Run the commands below to build OpenMetadata and bring up all containers:
```
make env38
make install_dev
make install_test precommit_install
make install_antlr_cli
make generate
./docker/run_local_docker.sh
```
- Add a new database service and choose the Delta Lake connector
- Add a metadata ingestion task
- Logs of Airflow:
```
[2022-07-15 07:15:35,925] {deltalake.py:64} INFO - Establishing Sparks Session
[2022-07-15 07:15:41,333] {database_service.py:164} INFO - Scanned [FJXNY.u_user]
[2022-07-15 07:15:41,825] {metadata_rest.py:302} INFO - Successfully ingested table default.u_user
[2022-07-15 07:15:41,826] {database_service.py:164} INFO - Scanned [FJXNY.u_user_alluxio]
[2022-07-15 07:15:42,094] {metadata_rest.py:302} INFO - Successfully ingested table default.u_user_alluxio
[2022-07-15 07:15:46,414] {database_service.py:164} INFO - Scanned [FJXNY.ads_t_attachment]
[2022-07-15 07:15:46,977] {deltalake.py:150} ERROR - java.net.UnknownHostException: masters
[2022-07-15 07:15:46,983] {database_service.py:164} INFO - Scanned [FJXNY.ads_t_company]
[2022-07-15 07:15:47,206] {deltalake.py:150} ERROR - java.net.UnknownHostException: masters
[2022-07-15 07:15:47,210] {database_service.py:164} INFO - Scanned [FJXNY.ads_t_company_machine_status]
[2022-07-15 07:15:47,429] {deltalake.py:150} ERROR - java.net.UnknownHostException: masters
[2022-07-15 07:15:47,434] {database_service.py:164} INFO - Scanned [FJXNY.ads_t_company_power_generation]
[2022-07-15 07:15:47,654] {deltalake.py:150} ERROR - java.net.UnknownHostException: masters
[2022-07-15 07:15:47,658] {database_service.py:164} INFO - Scanned [FJXNY.ads_t_company_power_plan_config]
[2022-07-15 07:15:47,880] {deltalake.py:150} ERROR - java.net.UnknownHostException: masters
```
- Logs of the Docker container:
```
22/07/15 07:15:46 WARN FileSystem: Failed to initialize fileystem hdfs://masters/dwh/ads/ads_t_attachment/_delta_log: java.lang.IllegalArgumentException: java.net.UnknownHostException: masters
22/07/15 07:15:47 WARN FileSystem: Failed to initialize fileystem hdfs://masters/dwh/ads/ads_t_company/_delta_log: java.lang.IllegalArgumentException: java.net.UnknownHostException: masters
22/07/15 07:15:47 WARN FileSystem: Failed to initialize fileystem hdfs://masters/dwh/ads/ads_t_company_machine_status/_delta_log: java.lang.IllegalArgumentException: java.net.UnknownHostException: masters
22/07/15 07:15:47 WARN FileSystem: Failed to initialize fileystem hdfs://masters/dwh/ads/ads_t_company_power_generation/_delta_log: java.lang.IllegalArgumentException: java.net.UnknownHostException: masters
22/07/15 07:15:47 WARN FileSystem: Failed to initialize fileystem hdfs://masters/dwh/ads/ads_t_company_power_plan_config/_delta_log: java.lang.IllegalArgumentException: java.net.UnknownHostException: masters
```
Expected behavior
The Delta Lake connector should be able to ingest from a Hadoop cluster with HDFS HA enabled.
Version:
- OS: CentOS Linux release 7.7.1908
- Python version: Python 3.8.13
- OpenMetadata version: 0.11.0
- OpenMetadata Ingestion package version: 0.11.0
Additional context
After modifying ingestion/src/metadata/utils/connections.py, I was able to pass the needed Hadoop configurations temporarily: metadata ingestion succeeded once they were supplied through connectionArguments.
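The idea behind the patch, as a minimal sketch: forward every key/value pair given in connectionArguments into the Spark session builder, so the spark.hadoop.* entries above reach the HDFS client. The function name and the plain-dict argument here are my own illustration, not the actual connections.py code.
```python
# Minimal sketch of the workaround, assuming connectionArguments arrives
# as a plain dict (illustrative; the real connections.py differs).
from typing import Dict, Optional

from pyspark.sql import SparkSession


def build_spark_session(connection_arguments: Optional[Dict[str, str]] = None) -> SparkSession:
    builder = SparkSession.builder.appName("openmetadata-deltalake")
    # Forward each entry, e.g. {"spark.hadoop.dfs.nameservices": "masters", ...},
    # so Hadoop HA settings reach the underlying HDFS client.
    for key, value in (connection_arguments or {}).items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```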
Then I encountered a new problem: attempting to add profiler ingestion for the Delta Lake connector produces the errors below:
```
[2022-07-13 07:02:24,035] {orm_profiler.py:262} INFO - Executing profilers for FJXNY.default.presto.ads_t_attachment...
[2022-07-13 07:02:24,040] {core.py:282} ERROR - Error while running table metric for: ads_t_attachment - unhashable type: 'DeltaLakeClient'
[2022-07-13 07:02:24,042] {core.py:253} WARNING - Error trying to compute column profile for jlid - unhashable type: 'DeltaLakeClient'
[2022-07-13 07:02:24,042] {core.py:378} WARNING - Error trying to compute column profile for jlid - unhashable type: 'DeltaLakeClient'
[2022-07-13 07:02:24,044] {core.py:321} ERROR - Error computing query metric uniqueCount for ads_t_attachment.jlid - unhashable type: 'DeltaLakeClient'
[2022-07-13 07:02:24,045] {core.py:253} WARNING - Error trying to compute column profile for zt - unhashable type: 'DeltaLakeClient'
[2022-07-13 07:02:24,046] {core.py:378} WARNING - Error trying to compute column profile for zt - unhashable type: 'DeltaLakeClient'
[2022-07-13 07:02:24,046] {core.py:321} ERROR - Error computing query metric uniqueCount for ads_t_attachment.zt - unhashable type: 'DeltaLakeClient'
[2022-07-13 07:02:24,047] {core.py:253} WARNING - Error trying to compute column profile for scr - unhashable type: 'DeltaLakeClient'
[2022-07-13 07:02:24,048] {core.py:378} WARNING - Error trying to compute column profile for scr - unhashable type: 'DeltaLakeClient'
[2022-07-13 07:02:24,049] {core.py:321} ERROR - Error computing query metric uniqueCount for ads_t_attachment.scr - unhashable type: 'DeltaLakeClient'
```
However, @pmbrull has provided a very detailed clarification: the profiler currently only supports sources that can be reached via SQLAlchemy, so the team will need to flag Delta Lake so it does not show the Profiler Workflow option.
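As a side note on the error text: "unhashable type" is what Python raises when an object is used as a dict or set key while its class defines __eq__ without __hash__. A minimal illustration of the failure mode, using a stand-in class that may not match the real DeltaLakeClient:
```python
# Stand-in class, not OpenMetadata's actual code: defining __eq__ without
# __hash__ implicitly sets __hash__ to None, making instances unhashable.
class DeltaLakeClient:
    def __init__(self, nameservice: str):
        self.nameservice = nameservice

    def __eq__(self, other):
        return (
            isinstance(other, DeltaLakeClient)
            and self.nameservice == other.nameservice
        )


cache = {}
client = DeltaLakeClient("masters")
try:
    cache[client] = "profiler session"  # using the client as a dict key
except TypeError as exc:
    print(exc)  # -> unhashable type: 'DeltaLakeClient'
```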
Thanks for opening the ticket and all the details. We'll handle this asap.