datahub
datahub copied to clipboard
Incorrect input dataset platform for Redash Presto charts
Describe the bug
When using the Redash Connector (Datahub v0.8.21) with parse_table_names_from_sql: true
, the platform value used to construct the upstream dataset URN does not take into account the data catalog type (e.g. Hive) when a dashboard or chart is pulling from a Presto data source.
To Reproduce Steps to reproduce the behavior:
- Run the ingestion Redash recipe with
parse_table_names_from_sql: true
. The Redash cluster should be configured where Presto is the query engine and Hive is the underlying data catalog. - Once ingestion has finished, go to one of the ingested Redash charts from the Datahub UI and click on "Inputs". The platform for the input datasets is labeled as "Presto", but should actually be "Hive".
Expected behavior
When ingesting dashboards and charts from Redash with parse_table_names_from_sql: true
, if the data source type is Presto
, then check what is the underlying data catalog (e.g. Hive) to be used as the platform value when constructing the upstream dataset URN.
Sample JSON response from Redash's /api/data_sources/
API endpoint being called here.
{
"scheduled_queue_name": "scheduled_queries",
"name": "presto-prod",
"pause_reason": null, "queue_name": "queries",
"syntax": "sql",
"paused": 0,
"options": {
"catalog": **"hive"**,
"schema": "default",
"protocol": "http",
"host": "presto-prod.abc.com",
"port": 8889
},
"groups": {"2": false},
"type": **"presto"**,
"id": 1
}
In the example above, the upstream dataset platform should be set to hive
and not presto
.
Hi @taufiqibrahim! Would you mind taking a look at this one?
Hi @maggiehays , okay let me take a look on this one.
Hi @Eric-Xu @maggiehays
I've done some checks with Presto, and learn some concepts about Presto Catalog.
Basically, a catalog references a data source via a connector as defined here. So, it's just basically a name.
For Hive catalog/connector, the common default connector name is hive
and stored in /etc/catalog/hive.properties
configuration file inside Presto server. It's also allow us to have multiple Hive clusters as explained here.
So, if we have Hive connector/catalog using name other than hive
, for example sales
like describe on Presto doc above, it will be impossible to guess the platform correctly.
Any ideas about that?
Hi @taufiqibrahim! We recently saw this with our Metabase connector as well - take a look at how it was resolved in this PR
Hi @Eric-Xu & @taufiqibrahim - let me know if this is a fix you're able to push!
This issue is stale because it has been open for 30 days with no activity. If you believe this is still an issue on the latest DataHub release please leave a comment with the version that you tested it with. If this is a question/discussion please head to https://slack.datahubproject.io. For feature requests please use https://feature-requests.datahubproject.io