datahub icon indicating copy to clipboard operation
datahub copied to clipboard

Incorrect input dataset platform for Redash Presto charts

Open Eric-Xu opened this issue 2 years ago • 6 comments

Describe the bug When using the Redash Connector (Datahub v0.8.21) with parse_table_names_from_sql: true, the platform value used to construct the upstream dataset URN does not take into account the data catalog type (e.g. Hive) when a dashboard or chart is pulling from a Presto data source.

To Reproduce Steps to reproduce the behavior:

  1. Run the ingestion Redash recipe with parse_table_names_from_sql: true. The Redash cluster should be configured where Presto is the query engine and Hive is the underlying data catalog.
  2. Once ingestion has finished, go to one of the ingested Redash charts from the Datahub UI and click on "Inputs". The platform for the input datasets is labeled as "Presto", but should actually be "Hive".

Expected behavior When ingesting dashboards and charts from Redash with parse_table_names_from_sql: true, if the data source type is Presto, then check what is the underlying data catalog (e.g. Hive) to be used as the platform value when constructing the upstream dataset URN.

Sample JSON response from Redash's /api/data_sources/ API endpoint being called here.

{
    "scheduled_queue_name": "scheduled_queries",
    "name": "presto-prod",
    "pause_reason": null, "queue_name": "queries",
    "syntax": "sql",
    "paused": 0,
    "options": {
        "catalog": **"hive"**,
        "schema": "default",
        "protocol": "http",
        "host": "presto-prod.abc.com",
        "port": 8889
    }, 
    "groups": {"2": false}, 
    "type": **"presto"**,
    "id": 1
}

In the example above, the upstream dataset platform should be set to hive and not presto.

Eric-Xu avatar Jan 06 '22 20:01 Eric-Xu

Hi @taufiqibrahim! Would you mind taking a look at this one?

maggiehays avatar Jan 10 '22 19:01 maggiehays

Hi @maggiehays , okay let me take a look on this one.

taufiqibrahim avatar Jan 11 '22 07:01 taufiqibrahim

Hi @Eric-Xu @maggiehays

I've done some checks with Presto, and learn some concepts about Presto Catalog.

Basically, a catalog references a data source via a connector as defined here. So, it's just basically a name.

For Hive catalog/connector, the common default connector name is hive and stored in /etc/catalog/hive.properties configuration file inside Presto server. It's also allow us to have multiple Hive clusters as explained here.

So, if we have Hive connector/catalog using name other than hive, for example sales like describe on Presto doc above, it will be impossible to guess the platform correctly.

Any ideas about that?

taufiqibrahim avatar Jan 11 '22 15:01 taufiqibrahim

Hi @taufiqibrahim! We recently saw this with our Metabase connector as well - take a look at how it was resolved in this PR

maggiehays avatar Jan 12 '22 20:01 maggiehays

Hi @Eric-Xu & @taufiqibrahim - let me know if this is a fix you're able to push!

maggiehays avatar Feb 22 '22 20:02 maggiehays

This issue is stale because it has been open for 30 days with no activity. If you believe this is still an issue on the latest DataHub release please leave a comment with the version that you tested it with. If this is a question/discussion please head to https://slack.datahubproject.io. For feature requests please use https://feature-requests.datahubproject.io

github-actions[bot] avatar Sep 15 '22 02:09 github-actions[bot]