kedro-plugins

External tables support for SparkHiveDataSet

Open DebanjanBanerjeeQB opened this issue 3 years ago • 3 comments

Description

SparkHiveDataSet does not currently support external Hive tables. External tables are often needed when the organization's database lives outside Hive but the table still has to be hosted in Hive. More information is available at: https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.5/using-hiveql/content/hive_create_an_external_table.html

Context

This will broaden the scope of Hive datasets. Right now, any externally managed Hive table has to be referenced through a custom dataset, and this happens quite often.

Possible Implementation

The implementation should be simple. The user needs to specify the keyword "EXTERNAL" in the DDL and provide a path for the table. Both can be supplied through the catalog. Based on this input, the dataset should be able to decide internally how to proceed and load or save data accordingly.
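
To make the idea concrete, here is a minimal sketch of how the DDL could be assembled from catalog-style options. The helper name `build_create_table_ddl` and its parameters (`external`, `location`, `file_format`) are hypothetical and are not part of the existing SparkHiveDataSet API; the sketch only shows that the "EXTERNAL" keyword and a `LOCATION` clause are the two additions needed.

```python
from typing import Dict, Optional


def build_create_table_ddl(
    database: str,
    table: str,
    columns: Dict[str, str],
    external: bool = False,
    location: Optional[str] = None,
    file_format: str = "PARQUET",
) -> str:
    """Build a Hive CREATE TABLE statement.

    Hypothetical helper for illustration only; not kedro-datasets API.
    """
    # An external table is only meaningful with an explicit storage path.
    if external and not location:
        raise ValueError("An external table requires a location")

    keyword = "EXTERNAL TABLE" if external else "TABLE"
    cols = ", ".join(f"`{name}` {dtype}" for name, dtype in columns.items())
    ddl = (
        f"CREATE {keyword} IF NOT EXISTS `{database}`.`{table}` "
        f"({cols}) STORED AS {file_format}"
    )
    # Only external tables get a LOCATION clause in this sketch.
    if external:
        ddl += f" LOCATION '{location}'"
    return ddl
```

The resulting string could then be passed to `spark.sql(...)` by the dataset before loading or saving; whether the real dataset should build DDL this way is left open here.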

Possible Alternatives

Accessing the Hive table via HQL. This again requires a custom dataset (for example, a HiveQueryDataSet) that can access the metastore and run the query, which is a bit slow.

DebanjanBanerjeeQB avatar Nov 17 '22 15:11 DebanjanBanerjeeQB

Thanks for the suggestion @DebanjanBanerjeeQB! We would very much welcome a contribution for this. Since this is a datasets-related issue, please add any contributions in the new datasets repo: https://github.com/kedro-org/kedro-plugins/tree/main/kedro-datasets

merelcht avatar Nov 29 '22 18:11 merelcht

Hey @merelcht, I would like to take this up.

MinuraPunchihewa avatar Oct 01 '24 04:10 MinuraPunchihewa

Thanks @MinuraPunchihewa ! Just go ahead and create a PR whenever you're ready 🙂

merelcht avatar Oct 01 '24 08:10 merelcht