Add support for Hive

Open yanghua opened this issue 2 years ago • 5 comments

Describe the Feature: Apache Hive is still the standard for data warehousing in the open-source Hadoop ecosystem. Can we support interaction with Hive?

Would you like to contribute? yes

Anything Else? no

yanghua • May 18 '22 09:05

Hi @yanghua , thanks for filing this! In principle adding support for Hive should be straightforward since we support dialect-specific SQL rendering, so we'll be able to have a Hive renderer that generates fully correct HiveQL.

In practice, things might get tricky in two particular areas:

  1. Automated testing - we'd like to be able to run our test suite against a Hive instance. Ideally this would be a local instance rather than a remote, as that tends to be much faster where available, but I don't know much about the current state of local execution for Hive.
  2. Package dependency management - hopefully this isn't an issue, but we'd prefer to avoid pulling in lots of Hive-specific dependencies in our base Metricflow package since we currently build a monolithic package for all deployments.

For the first item, if you're interested in contributing it'd be great if you could investigate this.

For the second, I think we'll need to see what all Hive pulls in and make a decision about whether to have Hive support live in a fork (or separate extension of some kind) once we know more about that.
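To make the dialect-specific rendering idea above a bit more concrete, here is a minimal sketch of how a Hive renderer might override only the dialect-specific pieces of a shared base renderer. The class and method names (`DefaultRenderer`, `HiveRenderer`, `quote_identifier`, `render_select`) are hypothetical and are not MetricFlow's actual API; the only HiveQL behaviour shown is backtick identifier quoting.

```python
from typing import Optional, Sequence


class DefaultRenderer:
    """Hypothetical base renderer that double-quotes identifiers."""

    def quote_identifier(self, name: str) -> str:
        # Double-quoted identifiers; an embedded quote is escaped by doubling it.
        return '"' + name.replace('"', '""') + '"'

    def render_select(
        self, table: str, columns: Sequence[str], limit: Optional[int] = None
    ) -> str:
        # Shared rendering logic; dialects customise it through quote_identifier.
        cols = ", ".join(self.quote_identifier(c) for c in columns)
        sql = f"SELECT {cols} FROM {self.quote_identifier(table)}"
        if limit is not None:
            sql += f" LIMIT {int(limit)}"
        return sql


class HiveRenderer(DefaultRenderer):
    """Hypothetical Hive renderer: overrides only what differs in HiveQL."""

    def quote_identifier(self, name: str) -> str:
        # HiveQL quotes identifiers with backticks. Escaping an embedded
        # backtick by doubling it is an assumption of this sketch.
        return "`" + name.replace("`", "``") + "`"


print(DefaultRenderer().render_select("orders", ["id", "amount"], limit=5))
print(HiveRenderer().render_select("orders", ["id", "amount"], limit=5))
```

A real renderer would have to cover far more than quoting (functions, date handling, type names), but the same override pattern would apply.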

tlento • May 24 '22 18:05

Hi @tlento , sorry for the late reply. I have been busy recently.

  1. Automated testing - we'd like to be able to run our test suite against a Hive instance. Ideally this would be a local instance rather than a remote, as that tends to be much faster where available, but I don't know much about the current state of local execution for Hive.

For the first question, we can use Docker to manage Hive's runtime, since Hive depends on the Hadoop ecosystem. We can also try testcontainers to tie the lifecycle of the container to the lifecycle of the unit tests.
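A rough sketch of that approach in Python, assuming the `testcontainers` package and the `apache/hive` Docker image (started with `SERVICE_NAME=hiveserver2`, listening on port 10000). The image tag and the helper names `hive_url` and `hive_server` are illustrative assumptions, not anything that exists in MetricFlow.

```python
from contextlib import contextmanager
from typing import Iterator, Tuple

HIVE_IMAGE = "apache/hive:4.0.0"  # assumed tag; pin whichever version is targeted
HIVESERVER2_PORT = 10000  # default HiveServer2 Thrift port


def hive_url(host: str, port: int, database: str = "default") -> str:
    """Build a SQLAlchemy-style URL of the form used by the PyHive dialect."""
    return f"hive://{host}:{port}/{database}"


@contextmanager
def hive_server() -> Iterator[Tuple[str, int]]:
    """Start HiveServer2 in Docker, yield (host, port), stop it on exit."""
    # Imported lazily so this module loads even without testcontainers installed.
    from testcontainers.core.container import DockerContainer

    container = (
        DockerContainer(HIVE_IMAGE)
        .with_env("SERVICE_NAME", "hiveserver2")
        .with_exposed_ports(HIVESERVER2_PORT)
    )
    with container:
        # A real fixture would also need a readiness check here: HiveServer2
        # takes a while to accept connections after the container starts.
        host = container.get_container_host_ip()
        port = int(container.get_exposed_port(HIVESERVER2_PORT))
        yield host, port
```

Wrapping `hive_server()` in a session-scoped test fixture would start the container once per test run and tear it down afterwards, which is the lifecycle binding described above.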

  2. Package dependency management - hopefully this isn't an issue, but we'd prefer to avoid pulling in lots of Hive-specific dependencies in our base Metricflow package since we currently build a monolithic package for all deployments.

For the second question: yes, package management is an issue we need to figure out. Maybe we could split the engine integrations into separate modules in the future? For example, release metricflow-core, metricflow-extend, or something like metricflow-engine-xxx.

WDYT?

yanghua • Jul 04 '22 10:07

For the first question, we can use Docker to manage Hive's runtime, since Hive depends on the Hadoop ecosystem. We can also try testcontainers to tie the lifecycle of the container to the lifecycle of the unit tests.

Oh, so we'd have a Docker instance we communicate with as if it were a deployed service. That makes sense. We have something similar for Postgres now.

Maybe we could split the engine integrations into separate modules in the future? For example, release metricflow-core, metricflow-extend, or something like metricflow-engine-xxx.

This is a possibility. There's another issue about splitting out engine-specific dependencies more generally (see #84), so we're on board with the idea as long as the installation path is straightforward for common cases. We haven't taken the time to think about how to properly manage splitting out those sub-packages and making them available. I think our biggest concern is dependency conflicts (for example, if the Hive engine build has to be pinned to a version of SQLAlchemy that metricflow-core doesn't use). But it's a reasonable place to start.
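One way the split could look from the core package's side is a lazy engine loader, so that Hive dependencies are only imported when the Hive engine is actually requested. This is a sketch under stated assumptions: the module name `metricflow_engine_hive`, the extra `metricflow[hive]`, and the `load_engine` helper are all hypothetical.

```python
import importlib
from types import ModuleType
from typing import Dict, Tuple

# Hypothetical registry: engine name -> (module to import, pip extra that provides it).
ENGINE_MODULES: Dict[str, Tuple[str, str]] = {
    "hive": ("metricflow_engine_hive", "metricflow[hive]"),
}


def load_engine(name: str) -> ModuleType:
    """Import an engine module on demand, with an install hint if it is missing."""
    if name not in ENGINE_MODULES:
        raise ValueError(f"Unknown engine: {name!r}")
    module_name, extra = ENGINE_MODULES[name]
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # The core package stays importable; only this engine is unavailable.
        raise ImportError(
            f"Engine {name!r} needs an optional package; try: pip install \"{extra}\""
        ) from exc
```

This keeps the common installation path unchanged, although it does not by itself resolve version conflicts between an engine package and the core package.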

tlento • Jul 08 '22 23:07

Hi @tlento thanks for your thoughts. It seems it's valuable to explore the solution at least, right?

So I will start this work. IMO, when I have a demo, we could discuss it more explicitly.

WDYT?

yanghua • Jul 09 '22 02:07

Sounds great, thanks!

tlento • Jul 14 '22 01:07

Hi @callum-mcdata, can you tell me where the implementation for Apache Hive is?

yanghua • Apr 23 '23 16:04

Hi @yanghua there is no current implementation for Apache Hive.

We are doing some spring cleaning of issues as we move forward with our integration of dbt (more news to come) and there is a fair chance that the mechanisms for data warehouse connections will change. Given our desire to scope down the open issues to things that we'd want people to work on, we're closing out this issue for now. If this becomes a priority later on, we'll re-open the issue then!

callum-mcdata • Apr 23 '23 19:04