
[Question] How to make this hive/presto setup query parquet on AWS S3

Open leletan opened this issue 6 years ago • 9 comments

I simply tried to add the AWS Hadoop jar to the path in each docker image, and it did not seem to work. Any advice?

Thanks in advance.

leletan avatar Jul 02 '18 18:07 leletan

I tried to create an external table:

CREATE EXTERNAL TABLE IF NOT EXISTS aaa (
event_id string,
make string)
STORED AS PARQUET
LOCATION 's3://<access_key>:<secret_key>@<some_bucket>/<some_path>';

But it ended up with:

Error: org.apache.hive.service.cli.HiveSQLException: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:Got exception: java.io.IOException /<some_path> doesn't exist)

I used the same keys to look up the files at that path in S3 with the AWS CLI, and they were there.
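
The check was along these lines (a sketch; same placeholders as above):

AWS_ACCESS_KEY_ID=<access_key> AWS_SECRET_ACCESS_KEY=<secret_key> \
  aws s3 ls s3://<some_bucket>/<some_path>/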

leletan avatar Jul 02 '18 18:07 leletan

We have a schema / external table on AWS Athena for this folder as well, and it was working fine.

leletan avatar Jul 02 '18 19:07 leletan

I have added the hadoop-aws Maven dependency:

FROM maven:3.6.1-jdk-8 AS DEPENDENCY_BUILD
ADD pom.xml pom.xml
RUN mvn dependency:copy-dependencies -DoutputDirectory=/external-jars/

FROM bde2020/hive:2.3.2-postgresql-metastore

COPY --from=DEPENDENCY_BUILD /external-jars/ /external-jars/
And pom.xml

<?xml version="1.0" encoding="UTF-8"?>

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.datainfra.hive</groupId>
    <artifactId>hive</artifactId>
    <version>1.0-SNAPSHOT</version>
    <name>hive-metastore-external-dependency</name>

    <dependencies>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-aws</artifactId>
            <version>2.7.7</version>
        </dependency>
        <dependency>
            <groupId>com.amazonaws</groupId>
            <artifactId>aws-java-sdk</artifactId>
            <version>1.7.4</version>
        </dependency>
    </dependencies>
</project>

And the environment variable

HIVE_AUX_JARS_PATH=/external-jars/

GrigorievNick avatar Aug 12 '19 19:08 GrigorievNick

Yep, rebuilding the 'bde2020/hive:2.3.2-postgresql-metastore' container with the AWS jar dependencies added, and using it for both hive-server and hive-metastore, worked pretty well for me, as @GrigorievNick recommended. Just a few additions: if you want to avoid putting S3 credentials in the table LOCATION when creating Hive tables, you need to add this to your hadoop-hive.env:

CORE_CONF_fs_s3a_aws_credentials_provider=org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider
CORE_CONF_fs_s3a_access_key=***
CORE_CONF_fs_s3a_secret_key=***
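
For example, with these set, the external table from the top of the thread can be created without inline keys, using the s3a scheme (a sketch; bucket and path are placeholders):

CREATE EXTERNAL TABLE IF NOT EXISTS aaa (
event_id string,
make string)
STORED AS PARQUET
LOCATION 's3a://<some_bucket>/<some_path>';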

More info in the docs. It worked for me with MinIO object storage; I think with S3 it will be the same.

shveikus avatar Mar 02 '20 10:03 shveikus

@shveikus @GrigorievNick Can you provide the detailed steps to make this work? I'm new to Docker. I tried adding the following to the Dockerfile after creating pom.xml in the same folder. No luck.

Ravipillala avatar May 13 '20 05:05 Ravipillala

Hi @Ravipillala, I don't remember all the details, but here is the Dockerfile that I use.

#FROM fabric8/java-alpine-openjdk8-jdk:1.6.3
#FROM adoptopenjdk/openjdk11:alpine-slim
FROM maven:3.6.1-jdk-8 AS DEPENDENCY_BUILD
ADD pom.xml pom.xml
RUN mvn dependency:copy-dependencies -DoutputDirectory=/external-jars/

FROM bde2020/hive:2.3.2-postgresql-metastore

COPY --from=DEPENDENCY_BUILD /external-jars/ /external-jars/

And pom.xml

<?xml version="1.0" encoding="UTF-8"?>

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.appsflyer.datainfra.hive</groupId>
    <artifactId>hive</artifactId>
    <version>1.0-SNAPSHOT</version>
    <name>hive-metastore-external-dependency</name>

    <dependencies>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-aws</artifactId>
            <version>2.7.7</version>
        </dependency>
        <dependency>
            <groupId>com.amazonaws</groupId>
            <artifactId>aws-java-sdk</artifactId>
            <version>1.7.4</version>
        </dependency>
    </dependencies>
</project>

And the build command

docker build -t artifactory.com:5000/mykola.hryhoriev/hive:2.3.2-postgresql-metastore hivemetastore/
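
To sanity-check that the dependency jars landed in the image, you can list them (a sketch, using the tag from the build command above):

docker run --rm artifactory.com:5000/mykola.hryhoriev/hive:2.3.2-postgresql-metastore ls /external-jars/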

I personally deploy this to k8s, so I will add the YAML; you can see the env variables and configs there.

apiVersion: v1
data:
  HIVE_AUX_JARS_PATH: /external-jars/
  HIVE_SITE_CONF_datanucleus_autoCreateSchema: "false"
  HIVE_SITE_CONF_fs_s3a_access_key: enter your key here
  HIVE_SITE_CONF_fs_s3a_secret_key: enter your secret here
  HIVE_SITE_CONF_hive_metastore_uris: thrift://hive-metastore:9083
  HIVE_SITE_CONF_hive_metastore_warehouse_dir: s3a://stage-presto-iceberg-metadata/iceberg-catalog
  HIVE_SITE_CONF_javax_jdo_option_ConnectionDriverName: org.postgresql.Driver
  HIVE_SITE_CONF_javax_jdo_option_ConnectionPassword: hive
  HIVE_SITE_CONF_javax_jdo_option_ConnectionURL: jdbc:postgresql://hive-metastore-postgresql/metastore
  HIVE_SITE_CONF_javax_jdo_option_ConnectionUserName: hive
kind: ConfigMap
metadata:
  name: hive-env
  namespace: bigdata
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hive-metastore
  namespace: bigdata
spec:
  selector:
    matchLabels:
      app: hive-metastore
  replicas: 1
  template:
    metadata:
      labels:
        app: hive-metastore
    spec:
      nodeSelector:
        kubernetes.io/lifecycle: normal
      containers:
        - name: hive-metastore
          image: artifactory.com:5000/mykola.hryhoriev/hive:2.3.2-postgresql-metastore
          args:
            - /opt/hive/bin/hive --service metastore
          imagePullPolicy: Always
          env:
            - name: SERVICE_PRECONDITION
              value: hive-metastore-postgresql:5432
          envFrom:
            - configMapRef:
                name: hive-env
          ports:
            - containerPort: 9083
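
Assuming the manifest above is saved as hive-metastore.yaml (the filename is arbitrary), deploying it is the usual push-then-apply (a sketch):

docker push artifactory.com:5000/mykola.hryhoriev/hive:2.3.2-postgresql-metastore
kubectl apply -f hive-metastore.yaml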

GrigorievNick avatar May 17 '20 12:05 GrigorievNick

Hi @Ravipillala, you can try this:

  1. Create a folder and put a Dockerfile there:

FROM maven:3.6.1-jdk-8 AS DEPENDENCY_BUILD
ADD pom.xml pom.xml
RUN mvn dependency:copy-dependencies -DoutputDirectory=/external-jars/

FROM bde2020/hive:2.3.2-postgresql-metastore

COPY --from=DEPENDENCY_BUILD /external-jars/ /external-jars/

and pom.xml (this one is my own):


<?xml version="1.0" encoding="UTF-8"?>

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.datainfra.hive</groupId>
    <artifactId>hive</artifactId>
    <version>1.0-SNAPSHOT</version>
    <name>hive-metastore-external-dependency</name>

    <dependencies>
       <dependency>
          <groupId>org.apache.hadoop</groupId>
          <artifactId>hadoop-mapreduce-client-shuffle</artifactId>
          <version>2.7.4</version>
        </dependency>
        <dependency>
          <groupId>org.apache.hadoop</groupId>
          <artifactId>hadoop-mapreduce-client-jobclient</artifactId>
          <version>2.7.4</version>
        </dependency>
       <dependency>
          <groupId>org.apache.hadoop</groupId>
          <artifactId>hadoop-client</artifactId>
          <version>2.7.4</version>
        </dependency>
        <dependency>
          <groupId>org.apache.hadoop</groupId>
          <artifactId>hadoop-aws</artifactId>
          <version>2.7.4</version>
        </dependency>
        <dependency>
            <groupId>com.amazonaws</groupId>
            <artifactId>aws-java-sdk</artifactId>
            <version>1.7.4</version>
        </dependency>
        <dependency>
          <groupId>org.keedio.openx.data</groupId>
          <artifactId>json-serde</artifactId>
          <version>1.3.7.3</version>
        </dependency>
    </dependencies>
  </project>
  2. After that, build a new Docker image with your tag:
docker build . -t my-hive-aws:0.1
  3. Then you need to replace all the stock images with yours in your compose file, something like this:
  hive-server:
    image: my-hive-aws:0.1
    env_file:
      - ./hadoop-hive.env
    environment:
      HIVE_CORE_CONF_javax_jdo_option_ConnectionURL: "jdbc:postgresql://hive-metastore/metastore"
      SERVICE_PRECONDITION: "hive-metastore:9083"
    ports:
      - "10000:10000"
  4. Then add the credentials to your hadoop-hive.env as I mentioned above (a quick check follows below).
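
For a quick check (a sketch; assumes the stock bde2020 image layout, where beeline lives under /opt/hive/bin), connect to the published hive-server port and re-run the CREATE EXTERNAL TABLE with an s3a:// LOCATION:

docker-compose exec hive-server /opt/hive/bin/beeline -u jdbc:hive2://localhost:10000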

Hope this will be useful.

shveikus avatar May 17 '20 17:05 shveikus

Hi, I am sorry for the off-topic. @Ravipillala, a working PoC of hive+s3+presto: https://github.com/zhenik-poc/big-data-stack-practice

zhenik avatar May 18 '20 16:05 zhenik

Thank you

Ravipillala avatar May 19 '20 21:05 Ravipillala