docker-hive
[Question] How to make this hive/presto setup query parquet on AWS S3
I simply tried to add the aws hadoop jar to the path for each Docker image, and it did not seem to work. Any advice?
Thanks in advance.
I tried to create an external table:
CREATE EXTERNAL TABLE IF NOT EXISTS aaa (
event_id string,
make string)
STORED AS PARQUET
LOCATION 's3://<access_key>:<secret_key>@<some_bucket>/<some_path>';
But it ended up with:
Error: org.apache.hive.service.cli.HiveSQLException: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:Got exception: java.io.IOException /<some_path> doesn't exist)
I tried the same keys to look up the files on that path in S3 using the AWS CLI, and they were there.
We have a schema / external table on AWS Athena for this folder as well, and it works fine.
I have added the hadoop-aws Maven dependency:
FROM maven:3.6.1-jdk-8 AS DEPENDENCY_BUILD
ADD pom.xml pom.xml
RUN mvn dependency:copy-dependencies -DoutputDirectory=/external-jars/
FROM bde2020/hive:2.3.2-postgresql-metastore
COPY --from=DEPENDENCY_BUILD /external-jars/ /external-jars/
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.datainfra.hive</groupId>
  <artifactId>hive</artifactId>
  <version>1.0-SNAPSHOT</version>
  <name>hive-metastore-external-dependency</name>
  <dependencies>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-aws</artifactId>
      <version>2.7.7</version>
    </dependency>
    <dependency>
      <groupId>com.amazonaws</groupId>
      <artifactId>aws-java-sdk</artifactId>
      <version>1.7.4</version>
    </dependency>
  </dependencies>
</project>
HIVE_AUX_JARS_PATH=/external-jars/
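For context on what eventually works: with the hadoop-aws jars on HIVE_AUX_JARS_PATH and the fs.s3a credentials configured as described in the answers below, it should be possible to declare the same table against an s3a:// location without embedding keys in the URL. A minimal sketch, bucket and path kept as placeholders:
CREATE EXTERNAL TABLE IF NOT EXISTS aaa (
  event_id string,
  make string)
STORED AS PARQUET
LOCATION 's3a://<some_bucket>/<some_path>';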
Yep, rebuilding the 'bde2020/hive:2.3.2-postgresql-metastore' container with the AWS jar dependencies added, and using it for both hive-server and hive-metastore, worked pretty well for me, as @GrigorievNick recommended. Just a few additions: if you want to avoid putting S3 credentials in the table location when you create Hive tables, you need to add the following to your hadoop-hive.env:
CORE_CONF_fs_s3a_aws_credentials_provider=org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider
CORE_CONF_fs_s3a_access_key=***
CORE_CONF_fs_s3a_secret_key=***
More info in the docs. It worked for me with MinIO object storage; I think it will be the same with S3.
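For MinIO (or any other S3-compatible store), you will most likely also need to point the S3A client at the right endpoint and enable path-style access. A sketch of the extra hadoop-hive.env lines, assuming a MinIO server reachable at http://minio:9000 (hypothetical host and port):
CORE_CONF_fs_s3a_endpoint=http://minio:9000
CORE_CONF_fs_s3a_path_style_access=true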
@shveikus @GrigorievNick Can you provide the detailed steps to make this work? I'm new to Docker. I tried adding the following to the Dockerfile after creating pom.xml in the same folder. No luck.
Hi @Ravipillala, I don't remember all the details, but here is the Dockerfile that I use.
#FROM fabric8/java-alpine-openjdk8-jdk:1.6.3
#FROM adoptopenjdk/openjdk11:alpine-slim
FROM maven:3.6.1-jdk-8 AS DEPENDENCY_BUILD
ADD pom.xml pom.xml
RUN mvn dependency:copy-dependencies -DoutputDirectory=/external-jars/
FROM bde2020/hive:2.3.2-postgresql-metastore
COPY --from=DEPENDENCY_BUILD /external-jars/ /external-jars/
And the pom.xml:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.appsflyer.datainfra.hive</groupId>
  <artifactId>hive</artifactId>
  <version>1.0-SNAPSHOT</version>
  <name>hive-metastore-external-dependency</name>
  <dependencies>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-aws</artifactId>
      <version>2.7.7</version>
    </dependency>
    <dependency>
      <groupId>com.amazonaws</groupId>
      <artifactId>aws-java-sdk</artifactId>
      <version>1.7.4</version>
    </dependency>
  </dependencies>
</project>
And the build command:
docker build -t artifactory.com:5000/mykola.hryhoriev/hive:2.3.2-postgresql-metastore hivemetastore/
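Assuming a private registry like this one, the image also has to be pushed so that other hosts (or the k8s cluster described below) can pull it; the matching push command would be:
docker push artifactory.com:5000/mykola.hryhoriev/hive:2.3.2-postgresql-metastore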
I am personally deploying this to k8s, so I will add the YAML; you can see the env variables and configs there.
apiVersion: v1
data:
  HIVE_AUX_JARS_PATH: /external-jars/
  HIVE_SITE_CONF_datanucleus_autoCreateSchema: "false"
  HIVE_SITE_CONF_fs_s3a_access_key: Enter your key here
  HIVE_SITE_CONF_fs_s3a_secret_key: enter your secret
  HIVE_SITE_CONF_hive_metastore_uris: thrift://hive-metastore:9083
  HIVE_SITE_CONF_hive_metastore_warehouse_dir: s3a://stage-presto-iceberg-metadata/iceberg-catalog
  HIVE_SITE_CONF_javax_jdo_option_ConnectionDriverName: org.postgresql.Driver
  HIVE_SITE_CONF_javax_jdo_option_ConnectionPassword: hive
  HIVE_SITE_CONF_javax_jdo_option_ConnectionURL: jdbc:postgresql://hive-metastore-postgresql/metastore
  HIVE_SITE_CONF_javax_jdo_option_ConnectionUserName: hive
kind: ConfigMap
metadata:
  name: hive-env
  namespace: bigdata
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hive-metastore
  namespace: bigdata
spec:
  selector:
    matchLabels:
      app: hive-metastore
  replicas: 1
  template:
    metadata:
      labels:
        app: hive-metastore
    spec:
      nodeSelector:
        kubernetes.io/lifecycle: normal
      containers:
        - name: hive-metastore
          image: artifactory.com:5000/mykola.hryhoriev/hive:2.3.2-postgresql-metastore
          args:
            - /opt/hive/bin/hive --service metastore
          imagePullPolicy: Always
          env:
            - name: SERVICE_PRECONDITION
              value: hive-metastore-postgresql:5432
          envFrom:
            - configMapRef:
                name: hive-env
          ports:
            - containerPort: 9083
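The YAML above only shows the metastore Deployment; a sketch of the matching hive-server container spec, under the same kind of Deployment (my assumption, not from the original setup: HiveServer2 on its default port 10000, reusing the same image and ConfigMap):
      containers:
        - name: hive-server
          # assumption: start HiveServer2 instead of the metastore service
          image: artifactory.com:5000/mykola.hryhoriev/hive:2.3.2-postgresql-metastore
          args:
            - /opt/hive/bin/hive --service hiveserver2
          imagePullPolicy: Always
          env:
            - name: SERVICE_PRECONDITION
              value: hive-metastore:9083
          envFrom:
            - configMapRef:
                name: hive-env
          ports:
            - containerPort: 10000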
Hi @Ravipillala, you can try this:
- Create a folder and put the Dockerfile there:
FROM maven:3.6.1-jdk-8 AS DEPENDENCY_BUILD
ADD pom.xml pom.xml
RUN mvn dependency:copy-dependencies -DoutputDirectory=/external-jars/
FROM bde2020/hive:2.3.2-postgresql-metastore
COPY --from=DEPENDENCY_BUILD /external-jars/ /external-jars/
and pom.xml (this one is my own):
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.datainfra.hive</groupId>
  <artifactId>hive</artifactId>
  <version>1.0-SNAPSHOT</version>
  <name>hive-metastore-external-dependency</name>
  <dependencies>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-mapreduce-client-shuffle</artifactId>
      <version>2.7.4</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-mapreduce-client-jobclient</artifactId>
      <version>2.7.4</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>2.7.4</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-aws</artifactId>
      <version>2.7.4</version>
    </dependency>
    <dependency>
      <groupId>com.amazonaws</groupId>
      <artifactId>aws-java-sdk</artifactId>
      <version>1.7.4</version>
    </dependency>
    <dependency>
      <groupId>org.keedio.openx.data</groupId>
      <artifactId>json-serde</artifactId>
      <version>1.3.7.3</version>
    </dependency>
  </dependencies>
</project>
- After that, build a new Docker image with your tag:
docker build . -t my-hive-aws:0.1
- Then you need to replace all the stock images with yours, something like this:
hive-server:
  image: my-hive-aws:0.1
  env_file:
    - ./hadoop-hive.env
  environment:
    HIVE_CORE_CONF_javax_jdo_option_ConnectionURL: "jdbc:postgresql://hive-metastore/metastore"
    SERVICE_PRECONDITION: "hive-metastore:9083"
  ports:
    - "10000:10000"
- Then add the credentials to your hadoop-hive.env as I mentioned above; a quick check is shown below.
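Once everything is up, a quick way to check the S3A wiring end-to-end is to run the DDL through Beeline against the hive-server port mapped above. A sketch; the JDBC URL, table name, bucket and path are placeholders for your own values:
beeline -u jdbc:hive2://localhost:10000/default -e "
  CREATE EXTERNAL TABLE IF NOT EXISTS aaa (event_id string, make string)
  STORED AS PARQUET
  LOCATION 's3a://<some_bucket>/<some_path>';
  SELECT COUNT(*) FROM aaa;"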
Hope this will be useful.
Hi, I am sorry for going off-topic. @Ravipillala here is a working PoC for Hive + S3 + Presto: https://github.com/zhenik-poc/big-data-stack-practice
Thank you