
Configuring AWS S3 with the Hadoop and Hive 4 setup hangs without giving any error

Open AwasthiSomesh opened this issue 1 year ago • 16 comments

Apache Iceberg version

1.6.1 (latest release)

Query engine

Hive

Please describe the bug 🐞

I am trying to configure AWS S3 configuration with the Hadoop and Hive setup.

While doing so, running hadoop fs -ls s3a://somesh.qa.bucket/ first failed with:

Fatal internal error java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found

To resolve this, I added hadoop-aws-3.3.6.jar and aws-java-sdk-bundle-1.12.770.jar to the Hadoop classpath, i.e. under /usr/local/hadoop/share/hadoop/common/lib.
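For reference, a minimal sketch of that step (paths as above; assumes the hadoop command is on the PATH):

    # copy the S3A connector and the matching v1 SDK bundle onto the Hadoop classpath
    cp hadoop-aws-3.3.6.jar aws-java-sdk-bundle-1.12.770.jar /usr/local/hadoop/share/hadoop/common/lib/
    # verify that Hadoop can now see both jars
    hadoop classpath --glob | tr ':' '\n' | grep -E 'hadoop-aws|aws-java-sdk'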

I also added the S3-related configuration to core-site.xml under the /usr/local/hadoop/etc/hadoop directory:

    <property>
      <name>fs.default.name</name>
      <value>s3a://somesh.qa.bucket</value>
    </property>
    <property>
      <name>fs.s3a.impl</name>
      <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
    </property>
    <property>
      <name>fs.s3a.endpoint</name>
      <value>s3.us-west-2.amazonaws.com</value>
    </property>
    <property>
      <name>fs.s3a.access.key</name>
      <value>{Access_Key_Value}</value>
    </property>
    <property>
      <name>fs.s3a.secret.key</name>
      <value>{Secret_Key_Value}</value>
    </property>
    <property>
      <name>fs.s3a.path.style.access</name>
      <value>false</value>
    </property>

Now when we run hadoop fs -ls s3a://somesh.qa.bucket/, we observe the following exception:

2024-08-22 13:50:11,294 INFO impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties
2024-08-22 13:50:11,376 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
2024-08-22 13:50:11,376 INFO impl.MetricsSystemImpl: s3a-file-system metrics system started
2024-08-22 13:50:11,434 WARN util.VersionInfoUtils: The AWS SDK for Java 1.x entered maintenance mode starting July 31, 2024 and will reach end of support on December 31, 2025. For more information, see https://aws.amazon.com/blogs/developer/the-aws-sdk-for-java-1-x-is-in-maintenance-mode-effective-july-31-2024/
You can print where on the file system the AWS SDK for Java 1.x core runtime is located by setting the AWS_JAVA_V1_PRINT_LOCATION environment variable or aws.java.v1.printLocation system property to 'true'.
This message can be disabled by setting the AWS_JAVA_V1_DISABLE_DEPRECATION_ANNOUNCEMENT environment variable or aws.java.v1.disableDeprecationAnnouncement system property to 'true'.
The AWS SDK for Java 1.x is being used here:
    at java.lang.Thread.getStackTrace(Thread.java:1564)
    at com.amazonaws.util.VersionInfoUtils.printDeprecationAnnouncement(VersionInfoUtils.java:81)
    at com.amazonaws.util.VersionInfoUtils.(VersionInfoUtils.java:59)
    at com.amazonaws.internal.EC2ResourceFetcher.(EC2ResourceFetcher.java:44)
    at com.amazonaws.auth.InstanceMetadataServiceCredentialsFetcher.(InstanceMetadataServiceCredentialsFetcher.java:38)
    at com.amazonaws.auth.InstanceProfileCredentialsProvider.(InstanceProfileCredentialsProvider.java:111)
    at com.amazonaws.auth.InstanceProfileCredentialsProvider.(InstanceProfileCredentialsProvider.java:91)
    at com.amazonaws.auth.InstanceProfileCredentialsProvider.(InstanceProfileCredentialsProvider.java:75)
    at com.amazonaws.auth.InstanceProfileCredentialsProvider.(InstanceProfileCredentialsProvider.java:58)
    at com.amazonaws.auth.EC2ContainerCredentialsProviderWrapper.initializeProvider(EC2ContainerCredentialsProviderWrapper.java:66)
    at com.amazonaws.auth.EC2ContainerCredentialsProviderWrapper.(EC2ContainerCredentialsProviderWrapper.java:55)
    at org.apache.hadoop.fs.s3a.auth.IAMInstanceCredentialsProvider.(IAMInstanceCredentialsProvider.java:53)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at org.apache.hadoop.fs.s3a.S3AUtils.createAWSCredentialProvider(S3AUtils.java:727)
    at org.apache.hadoop.fs.s3a.S3AUtils.buildAWSProviderList(S3AUtils.java:659)
    at org.apache.hadoop.fs.s3a.S3AUtils.createAWSCredentialProviderSet(S3AUtils.java:585)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.bindAWSClient(S3AFileSystem.java:959)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:586)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3611)
    at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3712)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3663)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:557)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
    at org.apache.hadoop.fs.shell.PathData.expandAsGlob(PathData.java:347)
    at org.apache.hadoop.fs.shell.Command.expandArgument(Command.java:264)
    at org.apache.hadoop.fs.shell.Command.expandArguments(Command.java:247)
    at org.apache.hadoop.fs.shell.FsCommand.processRawArguments(FsCommand.java:105)
    at org.apache.hadoop.fs.shell.Command.run(Command.java:191)
    at org.apache.hadoop.fs.FsShell.run(FsShell.java:327)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:82)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:97)
    at org.apache.hadoop.fs.FsShell.main(FsShell.java:390)
ls: s3a://infa.qa.bucket/: org.apache.hadoop.fs.s3a.auth.NoAuthWithAWSException: No AWS Credentials provided by TemporaryAWSCredentialsProvider SimpleAWSCredentialsProvider EnvironmentVariableCredentialsProvider IAMInstanceCredentialsProvider : com.amazonaws.SdkClientException: Unable to load AWS credentials from environment variables (AWS_ACCESS_KEY_ID (or AWS_ACCESS_KEY) and AWS_SECRET_KEY (or AWS_SECRET_ACCESS_KEY))
2024-08-22 13:50:14,248 INFO impl.MetricsSystemImpl: Stopping s3a-file-system metrics system...
2024-08-22 13:50:14,248 INFO impl.MetricsSystemImpl: s3a-file-system metrics system stopped.
2024-08-22 13:50:14,248 INFO impl.MetricsSystemImpl: s3a-file-system metrics system shutdown complete.
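One provider in that chain (EnvironmentVariableCredentialsProvider) reads credentials from environment variables, so as a quick sanity check the credentials can be exported directly before retrying (a sketch only; the braced values are placeholders for the real keys):

    export AWS_ACCESS_KEY_ID={Access_Key_Value}
    export AWS_SECRET_ACCESS_KEY={Secret_Key_Value}
    hadoop fs -ls s3a://somesh.qa.bucket/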

Could you please help us resolve this issue as soon as possible?

Willingness to contribute

  • [ ] I can contribute a fix for this bug independently
  • [X] I would be willing to contribute a fix for this bug with guidance from the Iceberg community
  • [ ] I cannot contribute a fix for this bug at this time

AwasthiSomesh avatar Sep 16 '24 09:09 AwasthiSomesh

If this is a Hive4 issue, could you please try to talk to the Hive team, as the Hive4 integration is owned by them. Thanks, Peter

pvary avatar Sep 16 '24 15:09 pvary

@pvary We are facing this issue with Iceberg on Hive and are not sure which team can better help with this.

Please suggest if you know anything; we have also raised this with the Hive team.

AwasthiSomesh avatar Sep 17 '24 09:09 AwasthiSomesh

@AwasthiSomesh: The issue name suggests that this problem happens with Hive 4. That is why I suggested that the Apache Hive team could help you better. The Hive 4 integration is maintained by them. It is entirely possible that they could point out some issues with the Iceberg code, but there is some very specific Hive code that runs before the Iceberg APIs are called.

pvary avatar Sep 17 '24 09:09 pvary

@pvary thanks for your update.

AwasthiSomesh avatar Sep 17 '24 09:09 AwasthiSomesh

It looks like Hive issue discussion is not available through GitHub. Does anyone know how to reach the Hive 4 team via GitHub?

AwasthiSomesh avatar Sep 17 '24 09:09 AwasthiSomesh

You should create a Jira (https://issues.apache.org/jira/projects/HIVE/issues/HIVE-25351?filter=allopenissues) or use the dev/user mailing lists to communicate. See the GitHub README: https://github.com/apache/hive

pvary avatar Sep 17 '24 12:09 pvary

@pvary Thanks a lot for your quick response.

I have the two questions below; could you please help me with your comments?

Q1. As mentioned in the official Iceberg documentation, Hive 4 is supported for Iceberg without any extra dependency: https://iceberg.apache.org/docs/latest/hive/#feature-support

Is this supported only with HDFS storage, or can we use it with S3/ADLS Gen2 as well?

Q2. If Hive 4 is not supported for other external storage like S3/ADLS Gen2, what is the alternative? Do we have any other option, such as Hive 3/2/1 with extra dependencies, to use Iceberg with the Hive catalog on S3/ADLS Gen2 storage?

Could you please help here?

Thanks, Somesh

AwasthiSomesh avatar Sep 18 '24 05:09 AwasthiSomesh

@pvary If Iceberg supports ADLS Gen2, what configuration is required to use it seamlessly?

AwasthiSomesh avatar Sep 18 '24 05:09 AwasthiSomesh

@pvary /all can anyone help here?

AwasthiSomesh avatar Sep 18 '24 08:09 AwasthiSomesh

@AwasthiSomesh: This should help: https://iceberg.apache.org/docs/nightly/kafka-connect/?h=adls#azure-adls-configuration-example
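The linked example boils down to pointing Iceberg's FileIO at ADLS and supplying storage-account credentials, roughly like this (property names are from the Iceberg Azure module as used in that Kafka Connect example; the account values are placeholders):

    iceberg.catalog.io-impl=org.apache.iceberg.azure.adlsv2.ADLSFileIO
    iceberg.catalog.adls.auth.shared-key.account.name={storage_account_name}
    iceberg.catalog.adls.auth.shared-key.account.key={storage_account_key}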

pvary avatar Sep 18 '24 19:09 pvary

@pvary I am able to create an Iceberg table using the Hive 4 setup and can insert data as well, but when we try to read it back, the result is empty.


Now when we look at the S3 location, all the data files have been created there.

Could you please let me know if there is anything else we need to do?
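A minimal sketch of the sequence we run (table and column names are illustrative):

    CREATE TABLE t1 (id INT, name STRING) STORED BY ICEBERG;
    INSERT INTO t1 VALUES (1, 'a'), (2, 'b');
    SELECT * FROM t1;  -- returns no rows, although data files exist in S3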

AwasthiSomesh avatar Sep 19 '24 09:09 AwasthiSomesh

@pvary we set hive.execution.engine=mr for the insert, since otherwise the insert was not working with the Tez engine.

But with MR we are not able to read a single table with Hive 4.0.0-alpha-2.

With Tez we are facing the below error while inserting records.

Error:

….6.jar:?]
    at org.apache.hadoop.fs.s3a.S3AFileSystem.lambda$exists$34(S3AFileSystem.java:4636) ~[hadoop-aws-3.3.6.jar:?]
    at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.invokeTrackingDuration(IOStatisticsBinding.java:547) ~[hadoop-common-3.3.6.jar:?]
    at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.lambda$trackDurationOfOperation$5(IOStatisticsBinding.java:528) ~[hadoop-common-3.3.6.jar:?]
    at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.trackDuration(IOStatisticsBinding.java:449) ~[hadoop-common-3.3.6.jar:?]
    at org.apache.hadoop.fs.s3a.S3AFileSystem.trackDurationAndSpan(S3AFileSystem.java:2480) ~[hadoop-aws-3.3.6.jar:?]
    at org.apache.hadoop.fs.s3a.S3AFileSystem.trackDurationAndSpan(S3AFileSystem.java:2499) ~[hadoop-aws-3.3.6.jar:?]
    at org.apache.hadoop.fs.s3a.S3AFileSystem.exists(S3AFileSystem.java:4634) ~[hadoop-aws-3.3.6.jar:?]
    at org.apache.tez.common.TezCommonUtils.getTezBaseStagingPath(TezCommonUtils.java:91) ~[tez-api-0.10.3.jar:0.10.3]
    at org.apache.tez.common.TezCommonUtils.getTezSystemStagingPath(TezCommonUtils.java:149) ~[tez-api-0.10.3.jar:0.10.3]
    at org.apache.tez.dag.app.DAGAppMaster.serviceInit(DAGAppMaster.java:492) ~[tez-dag-0.10.3.jar:0.10.3]
    at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) ~[hadoop-common-3.3.6.jar:?]
    at org.apache.tez.dag.app.DAGAppMaster$9.run(DAGAppMaster.java:2644) ~[tez-dag-0.10.3.jar:0.10.3]
    at java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_342]
    at javax.security.auth.Subject.doAs(Subject.java:422) ~[?:1.8.0_342]
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899) ~[hadoop-common-3.3.6.jar:?]
    at org.apache.tez.dag.app.DAGAppMaster.initAndStartAppMaster(DAGAppMaster.java:2641) ~[tez-dag-0.10.3.jar:0.10.3]
    at org.apache.tez.client.LocalClient$1.run(LocalClient.java:361) ~[tez-dag-0.10.3.jar:0.10.3]
    ... 1 more
ERROR : FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.tez.TezTask. java.io.IOException: org.apache.tez.dag.api.TezUncheckedException: java.nio.file.AccessDeniedException: s3a://com.anush/opt/hive/scratch_dir/hive/_tez_session_dir/0c1896fa-2b9d-4461-9ab4-ced0fd46ef48: org.apache.hadoop.fs.s3a.auth.NoAuthWithAWSException: No AWS Credentials provided by TemporaryAWSCredentialsProvider SimpleAWSCredentialsProvider EnvironmentVariableCredentialsProvider IAMInstanceCredentialsProvider : com.amazonaws.SdkClientException: Unable to load AWS credentials from environment variables (AWS_ACCESS_KEY_ID (or AWS_ACCESS_KEY) and AWS_SECRET_KEY (or AWS_SECRET_ACCESS_KEY))
INFO : Completed executing command(queryId=hive_20240919065346_a71fd349-e14c-4bfa-9fb7-0b1b396565e3); Time taken: 44.607 seconds
Error: Error while compiling statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.tez.TezTask. java.io.IOException: org.apache.tez.dag.api.TezUncheckedException: java.nio.file.AccessDeniedException: s3a://com.anush/opt/hive/scratch_dir/hive/_tez_session_dir/0c1896fa-2b9d-4461-9ab4-ced0fd46ef48: org.apache.hadoop.fs.s3a.auth.NoAuthWithAWSException: No AWS Credentials provided by TemporaryAWSCredentialsProvider SimpleAWSCredentialsProvider EnvironmentVariableCredentialsProvider IAMInstanceCredentialsProvider : com.amazonaws.SdkClientException: Unable to load AWS credentials from environment variables (AWS_ACCESS_KEY_ID (or AWS_ACCESS_KEY) and AWS_SECRET_KEY (or AWS_SECRET_ACCESS_KEY)) (state=08S01,code=1)
0: jdbc:hive2://localhost:10000/>

AwasthiSomesh avatar Sep 19 '24 15:09 AwasthiSomesh

Hi all, can anyone please help here?

AwasthiSomesh avatar Sep 19 '24 16:09 AwasthiSomesh

@AwasthiSomesh hello, could you apply a patch from me and try again?

BsoBird avatar Oct 18 '24 07:10 BsoBird

@pvary please share the details and I will try it.

AwasthiSomesh avatar Oct 18 '24 09:10 AwasthiSomesh

@AwasthiSomesh check your email.

BsoBird avatar Oct 18 '24 09:10 BsoBird

ERROR : FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.tez.TezTask. java.io.IOException: org.apache.tez.dag.api.TezUncheckedException: java.nio.file.AccessDeniedException: s3a://com.anush/opt/hive/scratch_dir/hive/_tez_session_dir/0c1896fa-2b9d-4461-9ab4-ced0fd46ef48:

you don't have write permission to that path.

Tez should handle it better

If your bucket really is called "com.anush": no, the S3A filesystem doesn't support bucket names with dots. Amazon says such names are "exclusively for web sites", and with good reason.

Also, that AWS deprecation warning flags that you are using a later version of the AWS SDK than any Hadoop release ships with. Your choice, but bear in mind it hasn't been qualified, and those SDKs can be fussy at times.
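Separately, one way to sidestep the failing Tez staging path while debugging (an assumption, not a confirmed fix) is to move the Hive scratch directory off S3 in hive-site.xml; hive.exec.scratchdir is a standard Hive property, and the HDFS path here is illustrative:

    <property>
      <name>hive.exec.scratchdir</name>
      <value>hdfs:///tmp/hive</value>
    </property>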

steveloughran avatar Oct 21 '24 19:10 steveloughran

@AwasthiSomesh any updates here?

Incidentally, that warning about an EOL for the AWS SDK comes about because the hadoop binaries you are using use the AWS v1 SDK. This is deprecated by AWS, though it does have the advantage of being far more resilient to transient S3 failures.

2024-08-22 13:50:11,434 WARN util.VersionInfoUtils: The AWS SDK for Java 1.x entered maintenance mode starting July 31, 2024 and will reach end of support on December 31, 2025. For more information, see https://aws.amazon.com/blogs/developer/the-aws-sdk-for-java-1-x-is-in-maintenance-mode-effective-july-31-2024/
You can print where on the file system the AWS SDK for Java 1.x core runtime is located by setting the AWS_JAVA_V1_PRINT_LOCATION environment variable or aws.java.v1.printLocation system property to 'true'.
  • you can shut this up by finding whichever AWS class is printing the warning and setting its log level to FATAL
  • you should upgrade to a set of Hadoop 3.4.1 binaries and the matching "bundle.jar" with all the AWS v2 artifacts and shaded dependencies; a sketch of that upgrade follows below
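A sketch of that upgrade path (the download URL is assumed from the usual Apache mirror layout; verify checksums and pick whichever bundle version the 3.4.1 release actually ships):

    # fetch and unpack the Hadoop 3.4.1 release
    curl -O https://downloads.apache.org/hadoop/common/hadoop-3.4.1/hadoop-3.4.1.tar.gz
    tar -xzf hadoop-3.4.1.tar.gz -C /usr/local
    # the 3.4.1 S3A connector and the shaded AWS v2 SDK bundle ship under tools/lib
    ls /usr/local/hadoop-3.4.1/share/hadoop/tools/lib | grep -E 'hadoop-aws|bundle'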

steveloughran avatar Jan 15 '25 14:01 steveloughran

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.

github-actions[bot] avatar Jul 15 '25 00:07 github-actions[bot]

This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'.

github-actions[bot] avatar Jul 29 '25 00:07 github-actions[bot]