Adding HiveServer Credential Provider

Open georgepachitariu opened this issue 4 years ago • 14 comments

Hello Jim, first I would like to thank you for implementing this library. We needed a way to launch Jupyter containers on our existing Hadoop cluster, and your libraries fit that need very well. On our cluster, users get their data by connecting to our Hive database. Since the cluster is Kerberized, the way to connect from a YARN container is to use delegation tokens.

I implemented the part that connects to Hive and obtains a delegation token, which is then added to the rest of the tokens. I used the Oozie implementation for inspiration (the CredentialProvider.java interface also follows the one there): https://github.com/apache/oozie/blob/master/core/src/main/java/org/apache/oozie/action/hadoop/Hive2Credentials.java

This is my first draft. Could you please have a look at it?

georgepachitariu avatar Mar 10 '20 16:03 georgepachitariu

The Travis CI checks failed with the message: No output has been received in the last 10m0s, this potentially indicates a stalled build or something wrong with the build itself.

I think the stall is caused by the build itself and not by the new code. I would appreciate any guidance here.

georgepachitariu avatar Mar 13 '20 10:03 georgepachitariu

I created a pull-request in yarnspawner as well: https://github.com/jupyterhub/yarnspawner/pull/17

georgepachitariu avatar Mar 13 '20 11:03 georgepachitariu

Thank you for reviewing my code :). I understood your comments. I will come back with the answers and a new draft later today.

georgepachitariu avatar Mar 17 '20 14:03 georgepachitariu

Thanks for working on this! Apologies for the delayed review here. I've left a few comments on the implementation.

A few general questions:

  • Is credential_providers the best name for this field? What terminology do other systems use?
  • What other credential providers may a user want us to support?
  • Are uri and principal sufficient information for all other implementations?

Hi, I answered the questions:

  1. Is credential_providers the best name for this field? What terminology do other systems use?

Since we are only dealing with delegation tokens, maybe we can rename "credential_providers" to something more specific, such as hadoop_delegation_token_provider? That name would be in line with the Hadoop & Kerberos book, which uses the terms "delegation token" and "Hadoop tokens".

  2. What other credential providers may a user want us to support? From reading the Oozie and Spark code: HCAT (Hive Metastore), HBase, JHS (Hadoop Job History Server), Kafka.

  3. Are uri and principal sufficient information for all other implementations? After some research, the answer is a sad no.

  • HBase uses the HBase client configuration and the Hadoop job configuration it receives as input.
  • I think that the Hadoop Job History Server uses the Hadoop configuration.
  • Kafka similarly has its own configuration: KafkaTokenClusterConf

I was thinking: can we have a dictionary<str, str> in protobuf that can be filled with whatever configuration (as key-value pairs) each provider needs, all bundled together? It would not be very nice to change the protobuf every time we add a provider.
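To make the dictionary idea concrete, here is a minimal Python sketch of what such a generic provider entry could look like on the client side. It is only an illustration under assumptions: the class name DelegationTokenProvider, the REQUIRED_KEYS table, and the to_dict layout are hypothetical and are not taken from skein or from this branch. The only keys grounded in the discussion are uri and principal for Hive.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical table of required config keys per provider.
# "uri" and "principal" for Hive come from the discussion above;
# the empty HBase entry is a placeholder, since HBase would read its
# own client configuration rather than a fixed set of keys.
REQUIRED_KEYS: Dict[str, List[str]] = {
    "hive": ["uri", "principal"],
    "hbase": [],
}


@dataclass
class DelegationTokenProvider:
    """One provider entry: a name plus free-form string configuration."""

    name: str
    config: Dict[str, str] = field(default_factory=dict)

    def validate(self) -> None:
        # Reject provider names this sketch does not know about.
        if self.name not in REQUIRED_KEYS:
            raise ValueError(f"unknown provider: {self.name!r}")
        # Report every required key that is absent from the config.
        missing = [k for k in REQUIRED_KEYS[self.name] if k not in self.config]
        if missing:
            raise ValueError(
                f"provider {self.name!r} is missing keys: {', '.join(missing)}"
            )

    def to_dict(self) -> Dict[str, object]:
        # Shape that could map onto a message holding a string-to-string map.
        return {"name": self.name, "config": dict(self.config)}
```

With a shape like this, adding a new provider would only mean adding an entry to the key table, while the message definition itself stays unchanged.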

georgepachitariu avatar Mar 17 '20 19:03 georgepachitariu

Hi Jim, my second draft is ready for review. I think I covered all the things you mentioned above. If you could have a look when you have time, that would be great :).

georgepachitariu avatar Mar 20 '20 17:03 georgepachitariu

Thanks @georgepachitariu. I'm currently on break between jobs (and without a computer), but will try to look at the changes as soon as I can (max 2 weeks from now). Apologies, thanks for being patient.

jcrist avatar Mar 24 '20 15:03 jcrist

No worries @jcrist, take your time.

georgepachitariu avatar Mar 24 '20 18:03 georgepachitariu

This is an amazing effort toward good support for a wide range of Hadoop services, thanks! Any ETA on when we can expect this to be available for use?

santosh-d3vpl3x avatar Apr 21 '20 21:04 santosh-d3vpl3x

Hi all. I just started a new job, so a fair bit of my time is occupied ramping up on that. I expect to be able to give this a good review by end-of-week though. Thanks for your patience.

jcrist avatar Apr 21 '20 22:04 jcrist

This is a very nice PR, exactly what we currently need for our platform. How can I help to get it merged? I'm really looking forward to helping and to having it in master.

wundervaflja avatar Apr 23 '20 07:04 wundervaflja

Hi @santosh-d3vpl3x @wundervaflja. It's very cool to see that other people have the same ideas as me. As Jim said, please be a little patient. We will work towards a solution we like. If you like to live on the edge, you can install this branch in your deployment. (That's what I did :D )

georgepachitariu avatar Apr 28 '20 10:04 georgepachitariu

Hello, do you have an estimate of when this PR will be complete? I'm really interested and ready to help if needed.

gboutry avatar Nov 16 '22 19:11 gboutry

Hi @gboutry, nice to meet you! Sadly this branch didn't get merged (and I don't work on data engineering systems anymore), BUT the branch has the complete functionality. So I would advise you to build using this branch and try it out.

georgepachitariu avatar Nov 16 '22 23:11 georgepachitariu

Hi @georgepachitariu,

Sorry for the late reply. I was able to test your work, and it does indeed work as expected. You did a really good job. Many thanks.

(I rebased your branch on skein master, and it worked well)

gboutry avatar Nov 25 '22 23:11 gboutry