skein
Adding HiveServer Credential Provider
Hello Jim, first I would like to thank you for implementing this library. We needed a way to launch Jupyter containers on our existing Hadoop cluster, and your libraries fit that need very well. In our Hadoop cluster, users get their data by connecting to our Hive database. Since we have a kerberised cluster, the way to connect from a YARN container is to use delegation tokens.
I implemented the part that connects to Hive and obtains a delegation token, which is then added to the rest of the tokens. I used the Oozie implementation for inspiration (the CredentialProvider.java interface is also modeled on the one there): https://github.com/apache/oozie/blob/master/core/src/main/java/org/apache/oozie/action/hadoop/Hive2Credentials.java
This is my first draft. Could you please have a look at it?
The Travis CI checks failed with message:
No output has been received in the last 10m0s, this potentially indicates a stalled build or something wrong with the build itself.
I think that the stalling is caused by the build itself and not by the new code. I would appreciate any guidance here.
I created a pull-request in yarnspawner as well: https://github.com/jupyterhub/yarnspawner/pull/17
Thank you for reviewing my code :). I understood your comments. I will come back with the answers and a new draft later today.
Thanks for working on this! Apologies for the delayed review here. I've left a few comments on the implementation.
A few general questions:
- Is credential_providers the best name for this field? What terminology do other systems use?
- What other credential providers may a user want us to support?
- Are uri and principal sufficient information for all other implementations?
Hi, I answered the questions:
- Is credential_providers the best name for this field? What terminology do other systems use?
  - In Oozie it's called CredentialsProvider.
  - In Spark it's called HadoopDelegationTokenProvider.
  - Researching this, I found that Hadoop has a CredentialProvider API, which is more general than what we want.
  Since we are only dealing with delegation tokens, maybe we can rename "credential_providers" to something more specific: hadoop_delegation_token_provider?
  This name would be in line with the Hadoop & Kerberos book, which talks about "delegation tokens" and "Hadoop tokens".
- What other credential providers may a user want us to support? From reading the Oozie and Spark code: HCAT (Hive Metastore), HBase, JHS (Hadoop Job History Server), Kafka.
- Are uri and principal sufficient information for all other implementations? After some research, the answer is a sad no.
  - HBase uses the HBase client configuration and the Hadoop job configuration it receives as input.
  - I think that the Hadoop Job History Server uses the Hadoop configuration.
  - Kafka similarly has its own configuration: KafkaTokenClusterConf
I was thinking: can we have a dictionary<str, str> in protobuf that can be filled with whatever configuration (as key-value pairs) each provider needs, all bundled together? It might not be very nice to change the protobuf every time we add a provider.
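To make the dictionary idea concrete, here is a minimal Python sketch. It assumes a hypothetical `DelegationTokenProviderSpec` type that is not part of skein or of this PR; the `require` helper and the example Hive URI and principal values are likewise made up for illustration. Only the key names `uri` and `principal` come from the discussion above.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass(frozen=True)
class DelegationTokenProviderSpec:
    """Hypothetical spec: a provider name plus free-form string config.

    The config dict stands in for a protobuf ``map<string, string>`` field,
    so adding a new provider would not require a schema change.
    """
    name: str
    config: Dict[str, str] = field(default_factory=dict)

    def __post_init__(self):
        # A string-to-string map can only hold strings, so reject
        # anything else before it would reach serialization.
        for key, value in self.config.items():
            if not isinstance(key, str) or not isinstance(value, str):
                raise TypeError(
                    f"config entries must be str -> str, got {key!r}: {value!r}"
                )

    def require(self, *keys: str) -> Dict[str, str]:
        """Return the requested entries, failing if any key is absent."""
        missing = [k for k in keys if k not in self.config]
        if missing:
            raise ValueError(
                f"provider {self.name!r} is missing config keys: {missing}"
            )
        return {k: self.config[k] for k in keys}


# Example: a Hive provider only needs the two keys discussed above.
# Both values are placeholders, not real endpoints.
hive = DelegationTokenProviderSpec(
    name="hive",
    config={
        "uri": "jdbc:hive2://hive.example.com:10000/default",
        "principal": "hive/_HOST@EXAMPLE.COM",
    },
)
hive_settings = hive.require("uri", "principal")
```

The trade-off of this design is that validation moves from the schema into each provider: every implementation has to check for the keys it needs, which is what `require` illustrates.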
Hi Jim, my second draft is ready for review. I think I covered all the things you mentioned above. If you could have a look when you have time, that would be great :).
Thanks @georgepachitariu. I'm currently on break between jobs (and without a computer), but will try to look at the changes as soon as I can (max 2 weeks from now). Apologies, thanks for being patient.
No worries @jcrist, take your time.
This is an amazing effort at enabling good support for exhaustive services in Hadoop, thanks! Any ETA on when we can expect this to be available for use?
Hi all. I just started a new job, so a fair bit of my time is occupied ramping up on that. I expect to be able to give this a good review by end-of-week though. Thanks for your patience.
This is a very nice PR, exactly what we currently need for our platform. How can I help to get it merged? Really looking forward to helping and to having it in master.
Hi @santosh-d3vpl3x @wundervaflja. It's very cool to see that other people have the same ideas as me. As Jim said, please be a little patient. We will work towards a solution we like. If you like to live on the edge, you can install this branch in your deployment. (That's what I did :D )
Hello, do you have an estimate on the completeness of this PR ? I'm really interested, and ready to help if needed.
Hi @gboutry, nice to meet you! Sadly this branch didn't get merged (and I don't work on data engineering systems anymore), BUT the branch has the complete functionality. So I would advise you to build using this branch and try it out.
Hi @georgepachitariu,
Sorry for the late reply. I was able to test your work, and it does work as expected; you did a really good job. Many thanks.
(I rebased your branch on skein master, and it worked well)