data-on-eks
[Feature] dbt on EMR on EKS
Community Note
- Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
- Please do not leave "+1" or other comments that do not add relevant new information or questions, they generate extra noise for issue followers and do not help prioritize the request
- If you are interested in working on this issue or have submitted a pull request, please leave a comment
What is the outcome that you are trying to reach?
- Performing data transformation using dbt (data build tool) on EKS.
Describe the solution you would like
- Extend the existing Spark Thrift Server class to run indefinitely and deploy as a Spark Job
- Deploy a service with a load balancer for external connection
- Set up a dbt project with the dbt-spark adapter where connection is made via the Spark Thrift Server.
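As a rough illustration of the last step, the sketch below builds a dbt profile that uses the dbt-spark `thrift` method and renders it as YAML for `profiles.yml`. The profile name `dbt_on_eks`, the host `thrift-server.example.com`, the `default` schema, and port 10000 (the usual Spark Thrift Server port) are assumptions made for the example, not values from this issue. The helper functions are hypothetical and use only the Python standard library.

```python
import json


def build_thrift_profile(host, port=10000, schema="default", profile_name="dbt_on_eks"):
    """Return a dbt profile dict that uses the dbt-spark thrift method.

    All argument defaults are illustrative assumptions.
    """
    return {
        profile_name: {
            "target": "dev",
            "outputs": {
                "dev": {
                    "type": "spark",
                    "method": "thrift",
                    "host": host,      # load balancer address of the Thrift Server
                    "port": port,      # set explicitly; do not rely on adapter defaults
                    "schema": schema,
                }
            },
        }
    }


def to_yaml(node, indent=0):
    """Render nested dicts of scalars as block-style YAML lines."""
    lines = []
    pad = " " * indent
    for key, value in node.items():
        if isinstance(value, dict):
            lines.append(f"{pad}{key}:")
            lines.extend(to_yaml(value, indent + 2))
        else:
            # JSON scalars (quoted strings, bare numbers) are also valid YAML scalars.
            lines.append(f"{pad}{key}: {json.dumps(value)}")
    return lines


if __name__ == "__main__":
    profile = build_thrift_profile("thrift-server.example.com")
    print("\n".join(to_yaml(profile)))
```

The printed text could be saved as `~/.dbt/profiles.yml`, with the host replaced by the address of the load balancer that fronts the Thrift Server.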
Describe alternatives you have considered
- Possibly using Apache Kyuubi to externalize the Spark Thrift Server, but that would be too much.
Additional context
- I'm happy to contribute to this feature. I just need a bit of help.
@jaehyeon-kim Thanks, dbt will be very useful for the community. Would this be possible with EMR on EKS? If not, please show it with OSS Spark.
Let us know if you need any help.
@vara-bonthu
The dbt-spark adapter supports the `odbc`, `thrift` and `http` connection methods. Only the `thrift` method is supported for OSS Spark. On EMR on EC2, the Spark Thrift Server can easily be started on the master node. However, a long-running Thrift Server is not supported by EMR on EKS (or by Spark on Kubernetes in general), so we need a tweak: we can extend the existing Spark Thrift Server class to run indefinitely. This works in my POC, but I need someone who can help check the build configuration. I'm not a Java developer, and the configuration should be updated. Let me update it shortly.
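Once a long-running Thrift Server is exposed outside the cluster, a quick way to confirm that a client such as dbt could reach it is a plain TCP check against the service endpoint. The sketch below is a minimal example using only the Python standard library. The function name is hypothetical, and the host and port in the usage comment are placeholders rather than values from this thread. It only verifies that the port accepts connections, not that the Thrift protocol handshake succeeds.

```python
import socket


def thrift_port_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host name and opens a TCP socket.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


# Example usage (placeholder endpoint, assumed for illustration):
# thrift_port_reachable("thrift-server.example.com", 10000)
```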
Hi @vara-bonthu
The menu bar and main page include existing sections. Which place would be good for dbt? Could you please create a skeleton for it if necessary? Or please inform me where to put the dbt contents.
Could you please provide full details of your implementation so that I can guide you accordingly?
If you are building a new Terraform blueprint for deploying dbt then you can place the code under https://github.com/awslabs/data-on-eks/tree/main/analytics/terraform/dbt-on-eks and the docs can go here -> https://github.com/awslabs/data-on-eks/tree/main/website/docs/spark-on-eks.
Please feel free to raise a PR so that I can suggest location changes after reviewing it.
Hi @vara-bonthu
How are you?
Sorry for replying late. These days I find it hard to make time for this, as my wife has a knee injury and I need to support her. I also have a 2-year-old who needs my care. I'll try to come back shortly with an example, as per your comment.
Cheers, Jaehyeon
Hey @jaehyeon-kim, thanks for the response. No worries. Take your time; it's not an urgent task.
Hey, any update on the enhancement?
I didn't have time, as I was working more on real-time processing. I'm now returning to dbt and should be able to update it. I'll keep you posted.
I am working on Kyuubi with EMR on EKS, which supports JDBC/Thrift/HTTP connections. Will that meet dbt's need?
This issue has been automatically marked as stale because it has been open 30 days with no activity. Remove stale label or comment or this issue will be closed in 10 days
Hello, any update on the enhancement?