Adoption of Spark-on-k8s-operator
We are looking for a new home for Spark-on-k8s-operator. The project was quite active for years, delivering a convenient way of running Spark in the Kubernetes environment. Unfortunately, due to some org changes, the previous maintainers are unable to give the project the time and love it and its users deserve. So, GoogleCloudPlatform would like to transfer ownership of the code (already under the Apache license) to an organisation that would help to bring more life to the project and continue to help users run Spark on K8s. Given that you support a wide variety of ML/batch frameworks (MPI, TF, PyTorch etc.), we think that Kubeflow would be a good place for the Spark operator.
cc: @terrytangyuan
+1 happy to sponsor this. This would be a great addition to the Kubeflow community. cc @james-jwu @theadactyl
cc @kubeflow/wg-training-leads
Thank you for proposing this @mwielgus!
I agree that the Spark operator might be useful for Kubeflow users who want to do data preparation, feature extraction, data validation, etc. before building and training their ML models. Currently, Kubeflow doesn't offer such functionality.
It would be nice if you could join our upcoming AutoML and Training WG Community call today (September 6th) at 6pm UTC (10am PST) to discuss the details and potential use-cases.
cc @kubeflow/wg-training-leads @tenzen-y @kuizhiqing
Is this proposal to have the Spark operator be an independent operator in Kubeflow?
Thanks, I will join the meeting today :).
Basically, SGTM. However, I have the same question that @johnugeorge asked.
FYI, The Kubeflow User Survey(s) have consistently shown that users would like a Spark / Kubeflow integration.
We will discuss if and how Kubeflow will support a Spark K8s operator in our Community Meeting on Tuesday; please find the bridge in these meeting notes. I suspect there may be several operators or implementations, and we need to decide if we are going to pick one, how it will be supported, whether it is part of a (new) Kubeflow Working Group, how it is installed, etc. @kimwnasptd @mwielgus Kubeflow community meeting notes: https://docs.google.com/document/d/1Wdxt1xedAj7qF_Rjmxy1R0NRdfv7UWs-r2PItewxHpE/edit.
@thesuperzapper
@mwielgus we had a Spark operator before. Are they using the modern Spark Connect? https://spark.apache.org/docs/latest/spark-connect-overview.html
You can already use the Kubernetes API server as the Spark master. So I am wondering whether that + Spark Connect is already enough. Anyway, I am open to contributions in manifests/contrib.
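For context, Spark's built-in Kubernetes scheduler already lets `spark-submit` target the API server directly, without any operator. A minimal invocation looks roughly like this (the API server host, image tag, and jar path are placeholders you would substitute for your cluster):

```shell
# Spark's native Kubernetes support: the API server itself acts as the "master".
# <api-server-host>, the image tag, and the example jar path are placeholders.
spark-submit \
  --master k8s://https://<api-server-host>:6443 \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=2 \
  --conf spark.kubernetes.container.image=spark:3.5.0 \
  local:///opt/spark/examples/jars/spark-examples_2.12-3.5.0.jar
```

The operator's value-add over this is declarative: it wraps such a submission in a `SparkApplication` custom resource that can be managed with `kubectl`, GitOps tooling, or (as below) Kubeflow Pipelines.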
Here is the recording for our initial discussion on Sep 9th around Spark Operator in Kubeflow: https://youtu.be/3D2h5OUNCQo.
@mwielgus Please can you attend Kubeflow Community call today at 8:00am PST, so we can have a followup discussion around Spark Operator: https://docs.google.com/document/d/1Wdxt1xedAj7qF_Rjmxy1R0NRdfv7UWs-r2PItewxHpE/edit#heading=h.xtqde2br5mh4.
cc @kubeflow/wg-training-leads
@andreyvelich I will be there.
Thank you, Marcin!
As a follow-up to our recent Apache Spark discussions in the Kubeflow Community meetings, we are requesting some user input... If you are a Spark user or contributor, the Kubeflow Community would like to know if you need active support for a Spark Kubernetes operator. If so, please comment or +1 on this GitHub issue. We need at least 10 users and would appreciate any ideas on use cases, e.g. integration with notebooks or Kubeflow Pipelines. Thanks! Josh
IMO, the fundamental gap is the lack of an SDK. Data scientists would rather write Python than YAML (for good reason). There needs to be (a) some clarification (and documentation) about the benefits of the Spark operator over PySpark, and (b) development of an SDK (perhaps an extension to the training-operator SDK).
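To make the SDK suggestion concrete, here is a rough sketch of what such a Python interface could look like. Note that `build_spark_application` and all of its parameters are hypothetical names invented for illustration; no such library exists today, and a real SDK would presumably follow the training-operator SDK's conventions instead:

```python
# Hypothetical sketch of a Python SDK surface for the Spark operator.
# None of these names come from a real library; they only illustrate the idea
# that users would build SparkApplication resources from Python, not YAML.

def build_spark_application(name: str, namespace: str, image: str,
                            main_file: str, executor_instances: int = 2) -> dict:
    """Build a SparkApplication custom resource as a plain Python dict."""
    return {
        "apiVersion": "sparkoperator.k8s.io/v1beta2",
        "kind": "SparkApplication",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "type": "Python",
            "mode": "cluster",
            "image": image,
            "mainApplicationFile": main_file,
            "executor": {"instances": executor_instances},
        },
    }

# A real client would then submit this dict via the Kubernetes Python client
# (e.g. CustomObjectsApi.create_namespaced_custom_object) instead of printing it.
app = build_spark_application(
    name="hello-pipeline",
    namespace="kubeflow",
    image="spark:3.5.0",
    main_file="local:///opt/spark/app/main.py",
)
print(app["metadata"]["name"])
```

The point is ergonomics: parameters become typed function arguments with defaults, rather than strings templated into YAML.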
@droctothorpe, we currently use the Spark Operator in a few of our projects. It makes it easy for us to deploy Spark jobs "natively" on K8s, much like how the training operators currently work, so I am not sure what you mean by the lack of an SDK here.
We use it with the Kubeflow Pipelines DSL:

```python
import json
from string import Template

from kfp import dsl

spark_json_template = Template("""
{
    "apiVersion": "sparkoperator.k8s.io/v1beta2",
    "kind": "SparkApplication",
    "metadata": {
        "name": "hello-pipeline",
        "namespace": "kubeflow"
    },
    "spec": {
        "type": "Scala",
        "mode": "cluster",
        "mainApplicationFile": "$jar_location"
    }
}""")
spark_json = spark_json_template.substitute({'jar_location': jar_location})
spark_job = json.loads(spark_json)

spark_resource = dsl.ResourceOp(
    name='spark-job',
    k8s_resource=spark_job,
    success_condition='status.state == Succeeded')
...
```
+1 on this issue. It would be great for the Spark Operator to find a new home here.
@charlesa101 that's JSON with no customization, and the configuration options are abundant. It's nice to be able to just use ResourceOp, though. Thanks for sharing.
Our platform provides both PySpark and Spark Operator support, and the overwhelming majority of users prefer PySpark. That's just one data point, though. IMO, a proper Python interface à la the training-operator SDK (or PySpark) would promote adoption.
@droctothorpe This is based on the CRD for the Spark Operator; it works the same way for PySpark. I'm curious to know more about how your PySpark operator implementation works. The configuration options are abundant, but I'm not sure there's a use case where you'd have to load up all the configs.
I agree with you that it would be great to eventually align the behavior of this operator with the training operators to make it easy to use, but I am not sure what you mean by an SDK in this context. Once you have the YAML and CRDs well defined, you can easily use them in your KFP as a component.
Here's the Python SDK for the training-operator. Basically, instead of writing YAML and using it in your KFP component, you can use Python to define and submit jobs directly.
https://github.com/kubeflow/training-operator/tree/master/sdk/python
Oh I see what you mean, thanks @terrytangyuan 👍
What should be the next steps? Do you have enough data points about Spark in Kubeflow?
Hi @mwielgus, can you please join the Kubeflow Community call next Tuesday, October 31st, at 8:00am PST? We can discuss the next steps and possibilities for moving this forward.
Also @thesuperzapper can share some details around using Spark with Kubeflow Notebooks 2.0 (e.g. Kubeflow Workspaces).
@andreyvelich Yes, I will be there.
We had a great discussion around adoption of Spark Operator during KubeCon with @mwielgus and @vara-bonthu. We might be able to find folks who can maintain this project moving forward. Let's have a chat tomorrow during Kubeflow Community Call (November 14th at 8:00am PST).
@jbottum We will provide more updates during the call and discuss the next steps.
Hi everyone, as we discussed on the latest Kubeflow Community call, we started this doc to donate the Spark Operator to Kubeflow: https://docs.google.com/document/d/1rCPEBQZPKnk0m7kcA5aHPf0fISl0MTAzsa4Wg3dfs5M/edit#heading=h.z7wqs2ebrwra Please take a look and provide your comments. It would be great if we could quickly discuss it during today's Kubeflow Community Call at 8am PST (cc @mwielgus @vara-bonthu).
cc @kubeflow/project-steering-group @kubeflow/wg-pipeline-leads @kubeflow/wg-training-leads @kubeflow/wg-notebooks-leads
I am looking forward to the adoption of Google's Spark K8s Operator, which will contribute to building a larger community and could potentially become the official Spark operator for Apache Spark.
As part of this effort, it is crucial to establish support for a single official Spark Kubernetes operator within the Apache Spark community. Collaboration with Apache Spark and gaining their endorsement is of utmost importance in this context.
This collaboration will serve to prevent the Apache Spark community from introducing an entirely new Spark Operator, akin to Apache Flink, which offers an official Flink Operator for Kubernetes. This approach helps avoid potential confusion within the community and ensures that users gravitate toward the approved Apache Spark Operator tool.
cc @yuchaoran2011
If you want the operator to become even semi-"official" it should be donated to the ASF instead. The ASF - in general - does not give any product the recognition of being the "official X for Y" or the "approved". (I say this as a member of the ASF but not with any special knowledge or any special powers, just from my knowledge of the policies - especially around trademarks). https://www.apache.org/foundation/marks/
While we're at it: the current name "Google's Spark K8s Operator" might already be a violation of the trademark policy. I suggest clarifying with the ASF before adopting the name. The usual "approved" naming scheme is "XYZ for Apache Foo"; in this case, "Google's Kubernetes operator for Apache Spark" (or similar). It needs to be made clear, in naming, documentation, and communication, that this is in no way officially affiliated with the ASF.
With my other hat - as a co-founder of Stackable I'd like to point to another operator for Apache Spark which already exists (built by us): https://github.com/stackabletech/spark-k8s-operator/ and which we recently compared to the Google one.
Happy to help with any ASF related communication.
I agree with @lfrancke on the point of donating to the ASF if you want to make it even semi official. In the Apache YuniKorn community we see a number of groups using the operator. Most of them have made changes to the operator to fix issues or integrate with newer versions of Apache Spark.
> If you want the operator to become even semi-"official" it should be donated to the ASF instead.
IMO, "official" should only be earned by merit and community adoption. Although donating to ASF helps the legal side, CNCF provides a good community around K8s and cloud-native technologies.
> With my other hat - as a co-founder of Stackable I'd like to point to another operator for Apache Spark which already exists (built by us): https://github.com/stackabletech/spark-k8s-operator/ and which we recently compared to the Google one.
Out of curiosity, why not join the effort of maintaining the existing Spark Operator that's already widely adopted?
I don't think this discussion is about trying to present the Google Spark operator as an "official" option (from a Spark or even a Kubeflow perspective). It's simply about giving the existing users and contributors of GoogleCloudPlatform/spark-on-k8s-operator a new home in the Kubeflow org, so they can continue working on it in a neutral place rather than continue struggling under its current home.
It's up to the maintainers of GoogleCloudPlatform/spark-on-k8s-operator to decide where they want to live, and in this specific case, it seems like they need a short-term solution to prevent those contributors/users from being stuck and unable to continue development.
Longer term, there is a strategic question about whether all three operators can be merged (including the Stackable one and the one that Apple was proposing to donate to the ASF), but I don't think that needs to block this donation, if all parties are willing.