[SPARK-49249][SPARK-49320] Add new tag-related APIs in Connect back to Spark Core
What changes were proposed in this pull request?
This PR adds several new tag-related APIs in Connect back to Spark Core. Following the isolation practice in the original Connect API, the newly introduced APIs also support isolation:
- interrupt{Tag,All,Operation} can only cancel jobs created by this Spark session.
- {add,remove}Tag and {get,clear}Tags only apply to jobs created by this Spark session.
Unlike the related APIs in SparkContext, all of the above APIs are blocking, which means that the caller thread is blocked while jobs are being cancelled.
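The isolation rule above can be illustrated with a small toy model. This is a sketch of the semantics only, not Spark's implementation: ToyScheduler, ToySession, and the "session-<id>-<tag>" naming scheme are all hypothetical names invented for this example. The idea is that each session qualifies user tags with its own identifier before handing them to the shared scheduler, so two sessions that use the same user-facing tag cannot cancel each other's jobs.

```python
import uuid


class ToyScheduler:
    """Hypothetical stand-in for the shared scheduler that sees every session's jobs."""

    def __init__(self):
        self.jobs = {}  # job id -> set of scheduler-level tags
        self._next_id = 0

    def submit(self, tags):
        job_id = self._next_id
        self._next_id += 1
        self.jobs[job_id] = set(tags)
        return job_id

    def cancel_with_tag(self, tag):
        # Cancel (here: simply remove) every job carrying the given tag.
        hit = sorted(j for j, tags in self.jobs.items() if tag in tags)
        for j in hit:
            del self.jobs[j]
        return hit


class ToySession:
    """Hypothetical session that namespaces its tags (assumed scheme, not Spark's)."""

    def __init__(self, scheduler):
        self.scheduler = scheduler
        self._session_tag = f"session-{uuid.uuid4().hex}"
        self._tags = {}  # user-facing tag -> session-qualified tag

    def add_tag(self, tag):
        self._tags[tag] = f"{self._session_tag}-{tag}"

    def remove_tag(self, tag):
        self._tags.pop(tag, None)

    def get_tags(self):
        return set(self._tags)

    def clear_tags(self):
        self._tags.clear()

    def run_job(self):
        # Every job carries the session tag plus all currently active qualified tags.
        return self.scheduler.submit({self._session_tag, *self._tags.values()})

    def interrupt_tag(self, tag):
        return self.scheduler.cancel_with_tag(f"{self._session_tag}-{tag}")

    def interrupt_all(self):
        return self.scheduler.cancel_with_tag(self._session_tag)


scheduler = ToyScheduler()
session_a, session_b = ToySession(scheduler), ToySession(scheduler)
session_a.add_tag("etl")
session_b.add_tag("etl")  # same user-facing tag, different session

job_a = session_a.run_job()
job_b = session_b.run_job()

# Only session A's job is cancelled, although both jobs were tagged "etl".
cancelled = session_a.interrupt_tag("etl")
remaining = sorted(scheduler.jobs)
```

In this model, interrupt_all works the same way by matching on the bare session tag, so it also leaves other sessions' jobs untouched. The model is synchronous and does not attempt to reproduce the blocking behaviour of the real APIs.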
Why are the changes needed?
To close the API gap between Connect and Core.
Does this PR introduce any user-facing change?
Yes. Core users can now use the new tag-related APIs on SparkSession.
How was this patch tested?
New test added.
Was this patch authored or co-authored using generative AI tooling?
No.
Could we file a JIRA for Python API set too? Just to make sure we don't miss it out
Done! https://issues.apache.org/jira/browse/SPARK-49337
@HyukjinKwon @hvanhovell This PR is now ready for review. Could you take a look? Thanks!
I feel like what you're doing here is similar to JobArtifactSet. It has to do with SparkContext, but we separated it out into JobArtifactSet with its own state so we can decouple Spark core from Spark SQL.
Yes exactly. Basically the equivalent of JobArtifactSet.withActiveJobArtifactState is SparkSession.withActive.
LGTM. I left a few minor comments. Let me know if you want to address them now or in a follow-up. Two follow-ups here: we need to add this to PySpark, and we need to homogenize this with the Connect implementation.
I'll address most comments in this PR. Currently, I am being distracted by something else, but will come back very soon.
Merging to master.
FYI, there are two open JIRA issues in the interrupt and cancellation area.
Ping once more, @xupefei and @hvanhovell . Could you fix the flakiness or disable it (if you are busy), please?
- https://github.com/apache/spark/actions/runs/11353570330/job/31654735430
SparkSessionJobTaggingAndCancellationSuite:
...
- Cancellation APIs in SparkSession are isolated *** FAILED ***
@xupefei mind taking a look please?
On it.
Trying out a fix at https://github.com/apache/spark/pull/48622.