
Spark operator: support scheduled log writes to S3

Open melin opened this issue 1 month ago • 6 comments

What feature would you like to be added?

In Spark on Kubernetes, the most challenging task is log collection. If the Spark operator could periodically write logs to object storage, it would be a great convenience for users, similar to what AWS EMR Serverless provides: https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/logging.html

  1. While the driver and executor pods are running, periodically (every 15 seconds) retrieve the full console logs and write them to S3 (see the sketch after this list).
  2. Before the driver and executor pods are terminated, retrieve the full console logs one last time and write them to S3.
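
A minimal sketch of what such an uploader could look like, in Python with boto3. All names here (bucket, key, log path, interval) are illustrative assumptions, not part of any existing operator API:

```python
import time

import boto3

# Hypothetical settings; none of these exist in the Spark operator today.
BUCKET = "my-spark-logs"                # assumed destination bucket
KEY = "logs/app-1234/driver.log"        # assumed per-pod object key
LOG_PATH = "/var/log/spark/driver.log"  # assumed local console log file
INTERVAL_SECONDS = 15                   # matches the EMR Serverless cadence above

s3 = boto3.client("s3")

def upload_full_log() -> None:
    """Overwrite the S3 object with the full console log collected so far."""
    with open(LOG_PATH, "rb") as f:
        s3.put_object(Bucket=BUCKET, Key=KEY, Body=f.read())

try:
    while True:
        upload_full_log()        # step 1: periodic upload while the pod runs
        time.sleep(INTERVAL_SECONDS)
finally:
    upload_full_log()            # step 2: final flush before termination
```

Since plain S3 has no append, this version simply rewrites the whole object each interval, which gives the simplest semantics at the cost of re-uploading the full log every time.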

Why is this needed?

No response

Describe the solution you would like

No response

Describe alternatives you have considered

No response

Additional context

No response

Love this feature?

Give it a 👍. We prioritize the features with the most 👍.

melin avatar Nov 10 '25 09:11 melin

Collecting pod logs with a sidecar, or just a JVM thread, and appending them to S3 files is an alternative for this scope. It has worked well in my production Kubernetes clusters.

Using the operator itself to collect and upload logs introduces a single point of failure.
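
For reference, one way to emulate append on plain S3 is to hold a multipart upload open and flush buffered log chunks as parts. A minimal sketch with assumed bucket/key names; note that S3 requires every part except the last to be at least 5 MiB:

```python
import boto3

BUCKET = "my-spark-logs"              # assumed bucket
KEY = "logs/app-1234/executor-1.log"  # assumed key
MIN_PART_SIZE = 5 * 1024 * 1024       # S3 minimum for all parts except the last

s3 = boto3.client("s3")
upload = s3.create_multipart_upload(Bucket=BUCKET, Key=KEY)
parts, buffer, part_number = [], b"", 1

def append(chunk: bytes) -> None:
    """Buffer log data; flush a part once the 5 MiB minimum is reached."""
    global buffer, part_number
    buffer += chunk
    if len(buffer) >= MIN_PART_SIZE:
        resp = s3.upload_part(Bucket=BUCKET, Key=KEY,
                              UploadId=upload["UploadId"],
                              PartNumber=part_number, Body=buffer)
        parts.append({"ETag": resp["ETag"], "PartNumber": part_number})
        part_number += 1
        buffer = b""

def close() -> None:
    """Flush the remaining (possibly small) buffer and complete the upload."""
    if buffer:
        resp = s3.upload_part(Bucket=BUCKET, Key=KEY,
                              UploadId=upload["UploadId"],
                              PartNumber=part_number, Body=buffer)
        parts.append({"ETag": resp["ETag"], "PartNumber": part_number})
    s3.complete_multipart_upload(Bucket=BUCKET, Key=KEY,
                                 UploadId=upload["UploadId"],
                                 MultipartUpload={"Parts": parts})
```

The trade-off is that the object only becomes readable once complete_multipart_upload runs, so the log is not visible while the pod is still running.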

shadowinlife avatar Nov 13 '25 02:11 shadowinlife

s3 files

S3 objects do not support appending. How is the append writing implemented, then? Or should we just rewrite the full log object at regular intervals (for example, every 10 seconds)?

melin avatar Nov 13 '25 03:11 melin

@ChenYi015 Has the community considered this feature? Hadoop YARN, for example, has built-in log aggregation.

melin avatar Nov 14 '25 03:11 melin

@ChenYi015 Has the community considered this feature? Hadoop YARN, for example, has built-in log aggregation.

@melin I have not considered this feature yet. In our case, we use SLS provided by Alibaba Cloud to collect Spark pod logs.

S3 objects do not support appending. How is the append writing implemented, then?

In our case, we use OSS, which is an S3-compatible object storage service that supports append writes, so it is possible to collect pod logs in near real time by adding a sidecar container. We actually provide a Ray history server this way.
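
For anyone curious, a minimal sketch of such a sidecar loop against OSS appendable objects, using the oss2 Python SDK (credentials, endpoint, bucket, and key below are placeholders):

```python
import time

import oss2  # Alibaba Cloud OSS SDK; appendable objects are an OSS feature, not plain S3

# Placeholder credentials/endpoint; a real sidecar would read these from a secret.
auth = oss2.Auth("<access-key-id>", "<access-key-secret>")
bucket = oss2.Bucket(auth, "https://oss-cn-hangzhou.aliyuncs.com", "my-spark-logs")

KEY = "logs/app-1234/driver.log"        # assumed object key
LOG_PATH = "/var/log/spark/driver.log"  # assumed local console log file

position = 0  # append offset; must start at 0 for a new appendable object
with open(LOG_PATH, "rb") as f:
    while True:
        chunk = f.read(1024 * 1024)
        if chunk:
            # append_object returns the offset to use for the next append
            position = bucket.append_object(KEY, position, chunk).next_position
        else:
            time.sleep(15)  # tail the file: wait for more log output
```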

@vara-bonthu @nabuskey Is there a better solution to collect Spark pod logs to S3?

ChenYi015 avatar Nov 14 '25 03:11 ChenYi015

My main consideration is an "out-of-the-box" feature that doesn't require overly complicated setup: just configure the access key/secret key, endpoint, and log path. Our data warehouse uses Spark for ETL and runs a large number of Spark jobs every day. The vast majority of job logs are never reviewed; only the failed jobs need their logs inspected. The cost of using object storage is lower.

melin avatar Nov 14 '25 03:11 melin

On GKE, I've been mounting the GCS bucket into the pod with gcsfuse.

Not great, but it works, at least.

BenCoughlan15 avatar Nov 21 '25 10:11 BenCoughlan15

@vara-bonthu @nabuskey Is there a better solution to collect Spark pod logs to S3?

If you don't want to use a log aggregation solution, we usually recommend running log collectors like Fluent Bit at the node or pod level and writing to S3. Logs are not appended to existing objects; new log objects are created instead.
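
That "new object per flush" behavior is also easy to reproduce without a collector, if someone wants a lightweight sidecar. A minimal sketch with a made-up key scheme:

```python
import time

import boto3

s3 = boto3.client("s3")
BUCKET = "my-spark-logs"                # assumed bucket
PREFIX = "logs/app-1234/driver"         # assumed key prefix
LOG_PATH = "/var/log/spark/driver.log"  # assumed local console log file

seq = 0
with open(LOG_PATH, "rb") as f:
    while True:
        chunk = f.read()  # everything written since the last flush
        if chunk:
            # Each flush becomes its own immutable object,
            # e.g. logs/app-1234/driver/part-00000.log
            s3.put_object(Bucket=BUCKET,
                          Key=f"{PREFIX}/part-{seq:05d}.log",
                          Body=chunk)
            seq += 1
        time.sleep(15)
```

Readers reconstruct the log by listing the prefix and concatenating the keys in order; this avoids both append semantics and the multipart 5 MiB minimum, at the cost of many small objects.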

nabuskey avatar Dec 11 '25 16:12 nabuskey