[FEATURE] Print out the task shuffle write time statistics by table format in log

Open zuston opened this issue 1 year ago • 11 comments

Code of Conduct

Search before asking

  • [X] I have searched in the issues and found no similar issues.

Describe the feature

Currently, we can't identify a slow shuffle-server from the client-side log. This makes problems hard to diagnose, especially for servers with terrible GC.

I would like the per-server metrics for each Spark task to be shown in the client-side log in a table format like the following.

| | Min | 25th percentile | Median | 75th percentile | Max |
|---|---|---|---|---|---|
| Shuffle Write Duration | 10s | 15s | 25s | 30s | 4min |
| Shuffle Write Size / Records | 20M / 10000 | 20M / 10000 | 20M / 10000 | 20M / 10000 | 20M / 10000 |
| Shuffle server list | 10.23.35.19-21001 | 10.23.35.134-21001 | 10.23.35.14-21001 | 10.23.315.14-21001 | 10.123.35.14-21001 |
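
One way such a table could be produced is sketched below, assuming the client has already collected one sample (duration, bytes, records) per shuffle server for the task. `ShuffleWriteTable`, `ServerStat`, the nearest-rank percentile rule, and the sample values in `main` are all illustrative assumptions and not existing Uniffle code.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class ShuffleWriteTable {

    /** Hypothetical per-server sample for one task; not an existing Uniffle class. */
    static final class ServerStat {
        final String server;
        final long durationMs;
        final long bytes;
        final long records;

        ServerStat(String server, long durationMs, long bytes, long records) {
            this.server = server;
            this.durationMs = durationMs;
            this.bytes = bytes;
            this.records = records;
        }
    }

    /** Nearest-rank pick from a non-empty list sorted by duration; q is in [0, 1]. */
    static ServerStat pick(List<ServerStat> sorted, double q) {
        int idx = (int) Math.ceil(q * sorted.size()) - 1;
        return sorted.get(Math.max(0, Math.min(idx, sorted.size() - 1)));
    }

    /** Renders one column per quantile, each describing the server at that rank. */
    static String render(List<ServerStat> stats) {
        if (stats.isEmpty()) {
            return "(no shuffle write statistics)";
        }
        // Sort a copy so the caller's list is left untouched.
        List<ServerStat> sorted = new ArrayList<>(stats);
        sorted.sort(Comparator.comparingLong((ServerStat s) -> s.durationMs));

        String[] headers = {"Min", "25th percentile", "Median", "75th percentile", "Max"};
        double[] quantiles = {0.0, 0.25, 0.5, 0.75, 1.0};
        String label = "%-30s";
        String cell = "%-22s";

        StringBuilder head = new StringBuilder(String.format(label, ""));
        StringBuilder duration = new StringBuilder(String.format(label, "Shuffle Write Duration"));
        StringBuilder size = new StringBuilder(String.format(label, "Shuffle Write Size / Records"));
        StringBuilder servers = new StringBuilder(String.format(label, "Shuffle server list"));

        for (int i = 0; i < quantiles.length; i++) {
            ServerStat s = pick(sorted, quantiles[i]);
            head.append(String.format(cell, headers[i]));
            duration.append(String.format(cell, s.durationMs + "ms"));
            size.append(String.format(cell, s.bytes + " B / " + s.records));
            servers.append(String.format(cell, s.server));
        }
        return String.join("\n", head, duration, size, servers);
    }

    public static void main(String[] args) {
        // Made-up sample values, for illustration only.
        List<ServerStat> stats = new ArrayList<>();
        stats.add(new ServerStat("10.0.0.1-21001", 10_000L, 20L << 20, 10_000L));
        stats.add(new ServerStat("10.0.0.2-21001", 240_000L, 20L << 20, 10_000L));
        stats.add(new ServerStat("10.0.0.3-21001", 25_000L, 20L << 20, 10_000L));
        System.out.println(render(stats));
    }
}
```

Each column shows the server that sits at that rank when servers are ordered by write duration, so the "Max" column points directly at the slowest shuffle-server for the task.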

Motivation

No response

Describe the solution

No response

Additional context

No response

Are you willing to submit PR?

  • [ ] Yes I am willing to submit a PR!

zuston avatar Jan 09 '24 03:01 zuston

Feel free to pick this up.

zuston avatar Jan 09 '24 03:01 zuston

Are you interested in this? @myandpr

zuston avatar Jan 09 '24 03:01 zuston

> Are you interested in this? @myandpr

Fine, please assign it to me, thanks @zuston!

myandpr avatar Jan 09 '24 03:01 myandpr

Should we use Spark's metrics instead of log?

jerqi avatar Jan 09 '24 03:01 jerqi

> Should we use Spark's metrics instead of log?

This may also be valid for MR and Tez. BTW, task metrics for a slow shuffle-server don't exist. I haven't seen such a metric.

zuston avatar Jan 09 '24 06:01 zuston

> > Should we use Spark's metrics instead of log?
>
> This may also be valid for MR and Tez. BTW, task metrics for a slow shuffle-server don't exist. I haven't seen such a metric.

This should be a metric instead of a log. The log seems weird.

jerqi avatar Jan 09 '24 06:01 jerqi

So how do we meet the requirement if we use metrics? @jerqi Can you share some ideas if you'd like?

zuston avatar Jan 09 '24 07:01 zuston

> So how do we meet the requirement if we use metrics? @jerqi Can you share some ideas if you'd like?

Spark's metrics system allows users to add extra metrics. You can refer to https://simhadri-g.medium.com/custom-metrics-source-in-apache-spark-ca30a3b362dd

jerqi avatar Jan 09 '24 08:01 jerqi

> > So how do we meet the requirement if we use metrics? @jerqi Can you share some ideas if you'd like?
>
> Spark's metrics system allows users to add extra metrics. You can refer to https://simhadri-g.medium.com/custom-metrics-source-in-apache-spark-ca30a3b362dd

Looks good. Ping @myandpr: if you have any ideas, feel free to discuss.

zuston avatar Jan 09 '24 09:01 zuston

> > > So how do we meet the requirement if we use metrics? @jerqi Can you share some ideas if you'd like?
> >
> > Spark's metrics system allows users to add extra metrics. You can refer to https://simhadri-g.medium.com/custom-metrics-source-in-apache-spark-ca30a3b362dd
>
> Looks good. Ping @myandpr: if you have any ideas, feel free to discuss.

Fine, I think I understand this feature. Let me look into how the Spark metrics system works.

myandpr avatar Jan 10 '24 03:01 myandpr

> > > > So how do we meet the requirement if we use metrics? @jerqi Can you share some ideas if you'd like?
> > >
> > > Spark's metrics system allows users to add extra metrics. You can refer to https://simhadri-g.medium.com/custom-metrics-source-in-apache-spark-ca30a3b362dd
> >
> > Looks good. Ping @myandpr: if you have any ideas, feel free to discuss.
>
> Fine, I think I understand this feature. Let me look into how the Spark metrics system works.

Maybe we could introduce an extra tab in the Spark UI (runtime + history server) to show more information about shuffle-servers, like Kyuubi did. https://github.com/apache/kyuubi/tree/master/externals/kyuubi-spark-sql-engine/src/main/scala/org/apache/spark/ui

zuston avatar Feb 05 '24 06:02 zuston