
[SPARK-48751][INFRA][PYTHON][TESTS] Re-balance `pyspark-pandas-connect` tests on GA

Open panbingkun opened this issue 1 year ago • 3 comments

What changes were proposed in this pull request?

This PR re-balances the pyspark-pandas-connect tests on GitHub Actions (GA).

Why are the changes needed?

To bring the execution time of the pyspark-pandas-connect-part[0-3] test jobs to roughly the same level. This avoids a long tail in which one or two slow parts drive up the overall GA execution time.

Here are some currently observed examples:

  • https://github.com/apache/spark/pull/47135/checks?check_run_id=26784966983

    Most of them take around 1 hour, but part2 took 1h 49m and part3 took 2h 16m.

  • https://github.com/panbingkun/spark/actions/runs/9693237300

    Most of them take around 1 hour, but part2 took 1h 47m and part3 took 2h 20m.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

By manually observing the execution time of pyspark-pandas-connect-part[0-3].

Was this patch authored or co-authored using generative AI tooling?

No.

panbingkun • Jun 28 '24 02:06

Use the following steps to re-balance:

  • Download the logs from GA, extract the execution time of each UT, and calculate the total execution time of each part*, e.g. for pyspark-pandas-connect-part0:
2024-06-28T05:27:37.6183255Z Finished test(python3.11): pyspark.pandas.tests.connect.computation.test_parity_any_all (36s)
2024-06-28T05:29:27.2891428Z Finished test(python3.11): pyspark.pandas.tests.connect.computation.test_parity_apply_func (109s)
2024-06-28T05:29:51.3194664Z Finished test(python3.11): pyspark.pandas.tests.connect.computation.test_parity_binary_ops (24s)
2024-06-28T05:32:15.4889334Z Finished test(python3.11): pyspark.pandas.tests.connect.computation.test_parity_combine (144s)
2024-06-28T05:33:37.6599457Z Finished test(python3.11): pyspark.pandas.tests.connect.computation.test_parity_compute (82s)
2024-06-28T05:36:26.0724168Z Finished test(python3.11): pyspark.pandas.tests.connect.computation.test_parity_corr (168s)
2024-06-28T05:39:31.0848420Z Finished test(python3.11): pyspark.pandas.tests.connect.computation.test_parity_corrwith (185s)
2024-06-28T05:40:06.1762415Z Finished test(python3.11): pyspark.pandas.tests.connect.computation.test_parity_cov (35s)
2024-06-28T05:40:54.8319822Z Finished test(python3.11): pyspark.pandas.tests.connect.computation.test_parity_cumulative (48s)
2024-06-28T05:41:36.4479258Z Finished test(python3.11): pyspark.pandas.tests.connect.computation.test_parity_describe (41s)
2024-06-28T05:41:51.9689250Z Finished test(python3.11): pyspark.pandas.tests.connect.computation.test_parity_eval (15s)
2024-06-28T05:42:37.0018704Z Finished test(python3.11): pyspark.pandas.tests.connect.computation.test_parity_melt (45s)
...
  • Compute the total cost of each part and its difference from the average:
Job                             Total cost    Diff from average
pyspark-pandas-connect-part0    4075 s        4075 - 5187.5 = -1112.5 s
pyspark-pandas-connect-part1    4087 s        4087 - 5187.5 = -1100.5 s
pyspark-pandas-connect-part2    5371 s        5371 - 5187.5 = +183.5 s
pyspark-pandas-connect-part3    7217 s        7217 - 5187.5 = +2029.5 s
Avg cost                        5187.5 s
  • Based on the Diff above, move UTs from the high-cost part* to the low-cost part* where possible, until the parts are balanced.
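The three steps above can be sketched in Python. This is only a minimal sketch, not the script used for this PR: the helper names (parse_durations, diffs_from_average, greedy_rebalance) are hypothetical, the regular expression assumes the "Finished test" log format shown above, and the greedy "longest test first" assignment is a swapped-in heuristic standing in for the manual moves described in the last step.

```python
import re

# Assumed log format, taken from the lines quoted above:
# "<timestamp> Finished test(python3.11): <test module> (<seconds>s)"
FINISHED = re.compile(r"Finished test\([^)]*\):\s+(\S+)\s+\((\d+)s\)")


def parse_durations(log_text):
    """Return {test_module: seconds} for every 'Finished test' line."""
    return {m.group(1): int(m.group(2)) for m in FINISHED.finditer(log_text)}


def diffs_from_average(totals):
    """Return (average, {part: total - average}) for per-part totals in seconds."""
    average = sum(totals.values()) / len(totals)
    return average, {part: total - average for part, total in totals.items()}


def greedy_rebalance(durations, n_parts):
    """Hypothetical heuristic: longest tests first, each to the lightest part."""
    loads = [0] * n_parts
    parts = [[] for _ in range(n_parts)]
    for name, secs in sorted(durations.items(), key=lambda kv: -kv[1]):
        lightest = loads.index(min(loads))
        parts[lightest].append(name)
        loads[lightest] += secs
    return parts, loads


# Step 1: extract per-test durations from a few of the log lines quoted above.
sample_log = """\
2024-06-28T05:27:37.6183255Z Finished test(python3.11): pyspark.pandas.tests.connect.computation.test_parity_any_all (36s)
2024-06-28T05:29:27.2891428Z Finished test(python3.11): pyspark.pandas.tests.connect.computation.test_parity_apply_func (109s)
2024-06-28T05:29:51.3194664Z Finished test(python3.11): pyspark.pandas.tests.connect.computation.test_parity_binary_ops (24s)
"""
durations = parse_durations(sample_log)
print(sum(durations.values()))  # 169 (seconds for these three tests only)

# Step 2: per-part totals reported in this thread, and their diff from the average.
totals = {
    "pyspark-pandas-connect-part0": 4075,
    "pyspark-pandas-connect-part1": 4087,
    "pyspark-pandas-connect-part2": 5371,
    "pyspark-pandas-connect-part3": 7217,
}
average, diffs = diffs_from_average(totals)
print(average)  # 5187.5
for part, diff in diffs.items():
    print(f"{part}: {diff:+.1f} s")
```

Given the full per-test durations of all four parts, greedy_rebalance(durations, 4) would propose a starting assignment; in practice tests are moved in groups, so its output is a hint rather than a final layout.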

panbingkun • Jun 28 '24 11:06

After this pr:

  • First run: https://github.com/panbingkun/spark/actions/runs/9718805972

    part0: 1h 44m, part1: 1h 40m, part2: 1h 45m, part3: 1h 44m

  • Second run: https://github.com/panbingkun/spark/actions/runs/9721535055

    part0: 1h 45m, part1: 1h 43m, part2: 1h 49m, part3: 1h 45m
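As a quick check on how even the parts are after the change, the reported times can be converted to minutes and compared. This is a small illustrative sketch; the to_minutes helper is hypothetical and only handles durations written like "1h 44m".

```python
import re


def to_minutes(text):
    """Convert a duration such as '1h 44m' to whole minutes."""
    m = re.fullmatch(r"(?:(\d+)h)?\s*(?:(\d+)m)?", text.strip())
    hours, minutes = (int(g) if g else 0 for g in m.groups())
    return hours * 60 + minutes


# Times for part0..part3 as reported for the two runs after this PR.
runs = {
    "first": ["1h 44m", "1h 40m", "1h 45m", "1h 44m"],
    "second": ["1h 45m", "1h 43m", "1h 49m", "1h 45m"],
}

spread = {}
for name, parts in runs.items():
    mins = [to_minutes(p) for p in parts]
    spread[name] = max(mins) - min(mins)  # gap between slowest and fastest part
    print(name, mins, "spread:", spread[name], "min")
# first [104, 100, 105, 104] spread: 5 min
# second [105, 103, 109, 105] spread: 6 min
```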

panbingkun • Jun 29 '24 04:06

cc @zhengruifeng and @itholic

HyukjinKwon • Jun 29 '24 10:06

Looks fine for now, but in the future we might need to split this into more parts instead of just re-balancing if the number of tests increases.

itholic • Jun 30 '24 22:06

Nope, actually splitting the build increases resource usage, so I asked to redistribute the existing test cases for now. We got a bit of a push from the ASF.

HyukjinKwon • Jun 30 '24 23:06

Merged to master.

HyukjinKwon • Jun 30 '24 23:06

Late LGTM

zhengruifeng • Jul 01 '24 00:07

> Nope, actually splitting the build increases resource usage, so I asked to redistribute the existing test cases for now. We got a bit of a push from the ASF.

I'm actually quite curious: what does "We got a bit of a push from the ASF" mean? Does the ASF require us to reduce our resource usage?

panbingkun • Jul 01 '24 00:07

we now have limited resources. See also https://issues.apache.org/jira/browse/SPARK-48094

HyukjinKwon • Jul 01 '24 00:07

> we now have limited resources. See also https://issues.apache.org/jira/browse/SPARK-48094

Okay, I see, thanks.

panbingkun • Jul 01 '24 01:07