[FEA] Make nightly build and test scripts easy to use in non-Jenkins environments
Is your feature request related to a problem? Please describe. Make existing nightly scripts like https://github.com/NVIDIA/spark-rapids/blob/branch-22.10/jenkins/spark-nightly-build.sh and https://github.com/NVIDIA/spark-rapids/blob/branch-22.10/jenkins/spark-tests.sh easy to run outside of Jenkins,
so developers can simply try the whole process locally or in other environments.
@NvTimLiu to help investigate whether there is a good solution, thanks~
There are some variables that need to be initialized; I am checking.
To run spark-nightly-build.sh, you need to install the build environment on your host as described in:
https://github.com/NVIDIA/spark-rapids/blob/branch-22.10/jenkins/Dockerfile-blossom.ubuntu
To run spark-tests.sh, you need to install the build environment on your host as described in (similar Dockerfiles exist for CentOS/Rocky):
https://github.com/NVIDIA/spark-rapids/blob/branch-22.10/jenkins/Dockerfile-blossom.integration.ubuntu
I mean not building the jars ahead of time with spark-nightly-build.sh.
The best practice is to build the docker images from those Dockerfiles, then run the build/test inside the docker images.
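For example, something along these lines (the image tag is just a placeholder, and any build args the Dockerfile requires, e.g. the CUDA version, are omitted here):

```bash
# Rough sketch: build a local image from the integration Dockerfile and start a
# container with the repo mounted; tag name and build args are illustrative only.
docker build -f jenkins/Dockerfile-blossom.integration.ubuntu -t spark-rapids-ci:local .
docker run --gpus all -it --rm \
  -v "$(pwd)":/workspace -w /workspace \
  spark-rapids-ci:local bash
```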
I could SUCCESSFULLY run both the nightly scripts locally.
https://github.com/NVIDIA/spark-rapids/blob/branch-22.10/jenkins/spark-nightly-build.sh https://github.com/NVIDIA/spark-rapids/blob/branch-22.10/jenkins/spark-tests.sh
To run spark-tests.sh, you need to build the jars with spark-nightly-build.sh first; that gives you the test SNAPSHOT jars locally. The best practice is to run the spark-nightly-build.sh script in the test docker container (e.g. one built from jenkins/Dockerfile-blossom.integration.ubuntu), then run spark-tests.sh in the same container.
For spark-nightly-build.sh, you need to set SKIP_DEPLOY=true to avoid pushing the jars to the Maven repo.
For spark-tests.sh, you need to change it to download the Spark binary from the public Spark archive page:
https://github.com/NVIDIA/spark-rapids/blob/branch-22.10/jenkins/spark-tests.sh#L99-L100
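Roughly, the local flow looks like the sketch below; the Apache archive URL is an example, and the exact variables/lines to edit in spark-tests.sh are the ones linked above.

```bash
# Build the SNAPSHOT jars locally; SKIP_DEPLOY=true prevents pushing them to the Maven repo.
SKIP_DEPLOY=true ./jenkins/spark-nightly-build.sh

# Edit jenkins/spark-tests.sh (around the lines linked above) so the Spark binary
# is downloaded from the public archive instead of the internal mirror, e.g.
#   https://archive.apache.org/dist/spark/spark-3.3.0/spark-3.3.0-bin-hadoop3.tgz
# then run the tests against the freshly built SNAPSHOT jars.
./jenkins/spark-tests.sh
```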
The build/test scripts take a long time to finish.
The build script spark-nightly-build.sh takes about 6 hours.
The test script spark-tests.sh takes about 2 hours per Spark shim, so running all the tests across 8 shims takes a whole day on my local host with one GPU card.
@gerashegalov I was told this issue/requirement originated from you, could you please help check whether the above conclusions are what we expected?
Hi @NvTimLiu ,
I have recently run into two scenarios:
- An invasive PR that I knew would pass premerge but would fail an unidentified set of integration tests in the nightly tests.
- A PR that failed a single premerge test on CI but passed when I reran it locally with the same command line. I could reproduce the failure reliably only after editing a bunch of lines in the script to run it in a local container.
For Scenario 1: knowing that running all nightly tests locally takes a long time, I'd like to be able to submit a CI job that runs all nightly tests before I merge the PR.
Now suppose some test fails in that CI job, which is like Scenario 2. Then I want to run this single test locally the same way as CI to reproduce the failure:
docker run --gpus all -it <ci-image> <repro-command-for-a-failing-test>
Bonus points if the above command would mount my local repo for quick iterations without requiring local commits.
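Something like this, with the image name and repro command left as placeholders:

```bash
# Hypothetical repro run: mount the local checkout into the CI image so edits are
# picked up immediately, without requiring local commits.
docker run --gpus all -it --rm \
  -v "$(pwd)":/workspace -w /workspace \
  <ci-image> \
  <repro-command-for-a-failing-test>
```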
Hi @gerashegalov
I got your scenarios.
For scenario 1, the nightly BUILD+TEST CI process is pretty complicated and takes a lot of time and CPU/GPU resources, due to the dependencies and downstream triggers: spark-rapids-JNI (1.5h) / DB build (2h) / nightly build (6h) / nightly tests [7 GPUs in parallel] (2h).
In short, we do not have enough resources to trigger extra nightly CI jobs besides the regular one.
For scenario 2, it is practical: we can load the FAILURE workspace into the premerge docker container to quickly restore the test environment. Let me try this first.
It's not ideal, but the resource problem can be mitigated by batching multiple PRs.
Let us say we have a GitHub PR label ready-for-nightly or something like that. Then a few scheduled daily/nightly CI jobs try to merge as many such PRs as possible into a test branch without a merge conflict, and execute a single nightly test run in CI against that cumulative test branch.
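A rough sketch of what such a scheduled job could do (using the gh CLI; the label, base branch, and branch names are only examples):

```bash
# Pool PRs labeled "ready-for-nightly" into a disposable test branch, skipping
# any PR whose merge conflicts, then run the nightly tests against that branch.
set -ex
git fetch origin branch-22.10
git checkout -B nightly-test-pool origin/branch-22.10
for pr in $(gh pr list --label ready-for-nightly --json number --jq '.[].number'); do
  git fetch origin "pull/${pr}/head:pr-${pr}"
  git merge --no-edit "pr-${pr}" || git merge --abort
done
# ...hand nightly-test-pool off to a single nightly CI test run.
```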
This is a practical approach: we aggregate/merge several ready-for-nightly PRs together, then run a single nightly test run in CI against that cumulative test branch.
BTW, does a single nightly test run mean we run tests against one Spark version, or all the Spark versions? @gerashegalov
> It's not ideal, but the resource problem can be mitigated by batching multiple PRs. Let us say we have a GitHub PR label ready-for-nightly or something like that. Then a few scheduled daily/nightly CI jobs try to merge as many such PRs as possible into a test branch without a merge conflict, and execute a single nightly test run in CI against that cumulative test branch.
We have the resource problem that we don't run full testing in the pre-merge pipeline, and the full test pipeline (nightly build through nightly IT) takes a long time. I don't think we should spend time adding another pipeline to support 'ready-for-nightly'. If we think more frequent full testing is required, we can simply run the nightly build & test more frequently, like every 12 hours or less, to detect problems.
I'd like to encourage developers to run the testing locally before merging the PR. For example, we know which version of Spark the pre-merge will run IT with, and developers should run the other versions themselves if they are not covered by pre-merge and we know they should be run. Shouldn't the problem be how easily a developer can run IT locally against some versions of Spark?
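For example, something roughly like this, assuming the jars have already been built and using integration_tests/run_pyspark_from_build.sh as the entry point (the Spark path and version are only illustrative):

```bash
# Hypothetical local IT run against one specific Spark version; additional env
# vars may be required depending on the setup.
export SPARK_HOME=/opt/spark-3.3.0-bin-hadoop3
export PATH="$SPARK_HOME/bin:$PATH"
./integration_tests/run_pyspark_from_build.sh
```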
@GaryShen2008 Running locally is a practice with major deficiencies:
- Practice shows there are always non-trivial discrepancies between the local environment and the CI in terms of hardware, OS, NVIDIA driver versions, and a bunch of environment variables that affect test execution. In my experience, there is always a non-negligible chance that a test suite passing locally will fail on CI.
- Running locally is also much more expensive, considering that it takes away from the developer's productivity.
I think we may just wait for more resources to become available. Temporary workarounds cannot completely fulfill the requirements here, and would cost us a lot of development and, mainly, maintenance effort.
> Let us say we have a GitHub PR label ready-for-nightly or something like that. Then a few scheduled daily/nightly CI jobs try to merge as many such PRs as possible into a test branch without a merge conflict, and execute a single nightly test run in CI against that cumulative test branch.
This sounds like it may introduce more chaos into the pre-merge phase. I am not sure whether any other projects have had success with it.
I am pretty sure it's not going to be a temporary workaround but a reasonable feature, because the struggle for resources is a constant you can bet on, in my view. So this is a future-proof, generic feature that covers a range of situations: n=1 for some rare PRs where you want/must test in isolation, or n>1 for the normal case where PRs can easily be pooled together for testing.
> This sounds like it may introduce more chaos into the pre-merge phase.
I don't see how you can reach this conclusion. How is premerge going to be affected by a single run (or a few runs, based on an acceptable and approved cadence) a day? That's the whole point of batching PRs.
> I am not sure whether any other projects have had success with it.
Unique problems require unique solutions. But I don't think it's that unique.
> I don't see how you can reach this conclusion. How is premerge going to be affected by a single run (or a few runs, based on an acceptable and approved cadence) a day? That's the whole point of batching PRs.
- If the tests pass, everything is fine. But if there are failures, people would spend much more time investigating errors because of the increased complexity of debugging a ref made of multiple merged commits. Even if there is actually no fault in their own PR, the failing CI run on the merged ref would block their merge.
- The current nightly CIs do the same job, IMO; I do not see any huge benefit in running it as pre-merge.
- Yes, that's a trade-off. You can control it by limiting the number of PRs.
- We are not talking about making it a mandatory premerge. My proposal is to provide it to developers as an option.
Our test code and scripts are full of places that make the local test environment different from CI. Recent examples:
- https://github.com/NVIDIA/spark-rapids/pull/6358#issuecomment-1220970470
- https://github.com/NVIDIA/spark-rapids/issues/5714
- https://github.com/NVIDIA/spark-rapids/issues/6444
After running a test locally with success, merging the PR, and moving on to new tasks, chances are high that the nightly tests will come back with a bunch of problems, incurring a high mental context-switch cost.
There are two solution avenues to this problem:
- Make the tests more deterministic and close the gap between CI and local. I think this is very difficult to achieve, given the vast differences in hardware and the time it would take to remove nondeterminism from the test code and to prevent it from being reintroduced.
- Accept the nondeterminism and allow developers to run tests on the CI before checking in, to minimize incurred tech debt. For complex PRs I'd rather wait a few days before merging them but have much higher confidence that the project is actually DONE.