initialization-actions

[rapids]split spark-rapids and rapids init scripts

Open nvliyuan opened this issue 2 years ago • 41 comments

Signed-off-by: liyuan [email protected]

Restructure the rapids init scripts: split spark-rapids and rapids into individual scripts.

nvliyuan avatar Sep 21 '22 08:09 nvliyuan

@abellina @tgravescs @medb @mengdong FYI, this PR uses a single init script (named spark-rapids.sh) instead of install_gpu_driver.sh + rapids.sh for Dataproc + Spark RAPIDS, as requested by our product team.

I am inviting you to help review since I know you may be impacted by this change.

viadea avatar Sep 21 '22 15:09 viadea

Please add more description as to what this really is and how user would now call it. Does this support MIG properly? I assume the install_gpu_driver stays where it is in case you aren't using Rapids?

tgravescs avatar Sep 21 '22 15:09 tgravescs

Please add more description as to what this really is and how user would now call it. Does this support MIG properly? I assume the install_gpu_driver stays where it is in case you aren't using Rapids?

Currently we just need to use a single spark-rapids/spark-rapids.sh as the init script. Basically, we combined the logic of the two scripts, install_gpu_driver.sh and rapids.sh. In this PR we have not touched the old ones, but eventually we need to remove the spark-rapids logic from the two old scripts (maybe in a different PR).

Note: the two old init scripts are still needed by other GPU-related projects such as Dask. This is the reason we want to decouple from Dask and other projects: it makes our future changes easier, since we no longer need to consider those projects.

Yes, this script supports MIG (though I have not tested it yet). @nvliyuan have you tested MIG using the new script?

viadea avatar Sep 21 '22 15:09 viadea
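
To make the "how would a user now call it" question concrete, here is a minimal sketch that only assembles (and does not run) a `gcloud dataproc clusters create` command with a single init action. The bucket path, cluster name, and helper name are hypothetical. The flags are the ones visible in the CI logs later in this thread, and any metadata keys the new script may expect are left out because the thread does not state them.

```python
from typing import List


def build_create_command(cluster_name: str, init_script_uri: str) -> List[str]:
    """Assemble a cluster-create command that lists one init action.

    The old flow needed two scripts (install_gpu_driver.sh + rapids.sh);
    with the combined script only a single URI is passed.
    """
    return [
        "gcloud", "dataproc", "clusters", "create", cluster_name,
        "--region=us-central1",
        "--image-version=2.0-ubuntu18",
        f"--initialization-actions={init_script_uri}",
        "--initialization-action-timeout=30m",
        "--master-accelerator=type=nvidia-tesla-v100",
        "--worker-accelerator=type=nvidia-tesla-v100",
    ]


# Hypothetical bucket and cluster name, for illustration only.
command = build_create_command(
    "my-rapids-cluster",
    "gs://my-bucket/spark-rapids/spark-rapids.sh",
)
print(" ".join(command))
```

Building the argument list instead of a shell string keeps the sketch easy to inspect and avoids quoting problems if it is later handed to `subprocess.run`.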

/gcbrun

jayadeep-jayaraman avatar Sep 23 '22 04:09 jayadeep-jayaraman

The tests are failing with the below error

2022-09-23T05:13:04.012864820Z Starting local Bazel server and connecting to it...
2022-09-23T05:13:14.378701499Z ERROR: Skipping ':test_spark_rapids': no such target '//:test_spark_rapids': target 'test_spark_rapids' not declared in package '' defined by /init-actions/BUILD
2022-09-23T05:13:14.387768537Z ERROR: no such target '//:test_spark_rapids': target 'test_spark_rapids' not declared in package '' defined by /init-actions/BUILD
2022-09-23T05:13:14.556798923Z INFO: Elapsed time: 18.992s
2022-09-23T05:13:14.560858649Z INFO: 0 processes.

I believe this is happening because the __init__.py file is missing in the folder. Please check.

jayadeep-jayaraman avatar Sep 23 '22 05:09 jayadeep-jayaraman

Yes this script supports MIG(though i have not tested it yet). @nvliyuan have you tested MIG using the new script?

there is an issue with MIG, working on it

nvliyuan avatar Sep 23 '22 06:09 nvliyuan

Yes this script supports MIG(though i have not tested it yet). @nvliyuan have you tested MIG using the new script?

there is an issue with MIG, working on it

just verified image

nvliyuan avatar Sep 23 '22 06:09 nvliyuan

The tests are failing with the same error. Looks like this file also needs to be updated https://github.com/GoogleCloudDataproc/initialization-actions/blob/master/BUILD

jayadeep-jayaraman avatar Sep 23 '22 08:09 jayadeep-jayaraman

The tests are failing with the same error. Looks like this file also needs to be updated https://github.com/GoogleCloudDataproc/initialization-actions/blob/master/BUILD

@jayadeep-jayaraman updated, could you help verify again?

nvliyuan avatar Sep 23 '22 09:09 nvliyuan

Looks like in our init action script there is special treatment when the folder name has dashes as documented here

This is also showing up in the error below where the results are different for gpu and spark-rapids

2022-09-23T09:55:15.107643872Z + for changed_dir in "${changed_dirs[@]}"
2022-09-23T09:55:15.107651477Z + local test_name=gpu
2022-09-23T09:55:15.107682483Z + [[ gpu == *\-* ]]
2022-09-23T09:55:15.107692627Z + local test_target=gpu:test_gpu
2022-09-23T09:55:15.107700453Z + TESTS_TO_RUN+=("${test_target}")
2022-09-23T09:55:15.107707766Z + for changed_dir in "${changed_dirs[@]}"
2022-09-23T09:55:15.107714512Z + local test_name=spark-rapids
2022-09-23T09:55:15.107721576Z + [[ spark-rapids == *\-* ]]
2022-09-23T09:55:15.108015425Z + local test_target=:test_spark_rapids
2022-09-23T09:55:15.108028550Z + TESTS_TO_RUN+=("${test_target}")
2022-09-23T09:55:15.108061537Z + echo 'Tests: gpu:test_gpu :test_spark_rapids'
2022-09-23T09:55:15.108070484Z + run_tests
2022-09-23T09:55:15.108077775Z + local -r max_parallel_tests=10
2022-09-23T09:55:15.108086241Z + bazel test --jobs=10 --local_test_jobs=10 --flaky_test_attempts=3 --action_env=INTERNAL_IP_SSH=true --test_output=errors --noshow_progress --noshow_loading_progress --test_arg=--image_version=2.0-ubuntu18 gpu:test_gpu :test_spark_rapids
2022-09-23T09:55:15.134654895Z Extracting Bazel installation...
2022-09-23T09:55:20.514324992Z Starting local Bazel server and connecting to it...
2022-09-23T09:55:30.876502149Z ERROR: no such target '//:test_spark_rapids': target 'test_spark_rapids' not declared in package '' defined by /init-actions/BUILD
2022-09-23T09:55:30.980045825Z INFO: Elapsed time: 15.678s
2022-09-23T09:55:30.982352741Z INFO: 0 processes.
2022-09-23T09:55:30.988402358Z ERROR: Couldn't start the build. Unable to run tests

Can you please change the folder name to remove the dashes/hyphens, or make the changes in line with how other projects with dashes and hyphens have been handled in the repo?

jayadeep-jayaraman avatar Sep 23 '22 11:09 jayadeep-jayaraman
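
The special treatment of dashes shown in the trace above can be sketched as follows. This is a Python paraphrase of what the xtrace output suggests, not the actual presubmit script: the function name is hypothetical, and the rule (a hyphenated directory maps to a root-package target with underscores, any other directory maps to a target inside its own package) is inferred only from the three directory names that appear in the logs of this thread.

```python
def derive_test_target(changed_dir: str) -> str:
    """Map a changed directory to the Bazel test target the presubmit would run."""
    test_name = changed_dir.rstrip("/")
    if "-" in test_name:
        # Hyphenated directories are looked up in the root BUILD file,
        # with hyphens replaced by underscores.
        return ":test_" + test_name.replace("-", "_")
    # Other directories are expected to declare the target in their own BUILD file.
    return f"{test_name}:test_{test_name}"


for directory in ("gpu", "spark-rapids", "sparkRapids"):
    print(directory, "->", derive_test_target(directory))
```

Under this reading, a directory named spark-rapids needs a `test_spark_rapids` target in the root BUILD file, while a directory without a dash needs a `test_<dir>` target in its own BUILD file.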

Can you pls change the folder name to remove the dashes/hyphens or make the changes inline with how other projects with dashes and hyphens have been handled in the repo.

@jayadeep-jayaraman updated to remove the dash in the folder name; could you help build again?

nvliyuan avatar Sep 27 '22 14:09 nvliyuan

/gcbrun

pulkit-jain-G avatar Sep 29 '22 06:09 pulkit-jain-G

Build is still failing -

2022-09-29T06:49:05.006507921Z + TESTS_TO_RUN+=("${test_target}")
2022-09-29T06:49:05.006512991Z + for changed_dir in "${changed_dirs[@]}"
2022-09-29T06:49:05.006517941Z + local test_name=sparkRapids
2022-09-29T06:49:05.006522901Z + [[ sparkRapids == *\-* ]]
2022-09-29T06:49:05.006555261Z + local test_target=sparkRapids:test_sparkRapids
2022-09-29T06:49:05.006563411Z + TESTS_TO_RUN+=("${test_target}")
2022-09-29T06:49:05.006568621Z + echo 'Tests: gpu:test_gpu sparkRapids:test_sparkRapids'
2022-09-29T06:49:05.006585691Z Changed directories: gpu/ sparkRapids/
2022-09-29T06:49:05.009508694Z Tests: gpu:test_gpu sparkRapids:test_sparkRapids
2022-09-29T06:49:05.009562594Z + run_tests
2022-09-29T06:49:05.009600034Z + local -r max_parallel_tests=10
2022-09-29T06:49:05.009609194Z + bazel test --jobs=10 --local_test_jobs=10 --flaky_test_attempts=3 --action_env=INTERNAL_IP_SSH=true --test_output=errors --noshow_progress --noshow_loading_progress --test_arg=--image_version=1.5-ubuntu18 gpu:test_gpu sparkRapids:test_sparkRapids
2022-09-29T06:49:05.090936105Z Extracting Bazel installation...
2022-09-29T06:49:08.923553506Z Starting local Bazel server and connecting to it...
2022-09-29T06:49:17.946652515Z ERROR: Skipping 'sparkRapids:test_sparkRapids': no such target '//sparkRapids:test_sparkRapids': target 'test_sparkRapids' not declared in package 'sparkRapids' (did you mean 'test_spark_rapids'?) defined by /init-actions/sparkRapids/BUILD
2022-09-29T06:49:17.952661462Z ERROR: no such target '//sparkRapids:test_sparkRapids': target 'test_sparkRapids' not declared in package 'sparkRapids' (did you mean 'test_spark_rapids'?) defined by /init-actions/sparkRapids/BUILD
2022-09-29T06:49:18.052650713Z INFO: Elapsed time: 12.780s
2022-09-29T06:49:18.056012756Z INFO: 0 processes.
2022-09-29T06:49:18.061411242Z ERROR: Couldn't start the build. Unable to run tests

pulkit-jain-G avatar Sep 29 '22 12:09 pulkit-jain-G

@pulkit-jain-G updated the test script names; could you help build again?

nvliyuan avatar Sep 30 '22 09:09 nvliyuan

/gcbrun

pulkit-jain-G avatar Sep 30 '22 09:09 pulkit-jain-G

@nvliyuan Build is failing.

2022-09-30T09:45:37.235009065Z + local test_target=sparkRapids:test_sparkRapids
2022-09-30T09:45:37.235015801Z + TESTS_TO_RUN+=("${test_target}")
2022-09-30T09:45:37.235022390Z + echo 'Tests: gpu:test_gpu sparkRapids:test_sparkRapids'
2022-09-30T09:45:37.235044571Z + run_tests
2022-09-30T09:45:37.235051716Z + local -r max_parallel_tests=10
2022-09-30T09:45:37.235063181Z + bazel test --jobs=10 --local_test_jobs=10 --flaky_test_attempts=3 --action_env=INTERNAL_IP_SSH=true --test_output=errors --noshow_progress --noshow_loading_progress --test_arg=--image_version=1.5-ubuntu18 gpu:test_gpu sparkRapids:test_sparkRapids
2022-09-30T09:45:37.235080367Z Changed directories: gpu/ sparkRapids/
2022-09-30T09:45:37.235087582Z Tests: gpu:test_gpu sparkRapids:test_sparkRapids
2022-09-30T09:45:37.289810876Z Extracting Bazel installation...
2022-09-30T09:45:43.124843234Z Starting local Bazel server and connecting to it...
2022-09-30T09:46:24.775800229Z ERROR: /init-actions/sparkRapids/BUILD:5:8: in deps attribute of py_test rule //sparkRapids:test_sparkRapids: rule '//sparkRapids:verify_rapids' does not exist
2022-09-30T09:46:24.775860600Z ERROR: /init-actions/sparkRapids/BUILD:5:8: Analysis of target '//sparkRapids:test_sparkRapids' failed
2022-09-30T09:46:24.897789776Z ERROR: Analysis of target '//sparkRapids:test_sparkRapids' failed; build aborted:
2022-09-30T09:46:25.195195850Z INFO: Elapsed time: 47.553s
2022-09-30T09:46:25.240254648Z INFO: 0 processes.
2022-09-30T09:46:25.251067718Z ERROR: Couldn't start the build. Unable to run tests
++ date --iso-8601=seconds
+ LOGS_SINCE_TIME=2022-09-30T09:46:33+00:00

pulkit-jain-G avatar Sep 30 '22 09:09 pulkit-jain-G

@pulkit-jain-G removed verify_rapids from the sparkRapids/BUILD script; could you help build again?

nvliyuan avatar Sep 30 '22 11:09 nvliyuan

/gcbrun

pulkit-jain-G avatar Sep 30 '22 13:09 pulkit-jain-G

@pulkit-jain-G it seems the presubmit-pr still fails; could you help share the log?

nvliyuan avatar Oct 04 '22 07:10 nvliyuan

@nvliyuan Please find the error logs below:

2022-10-04T06:16:01.010781053Z ----------------------------------------------------------------------
2022-10-04T06:16:01.010794473Z Ran 1 test in 1011.798s
2022-10-04T06:16:01.010807634Z
2022-10-04T06:16:01.010820800Z FAILED (failures=1)
2022-10-04T06:16:01.010834500Z ================================================================================
2022-10-04T06:16:01.010848261Z ==================== Test output for //gpu:test_gpu (shard 5 of 15):
2022-10-04T06:16:01.010862235Z Running tests under Python 3.8.10: /usr/bin/python3
2022-10-04T06:16:01.010876593Z [ RUN ] NvidiaGpuDriverTestCase.test_install_gpu_cuda_nvidia('STANDARD', ['m', 'w-0', 'w-1'], 'type=nvidia-tesla-v100', 'type=nvidia-tesla-v100', '11.0')
2022-10-04T06:16:01.010891005Z [ FAILED ] NvidiaGpuDriverTestCase.test_install_gpu_cuda_nvidia('STANDARD', ['m', 'w-0', 'w-1'], 'type=nvidia-tesla-v100', 'type=nvidia-tesla-v100', '11.0')
2022-10-04T06:16:01.010946289Z ======================================================================
2022-10-04T06:16:01.010965330Z FAIL: test_install_gpu_cuda_nvidia('STANDARD', ['m', 'w-0', 'w-1'], 'type=nvidia-tesla-v100', 'type=nvidia-tesla-v100', '11.0') (__main__.NvidiaGpuDriverTestCase)
2022-10-04T06:16:01.010979865Z test_install_gpu_cuda_nvidia('STANDARD', ['m', 'w-0', 'w-1'], 'type=nvidia-tesla-v100', 'type=nvidia-tesla-v100', '11.0') (__main__.NvidiaGpuDriverTestCase)
2022-10-04T06:16:01.010994737Z test_install_gpu_cuda_nvidia('STANDARD', ['m', 'w-0', 'w-1'], 'type=nvidia-tesla-v100', 'type=nvidia-tesla-v100', '11.0')
2022-10-04T06:16:01.011009469Z ----------------------------------------------------------------------
2022-10-04T06:16:01.011024017Z Traceback (most recent call last):
2022-10-04T06:16:01.011038723Z File "/home/ia-tests/.cache/bazel/_bazel_ia-tests/83b1ae36bb04ea5432b9efccee83c25f/execroot/main/bazel-out/k8-fastbuild/bin/gpu/test_gpu.runfiles/io_abseil_py/absl/testing/parameterized.py", line 265, in bound_param_test
2022-10-04T06:16:01.011053807Z test_method(self, *testcase_params)
2022-10-04T06:16:01.011068415Z File "/home/ia-tests/.cache/bazel/_bazel_ia-tests/83b1ae36bb04ea5432b9efccee83c25f/execroot/main/bazel-out/k8-fastbuild/bin/gpu/test_gpu.runfiles/main/gpu/test_gpu.py", line 120, in test_install_gpu_cuda_nvidia
2022-10-04T06:16:01.011082250Z self.createCluster(
2022-10-04T06:16:01.011095859Z File "/home/ia-tests/.cache/bazel/_bazel_ia-tests/83b1ae36bb04ea5432b9efccee83c25f/execroot/main/bazel-out/k8-fastbuild/bin/gpu/test_gpu.runfiles/main/integration_tests/dataproc_test_case.py", line 170, in createCluster
2022-10-04T06:16:01.011118744Z _, stdout, _ = self.assert_command(
2022-10-04T06:16:01.011134854Z File "/home/ia-tests/.cache/bazel/_bazel_ia-tests/83b1ae36bb04ea5432b9efccee83c25f/execroot/main/bazel-out/k8-fastbuild/bin/gpu/test_gpu.runfiles/main/integration_tests/dataproc_test_case.py", line 322, in assert_command
2022-10-04T06:16:01.011154233Z self.assertEqual(
2022-10-04T06:16:01.011169559Z AssertionError: 1 != 0 : Failed to execute command:
2022-10-04T06:16:01.011184417Z gcloud dataproc clusters create test-gpu-standard-1-5-20221004-060008-uy9c --num-masters=1 --num-workers=2 --image-version=1.5-rocky8 --initialization-actions='gs://dataproc-init-actions-test-cloud-dataproc-ci/20221004-060002-r5xl/gpu/install_gpu_driver.sh' --initialization-action-timeout=30m --metadata=gpu-driver-provider=NVIDIA,cuda-version=11.0 --master-accelerator=type=nvidia-tesla-v100 --worker-accelerator=type=nvidia-tesla-v100 --master-machine-type=n1-standard-2 --worker-machine-type=n1-standard-2 --master-boot-disk-size=50GB --worker-boot-disk-size=50GB --format=json --region=us-central1 --max-age=2h
2022-10-04T06:16:01.011199375Z STDOUT:
2022-10-04T06:16:01.011212255Z
2022-10-04T06:16:01.011225544Z STDERR:
2022-10-04T06:16:01.011239175Z Waiting on operation [projects/cloud-dataproc-ci/regions/us-central1/operations/dfb5e206-16b8-3846-be71-2b07aae9e609].
2022-10-04T06:16:01.011252905Z Waiting for cluster creation operation...
2022-10-04T06:16:01.011266811Z WARNING: For PD-Standard without local SSDs, we strongly recommend provisioning 1TB or larger to ensure consistently high I/O performance. See https://cloud.google.com/compute/docs/disks/performance for information on disk I/O performance.
2022-10-04T06:16:01.011290400Z ...................................................................................................................................done.
2022-10-04T06:16:01.011331085Z ERROR: (gcloud.dataproc.clusters.create) Operation [projects/cloud-dataproc-ci/regions/us-central1/operations/dfb5e206-16b8-3846-be71-2b07aae9e609] failed: Multiple Errors:
2022-10-04T06:16:01.011346614Z - Initialization action failed. Failed action 'gs://dataproc-init-actions-test-cloud-dataproc-ci/20221004-060002-r5xl/gpu/install_gpu_driver.sh', see output in: gs://dataproc-108de5de-43c2-4a4b-979a-adebc15a58a8-us-central1/google-cloud-dataproc-metainfo/6386322e-0a41-4ab2-8dae-a61301bca951/test-gpu-standard-1-5-20221004-060008-uy9c-m/dataproc-initialization-script-0_output
2022-10-04T06:16:01.011360661Z - Initialization action failed. Failed action 'gs://dataproc-init-actions-test-cloud-dataproc-ci/20221004-060002-r5xl/gpu/install_gpu_driver.sh', see output in: gs://dataproc-108de5de-43c2-4a4b-979a-adebc15a58a8-us-central1/google-cloud-dataproc-metainfo/6386322e-0a41-4ab2-8dae-a61301bca951/test-gpu-standard-1-5-20221004-060008-uy9c-w-0/dataproc-initialization-script-0_output
2022-10-04T06:16:01.011374745Z - Initialization action failed. Failed action 'gs://dataproc-init-actions-test-cloud-dataproc-ci/20221004-060002-r5xl/gpu/install_gpu_driver.sh', see output in: gs://dataproc-108de5de-43c2-4a4b-979a-adebc15a58a8-us-central1/google-cloud-dataproc-metainfo/6386322e-0a41-4ab2-8dae-a61301bca951/test-gpu-standard-1-5-20221004-060008-uy9c-w-1/dataproc-initialization-script-0_output.
2022-10-04T06:16:01.011388371Z
2022-10-04T06:16:01.011402321Z
2022-10-04T06:16:01.011419361Z ----------------------------------------------------------------------
2022-10-04T06:16:01.011434784Z Ran 1 test in 963.974s
2022-10-04T06:16:01.011448669Z
2022-10-04T06:16:01.011463174Z FAILED (failures=1)
2022-10-04T06:16:01.011477387Z ================================================================================

pulkit-jain-G avatar Oct 04 '22 08:10 pulkit-jain-G

I found the error below in the output file (gs://dataproc-108de5de-43c2-4a4b-979a-adebc15a58a8-us-central1/google-cloud-dataproc-metainfo/6386322e-0a41-4ab2-8dae-a61301bca951/test-gpu-standard-1-5-20221004-060008-uy9c-m/dataproc-initialization-script-0_output):

+ nvidia-smi -c EXCLUSIVE_PROCESS
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

pulkit-jain-G avatar Oct 04 '22 08:10 pulkit-jain-G

2022-10-04T06:16:01.011184417Z gcloud dataproc clusters create test-gpu-standard-1-5-20221004-060008-uy9c --num-masters=1 --num-workers=2 --image-version=1.5-rocky8 --initialization-actions='gs://dataproc-init-actions-test-cloud-dataproc-ci/20221004-060002-r5xl/gpu/install_gpu_driver.sh' --initialization-action-timeout=30m --metadata=gpu-driver-provider=NVIDIA,cuda-version=11.0 --master-accelerator=type=nvidia-tesla-v100 --worker-accelerator=type=nvidia-tesla-v100 --master-machine-type=n1-standard-2 --worker-machine-type=n1-standard-2 --master-boot-disk-size=50GB --worker-boot-disk-size=50GB --format=json --region=us-central1 --max-age=2h 2022-10-04T06:16:01.011199375Z STDOUT:

According to the log, it fails while creating a cluster with the rocky8 image; I am not sure whether that is related to this PR. I have skipped this image test as a workaround. Could you help try again?

nvliyuan avatar Oct 08 '22 09:10 nvliyuan
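
A minimal sketch of what skipping one image family in a test could look like. The helper `image_os`, the class name, and the hard-coded `IMAGE_VERSION` are hypothetical and stand in for however the real harness exposes the `--image_version` test argument; only the version strings (`1.5-rocky8`, `2.0-ubuntu18`) come from the logs in this thread.

```python
import re
import unittest


def image_os(image_version: str) -> str:
    """Extract the OS family from a string such as '1.5-rocky8' (gives 'rocky')."""
    match = re.fullmatch(r"\d+\.\d+-([a-z]+)\d+", image_version)
    if match is None:
        raise ValueError(f"unrecognized image version: {image_version!r}")
    return match.group(1)


class RockySkipExample(unittest.TestCase):
    # Hypothetical stand-in for the value passed via --test_arg=--image_version=...
    IMAGE_VERSION = "1.5-rocky8"

    def test_install_gpu(self):
        if image_os(self.IMAGE_VERSION) == "rocky":
            self.skipTest("rocky8 images are skipped as a workaround")
        # Cluster creation and verification would follow here.
```

A skip of this kind is reported as SKIPPED rather than FAILED, so it hides the rocky8 driver problem from the presubmit result without fixing it.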

@pulkit-jain-G could you help run again?

nvliyuan avatar Oct 11 '22 01:10 nvliyuan

/gcbrun

pulkit-jain-G avatar Oct 11 '22 03:10 pulkit-jain-G

@pulkit-jain-G could you help share the log?

nvliyuan avatar Oct 11 '22 06:10 nvliyuan

@nvliyuan

2022-10-11T04:35:30.368477920Z ================================================================================
2022-10-11T04:35:30.368484865Z ==================== Test output for //gpu:test_gpu (shard 3 of 15):
2022-10-11T04:35:30.368500121Z Running tests under Python 3.8.10: /usr/bin/python3
2022-10-11T04:35:30.368508065Z [ RUN ] NvidiaGpuDriverTestCase.test_install_gpu_cuda_nvidia('SINGLE', ['m'], 'type=nvidia-tesla-v100', None, '10.1')
2022-10-11T04:35:30.368518827Z W1011 04:19:40.725619 140262669547328 dataproc_test_case.py:204] Skipping cluster delete: name is None
2022-10-11T04:35:30.368527824Z [ SKIPPED ] NvidiaGpuDriverTestCase.test_install_gpu_cuda_nvidia('SINGLE', ['m'], 'type=nvidia-tesla-v100', None, '10.1')
2022-10-11T04:35:30.368556447Z [ RUN ] NvidiaGpuDriverTestCase.test_install_gpu_without_agent('STANDARD', ['m'], 'type=nvidia-tesla-v100', None, 'NVIDIA')
2022-10-11T04:35:30.368564965Z [ FAILED ] NvidiaGpuDriverTestCase.test_install_gpu_without_agent('STANDARD', ['m'], 'type=nvidia-tesla-v100', None, 'NVIDIA')
2022-10-11T04:35:30.368572324Z ======================================================================
2022-10-11T04:35:30.368580248Z FAIL: test_install_gpu_without_agent('STANDARD', ['m'], 'type=nvidia-tesla-v100', None, 'NVIDIA') (__main__.NvidiaGpuDriverTestCase)
2022-10-11T04:35:30.368590415Z test_install_gpu_without_agent('STANDARD', ['m'], 'type=nvidia-tesla-v100', None, 'NVIDIA') (__main__.NvidiaGpuDriverTestCase)
2022-10-11T04:35:30.368607182Z test_install_gpu_without_agent('STANDARD', ['m'], 'type=nvidia-tesla-v100', None, 'NVIDIA')
2022-10-11T04:35:30.368614401Z ----------------------------------------------------------------------
2022-10-11T04:35:30.368640354Z Traceback (most recent call last):
2022-10-11T04:35:30.368649183Z File "/home/ia-tests/.cache/bazel/_bazel_ia-tests/83b1ae36bb04ea5432b9efccee83c25f/execroot/main/bazel-out/k8-fastbuild/bin/gpu/test_gpu.runfiles/io_abseil_py/absl/testing/parameterized.py", line 265, in bound_param_test
2022-10-11T04:35:30.368656812Z test_method(self, *testcase_params)
2022-10-11T04:35:30.368677395Z File "/home/ia-tests/.cache/bazel/_bazel_ia-tests/83b1ae36bb04ea5432b9efccee83c25f/execroot/main/bazel-out/k8-fastbuild/bin/gpu/test_gpu.runfiles/main/gpu/test_gpu.py", line 68, in test_install_gpu_without_agent
2022-10-11T04:35:30.368686576Z self.createCluster(
2022-10-11T04:35:30.368694147Z File "/home/ia-tests/.cache/bazel/_bazel_ia-tests/83b1ae36bb04ea5432b9efccee83c25f/execroot/main/bazel-out/k8-fastbuild/bin/gpu/test_gpu.runfiles/main/integration_tests/dataproc_test_case.py", line 170, in createCluster
2022-10-11T04:35:30.368701017Z _, stdout, _ = self.assert_command(
2022-10-11T04:35:30.368739606Z File "/home/ia-tests/.cache/bazel/_bazel_ia-tests/83b1ae36bb04ea5432b9efccee83c25f/execroot/main/bazel-out/k8-fastbuild/bin/gpu/test_gpu.runfiles/main/integration_tests/dataproc_test_case.py", line 322, in assert_command
2022-10-11T04:35:30.368748367Z self.assertEqual(
2022-10-11T04:35:30.368756245Z AssertionError: 1 != 0 : Failed to execute command:
2022-10-11T04:35:30.368763995Z gcloud dataproc clusters create test-gpu-standard-1-5-20221011-041940-04nm --num-masters=1 --num-workers=2 --image-version=1.5-rocky8 --initialization-actions='gs://dataproc-init-actions-test-cloud-dataproc-ci/20221011-041934-5hg0/gpu/install_gpu_driver.sh' --initialization-action-timeout=30m --metadata=install-gpu-agent=false,gpu-driver-provider=NVIDIA --master-accelerator=type=nvidia-tesla-v100 --master-machine-type=n1-standard-2 --worker-machine-type=n1-standard-2 --master-boot-disk-size=50GB --worker-boot-disk-size=50GB --format=json --region=us-central1 --max-age=2h
2022-10-11T04:35:30.368771793Z STDOUT:
2022-10-11T04:35:30.368778314Z
2022-10-11T04:35:30.368785516Z STDERR:
2022-10-11T04:35:30.368818326Z Waiting on operation [projects/cloud-dataproc-ci/regions/us-central1/operations/17754bda-de58-3b61-b3f2-3fc7dbafcf60].
2022-10-11T04:35:30.368827995Z Waiting for cluster creation operation...
2022-10-11T04:35:30.368835744Z WARNING: For PD-Standard without local SSDs, we strongly recommend provisioning 1TB or larger to ensure consistently high I/O performance. See https://cloud.google.com/compute/docs/disks/performance for information on disk I/O performance.
2022-10-11T04:35:30.368862560Z ........................................................................................................................................................................................................done.
2022-10-11T04:35:30.368906772Z ERROR: (gcloud.dataproc.clusters.create) Operation [projects/cloud-dataproc-ci/regions/us-central1/operations/17754bda-de58-3b61-b3f2-3fc7dbafcf60] failed: Initialization action failed. Failed action 'gs://dataproc-init-actions-test-cloud-dataproc-ci/20221011-041934-5hg0/gpu/install_gpu_driver.sh', see output in: gs://dataproc-108de5de-43c2-4a4b-979a-adebc15a58a8-us-central1/google-cloud-dataproc-metainfo/5c4d9528-bdac-4c58-afb6-e9cf48477133/test-gpu-standard-1-5-20221011-041940-04nm-m/dataproc-initialization-script-0_output.
2022-10-11T04:35:30.368922043Z
2022-10-11T04:35:30.368929066Z
2022-10-11T04:35:30.368935827Z ----------------------------------------------------------------------
2022-10-11T04:35:30.368942592Z Ran 2 tests in 962.169s
2022-10-11T04:35:30.368951864Z
2022-10-11T04:35:30.368960628Z FAILED (failures=1, skipped=1)
2022-10-11T04:35:30.368989237Z ================================================================================
2022-10-11T04:49:28.508895047Z FAIL: //gpu:test_gpu (shard 2 of 15) (see /home/ia-tests/.cache/bazel/_bazel_ia-tests/83b1ae36bb04ea5432b9efccee83c25f/execroot/main/bazel-out/k8-fastbuild/testlogs/gpu/test_gpu/shard_2_of_15/test_attempts/attempt_2.log)
++ [[ 0 != 0 ]]
++ kubectl delete pods presubmit-1-5-rocky8-900db2f3-f0c1-4d47-936c-4da63844fe99

pulkit-jain-G avatar Oct 12 '22 03:10 pulkit-jain-G

2022-10-11T04:35:30.368484865Z ==================== Test output for //gpu:test_gpu (shard 3 of 15):

@pulkit-jain-G why does the presubmit run the test_gpu.py script? The test script should be sparkRapids/test_sparkRapids.py; how can I update it?

nvliyuan avatar Oct 12 '22 09:10 nvliyuan

2022-10-11T04:35:30.368484865Z ==================== Test output for //gpu:test_gpu (shard 3 of 15):

@pulkit-jain-G why the presubmit run test_gpu.py script? The test script should be sparkRapids/test_sparkRapids.py, how can I update it?

It seems that if I update the scripts under gpu/, the presubmit job will run test_gpu.py. I have just disabled all rocky8 image tests. Could you help build again?

nvliyuan avatar Oct 12 '22 09:10 nvliyuan
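
The observation above (touching anything under gpu/ pulls test_gpu.py into the presubmit run) matches the "Changed directories: gpu/ sparkRapids/" line in the earlier logs. A minimal sketch of that selection step, assuming the presubmit derives the directory list from the files changed in the PR; the function name and the file list are illustrative, and ignoring root-level files is an assumption rather than something the logs show.

```python
from typing import Iterable, List


def changed_top_level_dirs(changed_files: Iterable[str]) -> List[str]:
    """Return the sorted, de-duplicated top-level directories of the changed files."""
    dirs = set()
    for path in changed_files:
        head, sep, _ = path.partition("/")
        if sep:  # assumption: files at the repository root select no directory
            dirs.add(head + "/")
    return sorted(dirs)


# Illustrative file list: one edit under gpu/ is enough to select gpu/.
files = [
    "gpu/install_gpu_driver.sh",
    "sparkRapids/test_sparkRapids.py",
    "sparkRapids/BUILD",
    "BUILD",
]
print("Changed directories:", " ".join(changed_top_level_dirs(files)))
```

If the PR stopped modifying files under gpu/, only the sparkRapids tests would be selected under this reading.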

/gcbrun

pulkit-jain-G avatar Oct 12 '22 09:10 pulkit-jain-G

@pulkit-jain-G could you share the log?

nvliyuan avatar Oct 13 '22 01:10 nvliyuan