[rapids] removed spark tests, updated to a more recent rapids release
Tested with CUDA=11 and CUDA=12
I prefer this to #1218
/gcbrun
Should we increase the machine type from n1-standard-4 to n1-standard-16?
cuda11 has been manually tested with all versions. The dataproc 2.0 images all pass the automated tests and can be assumed to work with cuda12 as well. Trying cuda12 on 2.1 and 2.2 now.
/gcbrun
Tests are failing for:
- 2.1-debian11
- 2.1-rocky8
- 2.1-ubuntu20
- 2.2-debian12
- 2.2-rocky9
- 2.2-ubuntu22
/gcbrun
[edit: this was a misconfiguration in the systemd unit]
It looks like the dask infrastructure is out of date and I'll have to target 2023.12 instead.
root@cluster-1718310842-m:~# /opt/conda/miniconda3/envs/dask/bin/python /tmp/init/dask/verify_dask_standalone.py
/opt/conda/miniconda3/envs/dask/lib/python3.11/site-packages/distributed/client.py:1394: VersionMismatchWarning: Mismatched versions found
+-------------+----------------+----------------+---------+
| Package     | Client         | Scheduler      | Workers |
+-------------+----------------+----------------+---------+
| dask        | 2024.6.2       | 2023.12.1      | None    |
| distributed | 2024.6.2       | 2023.12.1      | None    |
| python      | 3.11.9.final.0 | 3.11.8.final.0 | None    |
| tornado     | 6.4.1          | 6.3.3          | None    |
+-------------+----------------+----------------+---------+
I also need to lower the Python ABI to 3.10.
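For reference, the comparison behind that warning can be sketched in a few lines. This is only an illustration: `find_version_mismatches` is a hypothetical helper, not something in this PR or in distributed, and the version dictionaries are copied by hand from the table above. On a live cluster, I believe `Client.get_versions(check=True)` does a similar check.

```python
# Hypothetical helper for illustration; not part of this PR or of distributed.
def find_version_mismatches(client, scheduler):
    """Return {package: (client_version, scheduler_version)} for packages that differ."""
    mismatches = {}
    for package in sorted(set(client) | set(scheduler)):
        client_version = client.get(package)
        scheduler_version = scheduler.get(package)
        if client_version != scheduler_version:
            mismatches[package] = (client_version, scheduler_version)
    return mismatches


# Versions copied from the VersionMismatchWarning table above.
client_versions = {
    "dask": "2024.6.2",
    "distributed": "2024.6.2",
    "python": "3.11.9.final.0",
    "tornado": "6.4.1",
}
scheduler_versions = {
    "dask": "2023.12.1",
    "distributed": "2023.12.1",
    "python": "3.11.8.final.0",
    "tornado": "6.3.3",
}

mismatches = find_version_mismatches(client_versions, scheduler_versions)
for package, (client_version, scheduler_version) in mismatches.items():
    print(f"{package}: client={client_version} scheduler={scheduler_version}")
```

Here every package differs, which fits the client environment having been built from a newer release than the running scheduler.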
/gcbrun
It looks like I need to exercise the single-node cluster use case. I haven't tried that before.
From the logs:
2024-08-11T02:55:21.807668750Z ==================== Test output for //dask:test_dask (shard 2 of 3):
2024-08-11T02:55:21.807748101Z Running tests under Python 3.8.10: /usr/bin/python3
2024-08-11T02:55:21.807759606Z [ RUN ] DaskTestCase.test_dask('STANDARD', ['m'], 'standalone')
2024-08-11T02:55:21.807768226Z [ FAILED ] DaskTestCase.test_dask('STANDARD', ['m'], 'standalone')
2024-08-11T02:55:21.807800998Z ======================================================================
2024-08-11T02:55:21.807807908Z FAIL: test_dask('STANDARD', ['m'], 'standalone') (__main__.DaskTestCase)
2024-08-11T02:55:21.807814792Z test_dask('STANDARD', ['m'], 'standalone') (__main__.DaskTestCase)
2024-08-11T02:55:21.807823484Z test_dask('STANDARD', ['m'], 'standalone')
2024-08-11T02:55:21.807831165Z ----------------------------------------------------------------------
2024-08-11T02:55:21.807837812Z Traceback (most recent call last):
2024-08-11T02:55:21.807845530Z File "/home/ia-tests/.cache/bazel/_bazel_ia-tests/83b1ae36bb04ea5432b9efccee83c25f/execroot/_main/bazel-out/k8-fastbuild/bin/dask/test_dask.runfiles/io_abseil_py/absl/testing/parameterized.py", line 265, in bound_param_test
2024-08-11T02:55:21.807854826Z test_method(self, *testcase_params)
2024-08-11T02:55:21.807862703Z File "/home/ia-tests/.cache/bazel/_bazel_ia-tests/83b1ae36bb04ea5432b9efccee83c25f/execroot/_main/bazel-out/k8-fastbuild/bin/dask/test_dask.runfiles/_main/dask/test_dask.py", line 54, in test_dask
2024-08-11T02:55:21.807869501Z self.verify_dask_standalone(name)
2024-08-11T02:55:21.807877527Z File "/home/ia-tests/.cache/bazel/_bazel_ia-tests/83b1ae36bb04ea5432b9efccee83c25f/execroot/_main/bazel-out/k8-fastbuild/bin/dask/test_dask.runfiles/_main/dask/test_dask.py", line 20, in verify_dask_standalone
2024-08-11T02:55:21.807884861Z self._run_dask_test_script(name, self.DASK_STANDALONE_TEST_SCRIPT)
2024-08-11T02:55:21.807891864Z File "/home/ia-tests/.cache/bazel/_bazel_ia-tests/83b1ae36bb04ea5432b9efccee83c25f/execroot/_main/bazel-out/k8-fastbuild/bin/dask/test_dask.runfiles/_main/dask/test_dask.py", line 28, in _run_dask_test_script
2024-08-11T02:55:21.807899162Z self.assert_instance_command(name, verify_cmd)
2024-08-11T02:55:21.807905902Z File "/home/ia-tests/.cache/bazel/_bazel_ia-tests/83b1ae36bb04ea5432b9efccee83c25f/execroot/_main/bazel-out/k8-fastbuild/bin/dask/test_dask.runfiles/_main/integration_tests/dataproc_test_case.py", line 285, in assert_instance_command
2024-08-11T02:55:21.807912649Z ret_code, stdout, stderr = self.assert_command(
2024-08-11T02:55:21.807938970Z File "/home/ia-tests/.cache/bazel/_bazel_ia-tests/83b1ae36bb04ea5432b9efccee83c25f/execroot/_main/bazel-out/k8-fastbuild/bin/dask/test_dask.runfiles/_main/integration_tests/dataproc_test_case.py", line 337, in assert_command
2024-08-11T02:55:21.807949407Z self.assertEqual(
2024-08-11T02:55:21.807956462Z AssertionError: 1 != 0 : Failed to execute command:
2024-08-11T02:55:21.807964553Z gcloud compute ssh test-dask-standard-2-0-20240811-024024-osze-m --zone=us-central1-c --command="/opt/conda/miniconda3/envs/dask/bin/python verify_dask_standalone.py"
2024-08-11T02:55:21.807971149Z STDOUT:
2024-08-11T02:55:21.807977593Z
2024-08-11T02:55:21.807984800Z STDERR:
2024-08-11T02:55:21.807991850Z ConnectionRefusedError: [Errno 111] Connection refused
2024-08-11T02:55:21.807997940Z
2024-08-11T02:55:21.808011590Z The above exception was the direct cause of the following exception:
2024-08-11T02:55:21.808018182Z
2024-08-11T02:55:21.808025342Z Traceback (most recent call last):
2024-08-11T02:55:21.808040488Z File "/opt/conda/miniconda3/envs/dask/lib/python3.10/site-packages/distributed/comm/core.py", line 342, in connect
2024-08-11T02:55:21.808048314Z comm = await wait_for(
2024-08-11T02:55:21.808055258Z File "/opt/conda/miniconda3/envs/dask/lib/python3.10/site-packages/distributed/utils.py", line 1961, in wait_for
2024-08-11T02:55:21.808061892Z return await asyncio.wait_for(fut, timeout)
2024-08-11T02:55:21.808069882Z File "/opt/conda/miniconda3/envs/dask/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
2024-08-11T02:55:21.808076654Z return fut.result()
2024-08-11T02:55:21.808094253Z File "/opt/conda/miniconda3/envs/dask/lib/python3.10/site-packages/distributed/comm/tcp.py", line 559, in connect
2024-08-11T02:55:21.808101301Z convert_stream_closed_error(self, e)
2024-08-11T02:55:21.808109339Z File "/opt/conda/miniconda3/envs/dask/lib/python3.10/site-packages/distributed/comm/tcp.py", line 140, in convert_stream_closed_error
2024-08-11T02:55:21.808116472Z raise CommClosedError(f"in {obj}: {exc.__class__.__name__}: {exc}") from exc
2024-08-11T02:55:21.808124371Z distributed.comm.core.CommClosedError: in <distributed.comm.tcp.TCPConnector object at 0x7f7b61da2ad0>: ConnectionRefusedError: [Errno 111] Connection refused
2024-08-11T02:55:21.808131015Z
2024-08-11T02:55:21.808138238Z The above exception was the direct cause of the following exception:
2024-08-11T02:55:21.808144972Z
2024-08-11T02:55:21.808152623Z Traceback (most recent call last):
2024-08-11T02:55:21.808160140Z File "/home/ia-tests/verify_dask_standalone.py", line 8, in <module>
2024-08-11T02:55:21.808167162Z client = Client("localhost:8786")
2024-08-11T02:55:21.808174066Z File "/opt/conda/miniconda3/envs/dask/lib/python3.10/site-packages/distributed/client.py", line 1018, in __init__
2024-08-11T02:55:21.808181030Z self.start(timeout=timeout)
2024-08-11T02:55:21.808189142Z File "/opt/conda/miniconda3/envs/dask/lib/python3.10/site-packages/distributed/client.py", line 1220, in start
2024-08-11T02:55:21.808196282Z sync(self.loop, self._start, **kwargs)
2024-08-11T02:55:21.808211894Z File "/opt/conda/miniconda3/envs/dask/lib/python3.10/site-packages/distributed/utils.py", line 434, in sync
2024-08-11T02:55:21.808219326Z raise error
2024-08-11T02:55:21.808226280Z File "/opt/conda/miniconda3/envs/dask/lib/python3.10/site-packages/distributed/utils.py", line 408, in f
2024-08-11T02:55:21.808233318Z result = yield future
2024-08-11T02:55:21.808240475Z File "/opt/conda/miniconda3/envs/dask/lib/python3.10/site-packages/tornado/gen.py", line 766, in run
2024-08-11T02:55:21.808247774Z value = future.result()
2024-08-11T02:55:21.808254804Z File "/opt/conda/miniconda3/envs/dask/lib/python3.10/site-packages/distributed/client.py", line 1299, in _start
2024-08-11T02:55:21.808262278Z await self._ensure_connected(timeout=timeout)
2024-08-11T02:55:21.808269083Z File "/opt/conda/miniconda3/envs/dask/lib/python3.10/site-packages/distributed/client.py", line 1361, in _ensure_connected
2024-08-11T02:55:21.808276435Z comm = await connect(
2024-08-11T02:55:21.808283313Z File "/opt/conda/miniconda3/envs/dask/lib/python3.10/site-packages/distributed/comm/core.py", line 368, in connect
2024-08-11T02:55:21.808290749Z raise OSError(
2024-08-11T02:55:21.808298079Z OSError: Timed out trying to connect to tcp://localhost:8786 after 30 s
2024-08-11T02:55:21.808304335Z
2024-08-11T02:55:21.808310463Z
2024-08-11T02:55:21.808316315Z ----------------------------------------------------------------------
2024-08-11T02:55:21.808322630Z Ran 1 test in 594.283s
2024-08-11T02:55:21.808328780Z
2024-08-11T02:55:21.808335329Z FAILED (failures=1)
2024-08-11T02:55:21.808341831Z ================================================================================
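The `Connection refused` on `localhost:8786` means nothing was listening on the scheduler port when the verify script ran. A quick way to tell "scheduler never started" apart from "scheduler started late" is to poll the port before creating the `Client`. A minimal sketch using only the standard library follows; `wait_for_port` and its defaults are my own assumptions and not part of the test harness.

```python
import socket
import time


def wait_for_port(host, port, timeout_s=60.0, interval_s=1.0):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # create_connection raises OSError (including ConnectionRefusedError)
            # when nothing is listening on the port yet.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval_s)


# Example usage (assumes the scheduler should be on the default port 8786):
#   if not wait_for_port("localhost", 8786, timeout_s=120.0):
#       raise SystemExit("dask scheduler is not listening on localhost:8786")
```

If the port never opens, the next place to look is the scheduler's systemd unit and its journal, not the client.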
Digging in, I see this:
root@cluster-1718310842-m:~# /opt/conda/miniconda3/envs/dask/bin/python /tmp/init/dask/verify_dask_standalone.py
/opt/conda/miniconda3/envs/dask/lib/python3.9/site-packages/distributed/client.py:1309: VersionMismatchWarning: Mismatched versions found
+-------------+----------------+----------------+---------+
| Package     | Client         | Scheduler      | Workers |
+-------------+----------------+----------------+---------+
| dask        | 2022.8.1       | 2022.9.2       | None    |
| distributed | 2022.8.1       | 2022.9.2       | None    |
| python      | 3.9.19.final.0 | 3.10.8.final.0 | None    |
+-------------+----------------+----------------+---------+
warnings.warn(version_module.VersionMismatchWarning(msg[0]["warning"]))
So there is a bug that I fixed in dask standalone but not in dask + rapids.
/gcbrun