initialization-actions

[rapids] removed spark tests, updated to a more recent rapids release

Open cjac opened this issue 1 year ago • 117 comments

Tested with CUDA=11 and CUDA=12

cjac avatar Aug 08 '24 16:08 cjac

I prefer this to #1218

cjac avatar Aug 08 '24 16:08 cjac

/gcbrun

cjac avatar Aug 09 '24 01:08 cjac

/gcbrun

cjac avatar Aug 09 '24 04:08 cjac

/gcbrun

cjac avatar Aug 09 '24 04:08 cjac

/gcbrun

prince-cs avatar Aug 09 '24 05:08 prince-cs

Should we increase the machine type from n1-standard-4 to n1-standard-16?

prince-cs avatar Aug 09 '24 05:08 prince-cs

cuda11 has been manually tested with all versions. The dataproc 2.0 images all pass the automated tests and can be assumed to work with cuda12 as well. Trying cuda12 on 2.1 and 2.2 now.

cjac avatar Aug 09 '24 05:08 cjac

/gcbrun

cjac avatar Aug 09 '24 06:08 cjac

/gcbrun

cjac avatar Aug 09 '24 15:08 cjac

Tests are failing for:

  • 2.1-debian11
  • 2.1-rocky8
  • 2.1-ubuntu20
  • 2.2-debian12
  • 2.2-rocky9
  • 2.2-ubuntu22

cjac avatar Aug 09 '24 16:08 cjac

/gcbrun

cjac avatar Aug 09 '24 22:08 cjac

/gcbrun

cjac avatar Aug 10 '24 00:08 cjac

[edit: this was a misconfiguration in the systemd unit]

It looks like the dask infrastructure is out of date and I'll have to target 2023.12 instead.

root@cluster-1718310842-m:~# /opt/conda/miniconda3/envs/dask/bin/python /tmp/init/dask/verify_dask_standalone.py 
/opt/conda/miniconda3/envs/dask/lib/python3.11/site-packages/distributed/client.py:1394: VersionMismatchWarning: Mismatched versions found

+-------------+----------------+----------------+---------+
| Package     | Client         | Scheduler      | Workers |
+-------------+----------------+----------------+---------+
| dask        | 2024.6.2       | 2023.12.1      | None    |
| distributed | 2024.6.2       | 2023.12.1      | None    |
| python      | 3.11.9.final.0 | 3.11.8.final.0 | None    |
| tornado     | 6.4.1          | 6.3.3          | None    |
+-------------+----------------+----------------+---------+
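
The mismatch table above can also be checked programmatically. This is a minimal sketch, assuming the table text has been captured from the warning as a string; `parse_version_table` and `mismatched` are hypothetical helper names for illustration, not part of dask or of this repository.

```python
# Table copied from the VersionMismatchWarning output above.
TABLE = """
+-------------+----------------+----------------+---------+
| Package     | Client         | Scheduler      | Workers |
+-------------+----------------+----------------+---------+
| dask        | 2024.6.2       | 2023.12.1      | None    |
| distributed | 2024.6.2       | 2023.12.1      | None    |
| python      | 3.11.9.final.0 | 3.11.8.final.0 | None    |
| tornado     | 6.4.1          | 6.3.3          | None    |
+-------------+----------------+----------------+---------+
"""


def parse_version_table(text):
    """Return {package: (client, scheduler, workers)} from the table text."""
    rows = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the +----+ border lines.
        if not line.startswith("|"):
            continue
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        # Skip the header row and anything that is not a four-column row.
        if len(cells) != 4 or cells[0] == "Package":
            continue
        rows[cells[0]] = tuple(cells[1:])
    return rows


def mismatched(rows):
    """Return the sorted package names whose client and scheduler versions differ."""
    return sorted(
        pkg for pkg, (client, scheduler, _workers) in rows.items()
        if client != scheduler
    )


if __name__ == "__main__":
    for pkg in mismatched(parse_version_table(TABLE)):
        client, scheduler, _ = parse_version_table(TABLE)[pkg]
        print(f"{pkg}: client={client} scheduler={scheduler}")
```

Every row in this particular table differs between client and scheduler, which suggests the client and the scheduler were running from different environments.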

cjac avatar Aug 10 '24 02:08 cjac

I also need to reduce the Python ABI to 3.10.

cjac avatar Aug 10 '24 02:08 cjac

/gcbrun

cjac avatar Aug 10 '24 04:08 cjac

/gcbrun

cjac avatar Aug 10 '24 20:08 cjac

/gcbrun

cjac avatar Aug 10 '24 23:08 cjac

/gcbrun

cjac avatar Aug 11 '24 00:08 cjac

/gcbrun

cjac avatar Aug 11 '24 01:08 cjac

/gcbrun

cjac avatar Aug 11 '24 01:08 cjac

/gcbrun

cjac avatar Aug 11 '24 01:08 cjac

/gcbrun

cjac avatar Aug 11 '24 02:08 cjac

/gcbrun

cjac avatar Aug 11 '24 02:08 cjac

It looks like I need to exercise the single-node cluster use case. I haven't tried that before.

From the logs:

2024-08-11T02:55:21.807668750Z ==================== Test output for //dask:test_dask (shard 2 of 3):
2024-08-11T02:55:21.807748101Z Running tests under Python 3.8.10: /usr/bin/python3
2024-08-11T02:55:21.807759606Z [ RUN      ] DaskTestCase.test_dask('STANDARD', ['m'], 'standalone')
2024-08-11T02:55:21.807768226Z [  FAILED  ] DaskTestCase.test_dask('STANDARD', ['m'], 'standalone')
2024-08-11T02:55:21.807800998Z ======================================================================
2024-08-11T02:55:21.807807908Z FAIL: test_dask('STANDARD', ['m'], 'standalone') (__main__.DaskTestCase)
2024-08-11T02:55:21.807814792Z test_dask('STANDARD', ['m'], 'standalone') (__main__.DaskTestCase)
2024-08-11T02:55:21.807823484Z test_dask('STANDARD', ['m'], 'standalone')
2024-08-11T02:55:21.807831165Z ----------------------------------------------------------------------
2024-08-11T02:55:21.807837812Z Traceback (most recent call last):
2024-08-11T02:55:21.807845530Z   File "/home/ia-tests/.cache/bazel/_bazel_ia-tests/83b1ae36bb04ea5432b9efccee83c25f/execroot/_main/bazel-out/k8-fastbuild/bin/dask/test_dask.runfiles/io_abseil_py/absl/testing/parameterized.py", line 265, in bound_param_test
2024-08-11T02:55:21.807854826Z     test_method(self, *testcase_params)
2024-08-11T02:55:21.807862703Z   File "/home/ia-tests/.cache/bazel/_bazel_ia-tests/83b1ae36bb04ea5432b9efccee83c25f/execroot/_main/bazel-out/k8-fastbuild/bin/dask/test_dask.runfiles/_main/dask/test_dask.py", line 54, in test_dask
2024-08-11T02:55:21.807869501Z     self.verify_dask_standalone(name)
2024-08-11T02:55:21.807877527Z   File "/home/ia-tests/.cache/bazel/_bazel_ia-tests/83b1ae36bb04ea5432b9efccee83c25f/execroot/_main/bazel-out/k8-fastbuild/bin/dask/test_dask.runfiles/_main/dask/test_dask.py", line 20, in verify_dask_standalone
2024-08-11T02:55:21.807884861Z     self._run_dask_test_script(name, self.DASK_STANDALONE_TEST_SCRIPT)
2024-08-11T02:55:21.807891864Z   File "/home/ia-tests/.cache/bazel/_bazel_ia-tests/83b1ae36bb04ea5432b9efccee83c25f/execroot/_main/bazel-out/k8-fastbuild/bin/dask/test_dask.runfiles/_main/dask/test_dask.py", line 28, in _run_dask_test_script
2024-08-11T02:55:21.807899162Z     self.assert_instance_command(name, verify_cmd)
2024-08-11T02:55:21.807905902Z   File "/home/ia-tests/.cache/bazel/_bazel_ia-tests/83b1ae36bb04ea5432b9efccee83c25f/execroot/_main/bazel-out/k8-fastbuild/bin/dask/test_dask.runfiles/_main/integration_tests/dataproc_test_case.py", line 285, in assert_instance_command
2024-08-11T02:55:21.807912649Z     ret_code, stdout, stderr = self.assert_command(
2024-08-11T02:55:21.807938970Z   File "/home/ia-tests/.cache/bazel/_bazel_ia-tests/83b1ae36bb04ea5432b9efccee83c25f/execroot/_main/bazel-out/k8-fastbuild/bin/dask/test_dask.runfiles/_main/integration_tests/dataproc_test_case.py", line 337, in assert_command
2024-08-11T02:55:21.807949407Z     self.assertEqual(
2024-08-11T02:55:21.807956462Z AssertionError: 1 != 0 : Failed to execute command:
2024-08-11T02:55:21.807964553Z gcloud compute ssh test-dask-standard-2-0-20240811-024024-osze-m --zone=us-central1-c --command="/opt/conda/miniconda3/envs/dask/bin/python verify_dask_standalone.py"
2024-08-11T02:55:21.807971149Z STDOUT:
2024-08-11T02:55:21.807977593Z 
2024-08-11T02:55:21.807984800Z STDERR:
2024-08-11T02:55:21.807991850Z ConnectionRefusedError: [Errno 111] Connection refused
2024-08-11T02:55:21.807997940Z 
2024-08-11T02:55:21.808011590Z The above exception was the direct cause of the following exception:
2024-08-11T02:55:21.808018182Z 
2024-08-11T02:55:21.808025342Z Traceback (most recent call last):
2024-08-11T02:55:21.808040488Z   File "/opt/conda/miniconda3/envs/dask/lib/python3.10/site-packages/distributed/comm/core.py", line 342, in connect
2024-08-11T02:55:21.808048314Z     comm = await wait_for(
2024-08-11T02:55:21.808055258Z   File "/opt/conda/miniconda3/envs/dask/lib/python3.10/site-packages/distributed/utils.py", line 1961, in wait_for
2024-08-11T02:55:21.808061892Z     return await asyncio.wait_for(fut, timeout)
2024-08-11T02:55:21.808069882Z   File "/opt/conda/miniconda3/envs/dask/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
2024-08-11T02:55:21.808076654Z     return fut.result()
2024-08-11T02:55:21.808094253Z   File "/opt/conda/miniconda3/envs/dask/lib/python3.10/site-packages/distributed/comm/tcp.py", line 559, in connect
2024-08-11T02:55:21.808101301Z     convert_stream_closed_error(self, e)
2024-08-11T02:55:21.808109339Z   File "/opt/conda/miniconda3/envs/dask/lib/python3.10/site-packages/distributed/comm/tcp.py", line 140, in convert_stream_closed_error
2024-08-11T02:55:21.808116472Z     raise CommClosedError(f"in {obj}: {exc.__class__.__name__}: {exc}") from exc
2024-08-11T02:55:21.808124371Z distributed.comm.core.CommClosedError: in <distributed.comm.tcp.TCPConnector object at 0x7f7b61da2ad0>: ConnectionRefusedError: [Errno 111] Connection refused
2024-08-11T02:55:21.808131015Z 
2024-08-11T02:55:21.808138238Z The above exception was the direct cause of the following exception:
2024-08-11T02:55:21.808144972Z 
2024-08-11T02:55:21.808152623Z Traceback (most recent call last):
2024-08-11T02:55:21.808160140Z   File "/home/ia-tests/verify_dask_standalone.py", line 8, in <module>
2024-08-11T02:55:21.808167162Z     client = Client("localhost:8786")
2024-08-11T02:55:21.808174066Z   File "/opt/conda/miniconda3/envs/dask/lib/python3.10/site-packages/distributed/client.py", line 1018, in __init__
2024-08-11T02:55:21.808181030Z     self.start(timeout=timeout)
2024-08-11T02:55:21.808189142Z   File "/opt/conda/miniconda3/envs/dask/lib/python3.10/site-packages/distributed/client.py", line 1220, in start
2024-08-11T02:55:21.808196282Z     sync(self.loop, self._start, **kwargs)
2024-08-11T02:55:21.808211894Z   File "/opt/conda/miniconda3/envs/dask/lib/python3.10/site-packages/distributed/utils.py", line 434, in sync
2024-08-11T02:55:21.808219326Z     raise error
2024-08-11T02:55:21.808226280Z   File "/opt/conda/miniconda3/envs/dask/lib/python3.10/site-packages/distributed/utils.py", line 408, in f
2024-08-11T02:55:21.808233318Z     result = yield future
2024-08-11T02:55:21.808240475Z   File "/opt/conda/miniconda3/envs/dask/lib/python3.10/site-packages/tornado/gen.py", line 766, in run
2024-08-11T02:55:21.808247774Z     value = future.result()
2024-08-11T02:55:21.808254804Z   File "/opt/conda/miniconda3/envs/dask/lib/python3.10/site-packages/distributed/client.py", line 1299, in _start
2024-08-11T02:55:21.808262278Z     await self._ensure_connected(timeout=timeout)
2024-08-11T02:55:21.808269083Z   File "/opt/conda/miniconda3/envs/dask/lib/python3.10/site-packages/distributed/client.py", line 1361, in _ensure_connected
2024-08-11T02:55:21.808276435Z     comm = await connect(
2024-08-11T02:55:21.808283313Z   File "/opt/conda/miniconda3/envs/dask/lib/python3.10/site-packages/distributed/comm/core.py", line 368, in connect
2024-08-11T02:55:21.808290749Z     raise OSError(
2024-08-11T02:55:21.808298079Z OSError: Timed out trying to connect to tcp://localhost:8786 after 30 s
2024-08-11T02:55:21.808304335Z 
2024-08-11T02:55:21.808310463Z 
2024-08-11T02:55:21.808316315Z ----------------------------------------------------------------------
2024-08-11T02:55:21.808322630Z Ran 1 test in 594.283s
2024-08-11T02:55:21.808328780Z 
2024-08-11T02:55:21.808335329Z FAILED (failures=1)
2024-08-11T02:55:21.808341831Z ================================================================================
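
The `ConnectionRefusedError` in the log above means nothing was accepting connections on port 8786 when the verification script ran. As a minimal sketch, assuming it is acceptable for a check to poll the port before constructing the dask `Client`, the following stdlib-only helper separates "scheduler is still starting" from "scheduler never started". `wait_for_port` is a hypothetical name, not part of dask or of this repository, and the host and port are the ones shown in the traceback.

```python
import socket
import time


def wait_for_port(host, port, timeout=60.0, interval=1.0):
    """Poll until a TCP connection to (host, port) succeeds.

    Returns True once a connection is accepted, or False if the
    deadline passes without a successful connection.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # create_connection raises OSError (including
            # ConnectionRefusedError and timeouts) when nothing answers.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


if __name__ == "__main__":
    # Port 8786 is the scheduler address used by verify_dask_standalone.py above.
    if wait_for_port("localhost", 8786, timeout=30.0):
        print("scheduler port is accepting connections")
    else:
        print("scheduler port never opened; check the scheduler service on the master node")
```

A check like this would not fix a scheduler that fails to start, but it would make the failure mode explicit before the client's own 30 s timeout is reached.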

cjac avatar Aug 11 '24 03:08 cjac

Digging in, I see this:

root@cluster-1718310842-m:~# /opt/conda/miniconda3/envs/dask/bin/python /tmp/init/dask/verify_dask_standalone.py 
/opt/conda/miniconda3/envs/dask/lib/python3.9/site-packages/distributed/client.py:1309: VersionMismatchWarning: Mismatched versions found

+-------------+----------------+----------------+---------+
| Package     | Client         | Scheduler      | Workers |
+-------------+----------------+----------------+---------+
| dask        | 2022.8.1       | 2022.9.2       | None    |
| distributed | 2022.8.1       | 2022.9.2       | None    |
| python      | 3.9.19.final.0 | 3.10.8.final.0 | None    |
+-------------+----------------+----------------+---------+
  warnings.warn(version_module.VersionMismatchWarning(msg[0]["warning"]))

cjac avatar Aug 11 '24 03:08 cjac

That is a bug I fixed in dask standalone but not in dask + rapids.

cjac avatar Aug 11 '24 04:08 cjac

/gcbrun

cjac avatar Aug 12 '24 21:08 cjac

/gcbrun

cjac avatar Aug 13 '24 00:08 cjac

/gcbrun

cjac avatar Aug 14 '24 03:08 cjac

/gcbrun

cjac avatar Aug 14 '24 04:08 cjac