Update RAPIDS version for 24.10 release
This PR updates the spark-rapids script version to 24.10.0 and updates the README doc.
@viadea please help review. CC @jayadeep-jayaraman @cjac
Oh hey, thanks for the ping. I'll check it out.
/gcbrun
Hi @cjac, can we merge this PR?
Hi @cjac, any update?
I apologize for the delay here.
I'm caught up adding installation from local disk as an option to rapids/rapids.sh; I had begun seeing weekly CDN-related build failures, so I'm bringing the packages closer to the cluster to improve CI/CD test performance.
Unfortunately, conda does not presently install directly from directly attached media, opting instead to copy the packages to an intermediate temp directory before unpacking.
https://github.com/conda/conda/issues/14377
If movement on spark-rapids is urgent enough to merit setting the dask-rapids work aside rather than finishing it first, I may be able to switch context. I would prefer to finish the other work first, but if NVIDIA wants to see a new version of spark-rapids before mid-December, let me know and I'll switch tracks for a bit.
My current estimate for completion of dask-rapids work is later this week. Then I will take a look at spark-rapids/ for the first time since it got its own directory.
C.J.
Hi @cjac, just checking in: can we merge this PR now?
I haven't tested it thoroughly yet. I've been caught up in refactoring shared code into templates. I would like to generate this file from components rather than copy/pasting between scripts.
Can you let me know what you think of #1282 please?
Let's merge the PR for now; it is just a version update. Thanks!
let me try it in my environment...
Running without cuda-version specified produces a request for:
- https://download.nvidia.com/XFree86/Linux-x86_64/530.30.02/NVIDIA-Linux-x86_64-530.30.02.run
- https://developer.download.nvidia.com/compute/cuda/12.1.1/local_installers/cuda_12.1.1_530.30.02_linux.run
Incorrectly specifying rapids-runtime as --metadata rapids-runtime="RAPIDS" produces a usable error message.
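For context, the check behind that error can be sketched as a small shell function. This is an illustrative stand-in, not the actual code in spark-rapids.sh; the accepted values (SPARK, DASK) are an assumption carried over from the sibling rapids/rapids.sh script.

```shell
#!/usr/bin/env bash
# Hypothetical validator mirroring the metadata check described above.
# The accepted values are assumed, not taken from spark-rapids.sh itself.
validate_rapids_runtime() {
  case "$1" in
    SPARK|DASK)
      echo "rapids-runtime=$1" ;;
    *)
      echo "unsupported rapids-runtime '$1'; expected SPARK or DASK" >&2
      return 1 ;;
  esac
}

validate_rapids_runtime SPARK          # accepted
validate_rapids_runtime RAPIDS || true # rejected with a usable error message
```

Failing fast with a message naming the expected values is what makes the error "usable" in the sense described above.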
@nvliyuan - do you mind if I commit to your branch?
please feel free to commit, thanks
A re-re-re-run took 4m35.536s to complete; this looks good to me. Let me run it through the automated tests.
It looks like it built the kernel more than once.
New code in templates/spark-rapids/ [1] caches builds to GCS after the first run completes, so subsequent similar runs will have less work to do.
[1] https://github.com/LLC-Technologies-Collier/initialization-actions/tree/template-gpu-20241219/templates/spark-rapids
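The cache-then-reuse flow described above can be sketched generically. The command names below are injected placeholders so the control flow can be exercised locally; in the actual templates these would presumably be gsutil stat/copy calls against a GCS bucket.

```shell
#!/usr/bin/env bash
# Generic sketch of the first-run-builds, later-runs-fetch pattern.
# stat_cmd / fetch_cmd / build_cmd are stand-ins for the real GCS commands.
fetch_or_build() {
  local stat_cmd="$1" fetch_cmd="$2" build_cmd="$3"
  if $stat_cmd; then
    $fetch_cmd   # cache hit: reuse the artifact built by an earlier run
  else
    $build_cmd   # cache miss: do the expensive build (e.g. kernel modules)
  fi
}

fetch_or_build true  "echo cache-hit" "echo building"   # prints cache-hit
fetch_or_build false "echo cache-hit" "echo building"   # prints building
```

This is why only the first run pays the full kernel-build cost; subsequent similar runs take the fetch branch.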
/gcbrun
There's a known problem with our build system. One moment, please.
/gcbrun
/gcbrun
Failure on 2.2-rocky9; I'll spin that up in my env.
Oof, I forgot DKMS took so long to run. I think the test may be timing out while waiting on dnf -y -q module install nvidia-driver:latest-dkms.
dnf -y -q install cuda-toolkit is taking a long time, too.
the run takes 14m9.444s on rocky9
# echo $?
0
/gcbrun
[edited to add: I was incorrect to assume that nvliyuan/initialization-actions' master tracks GoogleCloudDataproc/initialization-actions' master]
I'm sorry, I seem to have done something to the commit history here. The diffstat looks very wrong at this point.
$ git diff master | diffstat
CONTRIBUTING.md | 20 ++---
cloudbuild/Dockerfile | 22 +++++-
cloudbuild/presubmit.sh | 1
cloudbuild/run-presubmit-on-k8s.sh | 34 +++++++--
dask/dask.sh | 213 ++++++++++++++++++++++++++++++++++++++++++++--------------
dask/test_dask.py | 14 +++
gpu/Dockerfile | 40 +++++++++++
gpu/README.md | 28 +++----
gpu/bazel.screenrc | 11 +++
gpu/env.json.sample | 7 +
gpu/install_gpu_driver.sh | 652 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++---------------------------------------------------------
gpu/manual-test-runner.sh | 77 +++++++++++++++++++++
gpu/run-bazel-tests.sh | 24 ++++++
gpu/test_gpu.py | 167 ++++++++++++++++++++++++++++++++-------------
h2o/sample-script.py | 11 ---
horovod/horovod.sh | 4 -
horovod/test_horovod.py | 13 ++-
hue/README.md | 118 ++++++++++++++++++++++++++++++++
hue/another-query.png |binary
hue/create-hive-table.png |binary
hue/hue-ui.png |binary
hue/simple-hiveql.png |binary
integration_tests/dataproc_test_case.py | 18 +++-
rapids/BUILD | 2
rapids/Dockerfile | 40 +++++++++++
rapids/bazel.screenrc | 17 ++++
rapids/env.json.sample | 7 +
rapids/manual-test-runner.sh | 77 +++++++++++++++++++++
rapids/rapids.sh | 814 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-----------------------------------------------------------
rapids/run-bazel-tests.sh | 23 ++++++
rapids/test_rapids.py | 137 +++++++++++--------------------------
rapids/verify_rapids_dask.py | 19 -----
rapids/verify_rapids_dask_yarn.py | 19 +++++
spark-rapids/README.md | 19 -----
spark-rapids/spark-rapids.sh | 14 ++-
35 files changed, 1943 insertions(+), 719 deletions(-)
My apologies; ambiguous use of 'master' here. When diffed against origin master's commit 169e98e424d87833e10b7761130f08bfde4b4815, I see what I expect.
/gcbrun
Many tests have passed; still standing by for a full green run.
okay, that looks good.