
update rapids version for 24.10 release

Open nvliyuan opened this issue 1 year ago • 3 comments

This PR updates the spark-rapids script version to 24.10.0 and updates the README.

nvliyuan avatar Oct 23 '24 09:10 nvliyuan

@viadea please help review. CC @jayadeep-jayaraman @cjac

nvliyuan avatar Oct 24 '24 02:10 nvliyuan

Oh hey, thanks for the ping. I'll check it out.

cjac avatar Oct 24 '24 03:10 cjac

/gcbrun

cjac avatar Oct 24 '24 03:10 cjac

Hi @cjac , can we merge this pr?

nvliyuan avatar Nov 05 '24 07:11 nvliyuan

Hi @cjac , any update?

nvliyuan avatar Nov 11 '24 01:11 nvliyuan

I apologize for the delay here.

I'm tied up adding installation from local disk as an option to rapids/rapids.sh. I had begun seeing weekly CDN-related build failures, so I'm bringing the packages closer to the cluster to improve CI/CD test performance.

Unfortunately, conda cannot currently install straight from direct-attached media; it copies the packages to an intermediate temp directory before unpacking.

https://github.com/conda/conda/issues/14377

If movement on spark-rapids is urgent enough to merit setting the dask-rapids work aside before it is finished, I may be able to switch context. I would prefer to finish dask-rapids first, but if NV wants to see a new version of spark-rapids before the middle of December, let me know and I'll switch tracks for a bit.

My current estimate for completion of dask-rapids work is later this week. Then I will take a look at spark-rapids/ for the first time since it got its own directory.

C.J.

cjac avatar Nov 11 '24 06:11 cjac

Hi @cjac , can we merge this PR now?

nvliyuan avatar Dec 24 '24 06:12 nvliyuan

I haven't tested it thoroughly yet. I've been caught up in refactoring shared code into templates. I would like to generate this file from components rather than copy/pasting between scripts.

Can you let me know what you think of #1282 please?

cjac avatar Dec 24 '24 23:12 cjac

Let's merge the pr for now, it is just a version update, thx

nvliyuan avatar Dec 25 '24 01:12 nvliyuan

let me try it in my environment...

cjac avatar Dec 25 '24 02:12 cjac

running without cuda-version specified produces a request for:

  • https://download.nvidia.com/XFree86/Linux-x86_64/530.30.02/NVIDIA-Linux-x86_64-530.30.02.run
  • https://developer.download.nvidia.com/compute/cuda/12.1.1/local_installers/cuda_12.1.1_530.30.02_linux.run
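
The two requests above follow a regular pattern built from a driver version and a CUDA version. A minimal sketch of that composition, assuming hypothetical helper names (`nvidia_driver_url`, `cuda_runfile_url`) that are not taken from the script itself; only the URL layouts come from the list above:

```shell
# Hypothetical helpers: the function names are illustrative, not from the script.
# The URL layouts are copied from the two requests listed above.

# Print the driver installer URL for a given driver version.
nvidia_driver_url() {
  local driver_version="$1"
  printf 'https://download.nvidia.com/XFree86/Linux-x86_64/%s/NVIDIA-Linux-x86_64-%s.run\n' \
    "${driver_version}" "${driver_version}"
}

# Print the CUDA runfile URL for a given CUDA version and bundled driver version.
cuda_runfile_url() {
  local cuda_version="$1"
  local driver_version="$2"
  printf 'https://developer.download.nvidia.com/compute/cuda/%s/local_installers/cuda_%s_%s_linux.run\n' \
    "${cuda_version}" "${cuda_version}" "${driver_version}"
}

nvidia_driver_url 530.30.02
cuda_runfile_url 12.1.1 530.30.02
```

With the versions from this run (driver 530.30.02, CUDA 12.1.1), the two calls print the same URLs that were requested.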

cjac avatar Dec 25 '24 02:12 cjac

Incorrectly specifying rapids-runtime as --metadata rapids-runtime="RAPIDS" produces a usable error message.
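
A usable error for a bad metadata value usually comes from validating it before any installation work starts. A minimal sketch, assuming `SPARK` is the only accepted value (an assumption; this thread only establishes that `RAPIDS` is rejected) and using a hypothetical function name and message:

```shell
# Hypothetical validator: the name, accepted value, and message are
# assumptions for illustration, not the script's actual code.
validate_rapids_runtime() {
  local runtime="$1"
  case "${runtime}" in
    SPARK)
      return 0
      ;;
    *)
      # Report the offending value and the expected one on stderr.
      echo "Unsupported rapids-runtime: '${runtime}' (expected SPARK)" >&2
      return 1
      ;;
  esac
}

# Example: a mistyped value is rejected with a readable message.
validate_rapids_runtime "RAPIDS" || echo "rejected"
```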

cjac avatar Dec 25 '24 02:12 cjac

@nvliyuan - do you mind if I commit to your branch?

cjac avatar Dec 25 '24 02:12 cjac

please feel free to commit, thanks

nvliyuan avatar Dec 25 '24 02:12 nvliyuan

A re-re-re-run took 4m35.536s to complete. This looks good to me. Let me run it through the automated tests.

It looks like it built the kernel more than once.

The NG code in templates/spark-rapids/ [1] caches builds to GCS after the first run completes, so subsequent similar runs will have less work to do.

[1] https://github.com/LLC-Technologies-Collier/initialization-actions/tree/template-gpu-20241219/templates/spark-rapids
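
The cache-after-first-run idea can be sketched as a check-then-build wrapper. This is not the code in templates/spark-rapids/; it is a minimal illustration that uses a local directory as a stand-in for the GCS bucket, and the names `fetch_or_build` and `CACHE_DIR` are hypothetical:

```shell
# Stand-in for a GCS bucket; a real script would copy to and from gs:// paths.
CACHE_DIR="${CACHE_DIR:-${TMPDIR:-/tmp}/build-cache-demo}"

# fetch_or_build KEY OUTPUT BUILD_COMMAND...
# Copies OUTPUT from the cache when KEY is present. Otherwise runs
# BUILD_COMMAND (which must create OUTPUT) and stores the result under KEY.
fetch_or_build() {
  local key="$1"
  local output="$2"
  shift 2
  mkdir -p "${CACHE_DIR}"
  if [ -f "${CACHE_DIR}/${key}" ]; then
    cp "${CACHE_DIR}/${key}" "${output}"
    echo "cache hit: ${key}"
  else
    # Run the expensive build; give up without caching if it fails.
    "$@" || return 1
    cp "${output}" "${CACHE_DIR}/${key}"
    echo "cache miss: ${key} (built and stored)"
  fi
}
```

The first call with a given key pays for the build; later calls with the same key only copy the stored artifact, which is the effect described above for repeated similar runs.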

cjac avatar Dec 25 '24 03:12 cjac

/gcbrun

cjac avatar Dec 25 '24 03:12 cjac

There's a known problem with our build system. One moment, please.

cjac avatar Dec 25 '24 03:12 cjac

/gcbrun

cjac avatar Dec 25 '24 03:12 cjac

/gcbrun

cjac avatar Dec 25 '24 03:12 cjac

failure on 2.2-rocky9 ; I'll spin that up in my env.

cjac avatar Dec 25 '24 03:12 cjac

Oof. I forgot DKMS took so long to run. I think the test may be timing out while waiting on dnf -y -q module install nvidia-driver:latest-dkms.

cjac avatar Dec 25 '24 03:12 cjac

dnf -y -q install cuda-toolkit is taking a long time, too

cjac avatar Dec 25 '24 03:12 cjac

the run takes 14m9.444s on rocky9

# echo $?
0

cjac avatar Dec 25 '24 03:12 cjac

/gcbrun

cjac avatar Dec 25 '24 03:12 cjac

[edited to add: I was incorrect to assume that nvliyuan/initialization-actions' master tracks GoogleCloudDataproc/initialization-actions' master]

I'm sorry, I seem to have done something to the commit history here. The diffstat looks very wrong at this point.

$ git diff master | diffstat
 CONTRIBUTING.md                         |   20 ++---
 cloudbuild/Dockerfile                   |   22 +++++-
 cloudbuild/presubmit.sh                 |    1 
 cloudbuild/run-presubmit-on-k8s.sh      |   34 +++++++--
 dask/dask.sh                            |  213 ++++++++++++++++++++++++++++++++++++++++++++--------------
 dask/test_dask.py                       |   14 +++
 gpu/Dockerfile                          |   40 +++++++++++
 gpu/README.md                           |   28 +++----
 gpu/bazel.screenrc                      |   11 +++
 gpu/env.json.sample                     |    7 +
 gpu/install_gpu_driver.sh               |  652 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++---------------------------------------------------------
 gpu/manual-test-runner.sh               |   77 +++++++++++++++++++++
 gpu/run-bazel-tests.sh                  |   24 ++++++
 gpu/test_gpu.py                         |  167 ++++++++++++++++++++++++++++++++-------------
 h2o/sample-script.py                    |   11 ---
 horovod/horovod.sh                      |    4 -
 horovod/test_horovod.py                 |   13 ++-
 hue/README.md                           |  118 ++++++++++++++++++++++++++++++++
 hue/another-query.png                   |binary
 hue/create-hive-table.png               |binary
 hue/hue-ui.png                          |binary
 hue/simple-hiveql.png                   |binary
 integration_tests/dataproc_test_case.py |   18 +++-
 rapids/BUILD                            |    2 
 rapids/Dockerfile                       |   40 +++++++++++
 rapids/bazel.screenrc                   |   17 ++++
 rapids/env.json.sample                  |    7 +
 rapids/manual-test-runner.sh            |   77 +++++++++++++++++++++
 rapids/rapids.sh                        |  814 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-----------------------------------------------------------
 rapids/run-bazel-tests.sh               |   23 ++++++
 rapids/test_rapids.py                   |  137 +++++++++++--------------------------
 rapids/verify_rapids_dask.py            |   19 -----
 rapids/verify_rapids_dask_yarn.py       |   19 +++++
 spark-rapids/README.md                  |   19 -----
 spark-rapids/spark-rapids.sh            |   14 ++-
 35 files changed, 1943 insertions(+), 719 deletions(-)

cjac avatar Dec 25 '24 04:12 cjac

My apologies, that was an ambiguous use of 'master'. When I diff against origin master's commit, 169e98e424d87833e10b7761130f08bfde4b4815, I see what I expect.
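
One way to avoid this kind of ambiguity is to diff against an explicit commit hash rather than a branch name that can point at different histories in different forks. A minimal sketch, assuming a hypothetical helper name and using `git diff --stat` in place of piping to diffstat:

```shell
# Hypothetical helper: summarize working-tree changes relative to an
# explicit commit instead of a possibly ambiguous branch name.
diffstat_against() {
  local base_commit="$1"
  git diff --stat "${base_commit}"
}

# Example, to be run inside a clone (the hash is the one quoted above):
#   diffstat_against 169e98e424d87833e10b7761130f08bfde4b4815
```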

cjac avatar Dec 25 '24 04:12 cjac

/gcbrun

cjac avatar Dec 25 '24 04:12 cjac

many tests have passed. still standing by for full green run.

cjac avatar Dec 25 '24 04:12 cjac

okay, that looks good.

cjac avatar Dec 25 '24 04:12 cjac