
Skip pyNVML to support Tegra devices

Open · pentschev opened this pull request 4 years ago • 13 comments

Fixes #400

pentschev avatar Sep 25 '20 21:09 pentschev

cc @JasonAtNvidia

pentschev avatar Sep 25 '20 21:09 pentschev

Codecov Report

Merging #402 into branch-0.16 will decrease coverage by 2.65%. The diff coverage is 69.23%.

Impacted file tree graph

@@               Coverage Diff               @@
##           branch-0.16     #402      +/-   ##
===============================================
- Coverage        59.65%   56.99%   -2.66%     
===============================================
  Files               17       19       +2     
  Lines             1331     1451     +120     
===============================================
+ Hits               794      827      +33     
- Misses             537      624      +87     
Impacted Files Coverage Δ
dask_cuda/utils.py 85.31% <69.23%> (-2.00%) :arrow_down:
dask_cuda/explicit_comms/dataframe_merge.py 90.38% <0.00%> (-0.45%) :arrow_down:
dask_cuda/explicit_comms/__init__.py 100.00% <0.00%> (ø)
dask_cuda/benchmarks/local_cupy_map_overlap.py 0.00% <0.00%> (ø)
dask_cuda/explicit_comms/dataframe_shuffle.py 95.52% <0.00%> (ø)
dask_cuda/device_host_file.py 98.64% <0.00%> (+0.03%) :arrow_up:
dask_cuda/cli/dask_cuda_worker.py 96.77% <0.00%> (+0.05%) :arrow_up:
dask_cuda/initialize.py 92.59% <0.00%> (+0.28%) :arrow_up:
dask_cuda/cuda_worker.py 72.28% <0.00%> (+0.86%) :arrow_up:
dask_cuda/local_cuda_cluster.py 82.95% <0.00%> (+0.93%) :arrow_up:
... and 2 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Powered by Codecov. Last update 71fad3a...0c9f29a. Read the comment docs.

codecov-commenter avatar Sep 25 '20 21:09 codecov-commenter

Would this make sense to do generally, or have we found issues with this approach in some cases? Asking as we now know of two cases where NVML is not an option: Tegra and WSL.

jakirkham avatar Sep 25 '20 21:09 jakirkham

Would this make sense to do generally, or have we found issues with this approach in some cases? Asking as we now know of two cases where NVML is not an option: Tegra and WSL.

Using Numba to do that implies creating a context, and that can be tricky in certain situations where we need the context creation to be delayed. We should evaluate context creation on a case-by-case basis as this has proven problematic in several instances in the past.

TBH, I don't know if things are 100% correct for Tegra, but it's very difficult to evaluate much further without being able to test. Long story short: I hope this PR will provide some support for Tegra, but we can't make any guarantees without extensive testing; the same applies to WSL or any other platforms we may want to support in the future.
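For illustration, the NVML-first approach with a delayed Numba fallback could look roughly like this. This is a hedged sketch, not dask-cuda's actual implementation; the function name and structure are illustrative only:

```python
def get_device_total_memory(device_index=0):
    """Sketch: prefer NVML for the device's total memory, falling back to a
    CUDA-context query on platforms (e.g., Tegra) where NVML is unavailable.
    Illustrative only -- not dask-cuda's real code path."""
    try:
        import pynvml

        pynvml.nvmlInit()
        handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
        # NVML answers without creating a CUDA context.
        return pynvml.nvmlDeviceGetMemoryInfo(handle).total
    except Exception:
        # Fallback: this creates a CUDA context as a side effect, which can
        # be a problem when context creation must be delayed (and it fails
        # outright if RMM's memory manager is plugged into Numba).
        import numba.cuda

        return numba.cuda.current_context().get_memory_info()[1]
```

The try/except keeps NVML as the primary path while still giving non-NVML platforms a chance to answer the query.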

pentschev avatar Sep 28 '20 09:09 pentschev

Using Numba to do that implies creating a context, and that can be tricky in certain situations where we need the context creation to be delayed. We should evaluate context creation on a case-by-case basis as this has proven problematic in several instances in the past.

Won't it already be tricky to initialize the context? AIUI we wanted this for multi-GPU devices.

TBH, I don't know if things are 100% correct for Tegra, but it's very difficult to evaluate much further without being able to test. Long story short: I hope this PR will provide some support for Tegra, but we can't make any guarantees without extensive testing; the same applies to WSL or any other platforms we may want to support in the future.

Fair enough. I think we have to rely on folks being pretty engaged to keep things working. This is also why I'm wondering if we can standardize on a common code path that we are able to test in other cases, at least.

jakirkham avatar Oct 07 '20 17:10 jakirkham

In a discussion with @JasonAtNvidia and @jakirkham, we decided to move this to 0.17. One of the limitations we found is getting GPU information, such as memory, without NVML. That requires Numba, and since we replace Numba's memory manager with RMM's, and RMM doesn't implement that functionality, we get errors:

============================= test session starts ==============================
platform linux -- Python 3.7.8, pytest-6.1.1, py-1.9.0, pluggy-0.13.1 -- /mnt/data/miniforge3/envs/dasktest/bin/python
cachedir: .pytest_cache
rootdir: /home/nvidia/Documents/dask-cuda
collecting ... collected 998 items / 1 error / 1 skipped / 996 selected
==================================== ERRORS ====================================
________________ ERROR collecting dask_cuda/tests/test_spill.py ________________
dask_cuda/tests/test_spill.py:18: in <module>
    if utils.get_device_total_memory() < 1e10:
/mnt/data/miniforge3/envs/dasktest/lib/python3.7/site-packages/dask_cuda-0.16.0a0+95.g04dcbb6.dirty-py3.7.egg/dask_cuda/utils.py:175: in get_device_total_memory
    return numba.cuda.current_context().get_memory_info()[1]
/mnt/data/miniforge3/envs/dasktest/lib/python3.7/site-packages/numba/cuda/cudadrv/driver.py:1040: in get_memory_info
    return self.memory_manager.get_memory_info()
/mnt/data/miniforge3/envs/dasktest/lib/python3.7/site-packages/rmm/rmm.py:214: in get_memory_info
    raise NotImplementedError()
E   NotImplementedError
=============================== warnings summary ===============================
../../../../mnt/data/miniforge3/envs/dasktest/lib/python3.7/site-packages/pandas/util/__init__.py:12
  /mnt/data/miniforge3/envs/dasktest/lib/python3.7/site-packages/pandas/util/__init__.py:12: FutureWarning: pandas.util.testing is deprecated. Use the functions in the public API at pandas.testing instead.
    import pandas.util.testing
dask_cuda/tests/test_local_cuda_cluster.py:76
  /home/nvidia/Documents/dask-cuda/dask_cuda/tests/test_local_cuda_cluster.py:76: PytestUnknownMarkWarning: Unknown pytest.mark.asyncio - is this a typo?  You can register custom marks to avoid this warning - for details, see https://docs.pytest.org/en/stable/mark.html
    @pytest.mark.asyncio
dask_cuda/tests/test_local_cuda_cluster.py:89
  /home/nvidia/Documents/dask-cuda/dask_cuda/tests/test_local_cuda_cluster.py:89: PytestUnknownMarkWarning: Unknown pytest.mark.asyncio - is this a typo?  You can register custom marks to avoid this warning - for details, see https://docs.pytest.org/en/stable/mark.html
    @pytest.mark.asyncio
-- Docs: https://docs.pytest.org/en/stable/warnings.html
=========================== short test summary info ============================
ERROR dask_cuda/tests/test_spill.py - NotImplementedError
!!!!!!!!!!!!!!!!!!!! Interrupted: 1 error during collection !!!!!!!!!!!!!!!!!!!!
=================== 1 skipped, 3 warnings, 1 error in 4.28s ====================
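The failure mode in the log above can be reproduced without a GPU using stand-in classes that mirror the Numba/RMM interaction. These are illustrative stand-ins, not the real Numba or RMM types:

```python
class RMMLikeMemoryManager:
    """Stand-in for RMM's Numba memory-manager plugin, which (as of this
    discussion) does not implement get_memory_info()."""

    def get_memory_info(self):
        raise NotImplementedError()  # mirrors rmm/rmm.py:214 in the log


class ContextLike:
    """Stand-in for Numba's CUDA context, which delegates the query to
    whatever memory manager is currently installed."""

    memory_manager = RMMLikeMemoryManager()

    def get_memory_info(self):
        # numba/cuda/cudadrv/driver.py delegates to the memory manager
        return self.memory_manager.get_memory_info()


try:
    ContextLike().get_memory_info()
except NotImplementedError:
    print("RMM-managed context cannot report memory info")
```

This is why any non-NVML path that relies on `numba.cuda.current_context().get_memory_info()` breaks as soon as RMM is installed as the allocator.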

pentschev avatar Oct 07 '20 19:10 pentschev

Won't it already be tricky to initialize the context? AIUI we wanted this for multi-GPU devices.

Even single-GPU devices can benefit from these changes for Tegra clusters.

Fair enough. I think we have to rely on folks being pretty engaged to keep things working. This is also why I'm wondering if we can standardize on a common code path that we are able to test in other cases, at least.

It seems we'll soon begin testing more with Tegra, so that will help us here. To be frank, I don't want to move away from NVML only because of Tegra: NVML is much better in various ways (e.g., no need to create a CUDA context) and supports features like gathering CPU affinity for each GPU that other tools don't, so I'd rather have two code paths.
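As an illustration of the two-code-paths idea, here is a hedged sketch of an NVML-only query (CPU affinity) that simply degrades to "no information" on platforms without NVML. It assumes pynvml's `nvmlDeviceGetCpuAffinity`, which returns the affinity as a bitmask packed into 64-bit words; the function name and fallback behavior are illustrative:

```python
import math
import multiprocessing


def get_cpu_affinity(device_index=0):
    """Sketch: list the CPUs with affinity to a GPU via NVML, or return
    None on platforms without NVML (e.g., Tegra). Illustrative only."""
    try:
        import pynvml

        pynvml.nvmlInit()
        handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
        cpu_count = multiprocessing.cpu_count()
        # One 64-bit word per 64 CPUs in the affinity bitmask.
        affinity = pynvml.nvmlDeviceGetCpuAffinity(
            handle, math.ceil(cpu_count / 64)
        )
        return [i for i in range(cpu_count) if affinity[i // 64] & (1 << (i % 64))]
    except Exception:
        # No NVML available: there is no equivalent query, so report nothing.
        return None
```

Since there is no Numba equivalent for this query, the non-NVML path can only report the absence of the information rather than compute it another way.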

pentschev avatar Oct 07 '20 20:10 pentschev

It seems we'll have some Tegra devices in CI soon, we can then push this PR forward.

pentschev avatar Nov 30 '20 21:11 pentschev

This PR has been marked stale due to no recent activity in the past 30d. Please close this PR if it is no longer required. Otherwise, please respond with a comment indicating any updates. This PR will be marked rotten if there is no activity in the next 60d.

github-actions[bot] avatar Feb 16 '21 19:02 github-actions[bot]

I believe we are still waiting on Tegra devices

quasiben avatar Feb 17 '21 13:02 quasiben

@mike-wendt @raydouglass based on the discussion from Ops Demo Tuesday, this is a PR we would like to test with Tegra devices when we have it available in CI.

pentschev avatar Feb 23 '21 22:02 pentschev

@mike-wendt @raydouglass based on the discussion from Ops Demo Tuesday, this is a PR we would like to test with Tegra devices when we have it available in CI.

@pentschev @quasiben as an update, @Ethyling and I are working on getting the L4T images integrated on Monday. Once we get that done, we'll have a platform to easily test these builds for CUDA 10.2, as that's the only CUDA version currently supported on L4T/AGX.

mike-wendt avatar Mar 19 '21 15:03 mike-wendt

Sounds great, thanks @mike-wendt! Let us know if/how we can assist.

pentschev avatar Mar 19 '21 16:03 pentschev