
[FEA] Support for NUMA-aware spilling

lmeyerov opened this issue 3 years ago · 5 comments

We're trying to optimize a dask-cudf job for NUMA: DGX with multiple storage controllers -> 4 NUMA nodes -> 8 GPUs. For predictability, we're trying to tag workers with resources like:

environment:
  - CUDA_VISIBLE_DEVICES=3
cmd: dask-cuda-worker --resources "GPU=1 GPU3=1 NUMA5=1"

In turn, that would let us color dask subgraphs by NUMA info:

tasks = []
for node in [1,3,5,7]:
  with dask.annotate(resources={f'NUMA{node}': 1}):
    tasks.append(...)
wait(...)
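
For context, a fuller version of that sketch would look something like the following (placeholders only: the scheduler address and task bodies are stand-ins for our real job, and it assumes the workers actually advertise the NUMA{node} resources):

import dask
from dask.distributed import Client, wait

client = Client("dask-scheduler:8786")  # placeholder scheduler address

tasks = []
for node in [1, 3, 5, 7]:
    # Pin this subgraph to workers that advertise the matching NUMA resource
    with dask.annotate(resources={f"NUMA{node}": 1}):
        tasks.append(dask.delayed(sum)(range(node)))  # placeholder task body

# optimize_graph=False helps keep per-layer annotations from being fused away
futures = client.compute(tasks, optimize_graph=False)
wait(futures)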

However, we're seeing unrecognized-argument errors when initializing dask-cuda-worker with multiple resources.

Is there a known happy path / example for tagging workers with multiple resources? I can work up a repro / bug report, but in the meantime I've tried a few syntaxes (plus passing the values in via env vars) with no luck, and was going to try starting the workers via Python next...

lmeyerov · Mar 14 '21 02:03

For posterity: commas + explicit args work in docker-compose.yml:

  dask-cuda-worker0:
    <<: *worker_opts
    environment:
      - CUDA_VISIBLE_DEVICES=0
    command: ["dask-cuda-worker", "--interface", "eth0", "--no-dashboard", "--dashboard-address", "dask-cuda-worker0:8787", "--resources", "GPU=1,GPU0=1,NUMA3=1", "dask-scheduler:8786"]

Otherwise, it looks like Docker was eating the quotes, which confused Dask's CLI parsing.
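
To confirm the resources actually registered, a quick client-side check like this works (a sketch; it relies on Client.run injecting the dask_worker keyword, and attribute names may differ across distributed versions):

from dask.distributed import Client

client = Client("dask-scheduler:8786")  # scheduler address from the compose file above

# Client.run hands each worker's Worker object in as `dask_worker`;
# total_resources should show e.g. {"GPU": 1, "GPU0": 1, "NUMA3": 1}
print(client.run(lambda dask_worker: dask_worker.total_resources))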

lmeyerov · Mar 15 '21 01:03

I'm not sure if you're trying to do something much more specific, but one of the features Dask-CUDA provides is setting CPU affinity (NUMA-ness) for each GPU; see:

https://github.com/rapidsai/dask-cuda/blob/09196cb5c92effca6da660231e58a5bf4ac72c76/dask_cuda/cuda_worker.py#L215
https://github.com/rapidsai/dask-cuda/blob/09196cb5c92effca6da660231e58a5bf4ac72c76/dask_cuda/local_cuda_cluster.py#L311
https://github.com/rapidsai/dask-cuda/blob/09196cb5c92effca6da660231e58a5bf4ac72c76/dask_cuda/utils.py#L26-L31

With that said, unless you have a more specific use case in mind, you probably don't need to do all that by yourself.
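
For reference, you can inspect that affinity yourself with something along these lines (a rough sketch, assuming dask_cuda.utils.get_cpu_affinity is importable in your environment and NVML can see all the GPUs):

from dask_cuda.utils import get_cpu_affinity  # helper behind the links above

# List the CPU cores NVML reports as local to each GPU; Dask-CUDA pins each
# GPU's worker process to this set of cores.
for gpu in range(8):  # 8 GPUs on the DGX described above
    print(f"GPU {gpu}: cores {get_cpu_affinity(gpu)}")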

pentschev · Mar 15 '21 14:03

Thanks @pentschev, that's wonderful to hear re: the CPU core / NUMA node mapping. Exploring that was on my list for today/tomorrow!

This is likely more of a dask_cudf question, but maybe relevant here too: are there any settings to check/set for parallel IO (file read/write, task spill, ...)?

-- Storage: Is there a good way to force NUMA-aware reads/writes? Our storage tier is a parallel array of many 20 GB/s devices with a roughly balanced matchup to the 4 NUMA nodes on our system, so we want to ensure every GPU reads/writes on its designated pathways without stomping on another GPU's stream. For reads/writes we are manually coloring the task graph, e.g. with dask.annotate(resources={'NUMA_<NODE#>': 1}): dask_cudf.read_parquet('/fs_numa_<NODE#>/...') (see the sketch after this list), but maybe there is something smarter? (FWIW, we'll be enabling GDS for the read/write case as soon as it hits the nightlies, but not for spills.)

-- Ephemeral state: We also have pretty good local storage (though I'm fuzzy on the details), and if there are implications for the GPU <-> NUMA node <-> local storage mapping, I'm curious about that as well. With our 100+ GB/s of parallel GDS controllers, it may even be better to spill there, but again I'm not sure which settings we should tweak. E.g., I see --local-directory as a worker option, which I can set to different mounts on different IO paths, so maybe that's all that's needed?

-- Striping: For the above, for predictability, we're partitioning data coarsely across drives to get better parallel reads: parquet files are assigned round-robin across disks (or duplicated), and app code manually colors the task graph accordingly (as in the example above). However, to enable automatic parallelism that hides the NUMA details, which would be much better for app devs, we're also experimenting with a unified FS view that automatically stripes at smaller/tunable chunk sizes across storage units. Is there a way for dask-cuda to understand / leverage something like that? Ideally we can hide most of the details at the infra / Dask config layer, and app devs don't need to be NUMA-aware.
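
Here's the sketch mentioned above for the manually-colored reads, for concreteness (placeholders only: the /fs_numa_<N> mounts, NUMA_<N> resource names, and scheduler address stand in for our layout, and the matching workers are assumed to run with --local-directory on the same mount so spills stay on that IO path):

import dask
import dask.dataframe as dd
import dask_cudf
from dask.distributed import Client

client = Client("dask-scheduler:8786")  # placeholder scheduler address

parts = []
for node in [1, 3, 5, 7]:
    # Keep each read on workers tagged with the NUMA node that owns this mount
    with dask.annotate(resources={f"NUMA_{node}": 1}):
        parts.append(dask_cudf.read_parquet(f"/fs_numa_{node}/dataset/"))  # placeholder paths

ddf = dd.concat(parts)
# optimize_graph=False helps preserve the per-layer resource annotations
ddf = client.persist(ddf, optimize_graph=False)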

lmeyerov · Mar 15 '21 16:03

Thanks for the details @lmeyerov. As you noticed, --local-directory is something you could set to have spilling go to a high-bandwidth FS. However, the remainder of your question is currently out of scope for Dask-CUDA; at a very high level, I foresee it would require some heavy changes to the spilling mechanism, as well as NUMA identification based on device topology (or support for manual configuration), all of which are non-trivial tasks.

I'll mark this as a feature request for now, but we can't provide an ETA. If you or anybody from the community is willing to work on such a feature, we would welcome the contribution!

pentschev · Mar 15 '21 19:03

Yeah, NUMA/locality-aware scheduling is definitely an R&D-level problem in practice, so I get it :) It'd be cool to be able to set up / automate NUMA tags for a multi-GPU node as part of a single CLI launch, but for now it looks like we either need to run multiple manually-tweaked per-GPU CLI services in our docker-compose, or write some sort of custom init script to do it as one.

If there are other NUMA-relevant settings we should poke at, pointers would be appreciated. Thanks again!

lmeyerov · Mar 15 '21 19:03