
RFC-0033-GDS-checkpointing

Open antferdom opened this issue 1 year ago • 11 comments

antferdom avatar Oct 20 '23 17:10 antferdom

Hi @antferdom!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

facebook-github-bot avatar Oct 20 '23 17:10 facebook-github-bot

@facebook-github-bot label commenting

antferdom avatar Oct 31 '23 18:10 antferdom

Thanks @antferdom --- we're starting to look into this now and are currently working on verifying if the performance looks promising compared to the current native PyTorch implementation(s)

CC @Aidyn-A @Fuzzkatt

eqy avatar Jan 24 '24 20:01 eqy

@eqy if possible, I would like to join the formal evaluations and experiments you and your team are planning. We are excited about co-developing this feature, or even just properly studying its viability and implications.

antferdom avatar Jan 24 '24 23:01 antferdom

Any update @eqy?

antferdom avatar Mar 19 '24 12:03 antferdom

@Aidyn-A and @mikaylagawarecki are currently working on it

eqy avatar Mar 19 '24 17:03 eqy

@eqy Thanks for your fast response, I appreciate it. Would it be possible for us to join the ongoing research about the feasibility of this? We truly want to push the development and integration of this to PyTorch core if it matches the performance expectations. @mikaylagawarecki

antferdom avatar Mar 20 '24 16:03 antferdom

Hey @antferdom, thank you for your enthusiasm in pushing this forward!

Let me try to give a summary of where we are at so far. From preliminary discussions my understanding is that there are 3 broad classes of cuFile APIs for GPUDirect Storage

(1) synchronous: cuFileRead/cuFileWrite
(2) asynchronous: cuFileReadAsync/cuFileWriteAsync
(3) batch I/O (threadpool-like): cuFileBatchIO*
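To make the difference between these three submission styles concrete, here is a host-only analogy in plain Python. It is not cuFile code: `os.pread` and a `ThreadPoolExecutor` stand in for the GPU-side calls, and the helper names (`read_sync`, `read_async`, `read_batch`, `demo`) are made up for illustration. Note also that the real cuFile async calls are ordered on a CUDA stream rather than returning a future, so this only sketches the calling pattern.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

CHUNK = 4096  # illustrative request size, not a cuFile setting


def read_sync(fd, size):
    """(1) Synchronous: the call returns only once the data is available."""
    return os.pread(fd, size, 0)


def read_async(pool, fd, size):
    """(2) Asynchronous: submit the request now, collect the result later."""
    return pool.submit(os.pread, fd, size, 0)


def read_batch(pool, fd, size, chunk=CHUNK):
    """(3) Batch: submit many chunk-sized requests, then wait for all of them."""
    futures = [
        pool.submit(os.pread, fd, min(chunk, size - offset), offset)
        for offset in range(0, size, chunk)
    ]
    return b"".join(f.result() for f in futures)


def demo():
    """Read the same temporary file with all three styles."""
    payload = os.urandom(3 * CHUNK + 123)
    with tempfile.NamedTemporaryFile() as tmp:
        tmp.write(payload)
        tmp.flush()
        fd = os.open(tmp.name, os.O_RDONLY)
        try:
            with ThreadPoolExecutor(max_workers=4) as pool:
                sync_data = read_sync(fd, len(payload))
                async_data = read_async(pool, fd, len(payload)).result()
                batch_data = read_batch(pool, fd, len(payload))
        finally:
            os.close(fd)
    return payload, sync_data, async_data, batch_data
```

All three return the same bytes; what differs is when the caller blocks, which is the property that matters for overlapping checkpoint I/O with other work.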

If the performance properties were reasonable, we had plans to (A) integrate into torch.save/load and (B) provide thin wrapper torch APIs for users who do not use torch.save/load to save/load tensors

As a first step, we were trying to benchmark (1) with NVMe in non-compatibility mode. @Aidyn-A created a pytorch extension for synchronous saving/loading of tensors with benchmarking utilities here and I have a very preliminary prototype of upstreaming it into torch.save/load here

The install process for GPUDirect Storage in non-compatibility mode on the user end is tricky; we have not yet gotten it to run in non-compatibility mode with NVMe (the latest issue I personally hit was that the nvme module is built into my kernel rather than being a loadable module). So that is where we are at currently.
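For anyone hitting the same built-in versus loadable module question, a quick check is to look for the module in the kernel's `modules.builtin` listing. The sketch below assumes the usual `/lib/modules/<release>/modules.builtin` location; the function names and the `SAMPLE` listing are hypothetical and only serve as an illustration.

```python
import platform
from pathlib import Path


def is_builtin(module: str, builtin_listing: str) -> bool:
    """Return True if `module` appears in the text of a modules.builtin file."""
    target = module.replace("-", "_") + ".ko"
    for line in builtin_listing.splitlines():
        # Entries look like kernel/drivers/nvme/host/nvme.ko
        name = line.strip().rsplit("/", 1)[-1].replace("-", "_")
        if name == target:
            return True
    return False


def nvme_is_builtin(modules_root: str = "/lib/modules") -> bool:
    """Check the running kernel; reads the listing from disk when called."""
    listing = Path(modules_root, platform.release(), "modules.builtin")
    return is_builtin("nvme", listing.read_text())


# Made-up listing for illustration only
SAMPLE = (
    "kernel/drivers/nvme/host/nvme-core.ko\n"
    "kernel/drivers/nvme/host/nvme.ko\n"
    "kernel/fs/ext4/ext4.ko\n"
)
```

Calling `nvme_is_builtin()` on the affected machine would return `True` in the situation described above, where the module cannot simply be unloaded and replaced.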

I am curious -- do you have benchmark numbers of the performance of GPUDirect Storage in non-compatibility mode? If so would you be willing to share these + the hardware configuration/filesystem type you used for the benchmarks? 😄

mikaylagawarecki avatar Mar 21 '24 20:03 mikaylagawarecki

Hi @mikaylagawarecki! Absolutely, I will gladly share my benchmark configuration and results with you. I am dealing with some issues while trying to reproduce my original environment with the new Linux kernel version I'm using (6.5.0-14-generic). The following shows the incomplete environment configuration for running without compatibility mode:

enabled: ============
 ENVIRONMENT:
 ============
 =====================
 DRIVER CONFIGURATION:
 =====================
 NVMe               : Unsupported
 NVMeOF             : Unsupported
 SCSI               : Unsupported
 ScaleFlux CSD      : Unsupported
 NVMesh             : Unsupported
 DDN EXAScaler      : Unsupported
 IBM Spectrum Scale : Unsupported
 NFS                : Unsupported
 BeeGFS             : Unsupported
 WekaFS             : Unsupported
 Userspace RDMA     : Unsupported
 --Mellanox PeerDirect : Disabled
 --rdma library        : Not Loaded (libcufile_rdma.so)
 --rdma devices        : Not configured
 --rdma_device_status  : Up: 0 Down: 0
 =====================
 CUFILE CONFIGURATION:
 =====================
 properties.use_compat_mode : true
 properties.force_compat_mode : false
 properties.gds_rdma_write_support : true
 properties.use_poll_mode : false
 properties.poll_mode_max_size_kb : 4
 properties.max_batch_io_size : 128
 properties.max_batch_io_timeout_msecs : 5
 properties.max_direct_io_size_kb : 16384
 properties.max_device_cache_size_kb : 131072
 properties.max_device_pinned_mem_size_kb : 33554432
 properties.posix_pool_slab_size_kb : 4 1024 16384
 properties.posix_pool_slab_count : 128 64 32
 properties.rdma_peer_affinity_policy : RoundRobin
 properties.rdma_dynamic_routing : 0
 fs.generic.posix_unaligned_writes : false
 fs.lustre.posix_gds_min_kb: 0
 fs.beegfs.posix_gds_min_kb: 0
 fs.weka.rdma_write_support: false
 fs.gpfs.gds_write_support: false
 profile.nvtx : false
 profile.cufile_stats : 0
 miscellaneous.api_check_aggressive : false
 execution.max_io_threads : 0
 execution.max_io_queue_depth : 128
 execution.parallel_io : false
 execution.min_io_threshold_size_kb : 8192
 execution.max_request_parallelism : 0
 =========
 GPU INFO:
 =========
 GPU index 0 NVIDIA A100-SXM4-40GB bar:1 bar size (MiB):65536 supports GDS, IOMMU State: Disabled
 GPU index 1 NVIDIA A100-SXM4-40GB bar:1 bar size (MiB):65536 supports GDS, IOMMU State: Disabled

I'm going to test the prototype implementations you referenced, because my initial attempts and their results may no longer hold compared with the results these implementations produce. As highlighted in the RFC, all my experiments used the cuFile API via rapidsai/kvikio.

antferdom avatar Mar 26 '24 00:03 antferdom

@mikaylagawarecki

Benchmarking GPUDirect in Non-Compatibility Mode

System Information

Distro Version:

Distributor ID: Ubuntu
Description:    Ubuntu 22.04.4 LTS
Release:        22.04

Kernel Version 5.15.0-101-generic

Hardware Configuration

| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-40GB          On  | 00000000:82:00.0 Off |                  Off |
| N/A   31C    P0              23W / 400W |      4MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM4-40GB          On  | 00000000:C1:00.0 Off |                  Off |
| N/A   31C    P0              25W / 400W |      4MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

IOMMU

/etc/default/grub: GRUB_CMDLINE_LINUX_DEFAULT="amd_iommu=off"

Status: $ sudo dmesg | grep -i iommu

[    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-5.15.0-101-generic root=/dev/mapper/vgroot-lvroot ro processor.max_cstate=1 amd_iommu=off
[    0.403876] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-5.15.0-101-generic root=/dev/mapper/vgroot-lvroot ro processor.max_cstate=1 amd_iommu=off
[    1.159437] iommu: Default domain type: Translated 
[    1.159437] iommu: DMA domain TLB invalidation policy: lazy mode
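The same check can be scripted instead of grepping dmesg by eye. This is a minimal sketch that only looks for IOMMU-off flags on a kernel command line string; `iommu_off_flags` is a hypothetical helper, and on a live system the string would come from `/proc/cmdline`.

```python
# Kernel parameters that switch the IOMMU off (amd_iommu=off is the one used above)
IOMMU_OFF_FLAGS = ("amd_iommu=off", "intel_iommu=off", "iommu=off")


def iommu_off_flags(cmdline: str) -> list:
    """Return the IOMMU-disabling flags present on a kernel command line."""
    return [token for token in cmdline.split() if token in IOMMU_OFF_FLAGS]


# Command line copied from the dmesg output above
CMDLINE = (
    "BOOT_IMAGE=/boot/vmlinuz-5.15.0-101-generic "
    "root=/dev/mapper/vgroot-lvroot ro processor.max_cstate=1 amd_iommu=off"
)

print(iommu_off_flags(CMDLINE))
```

An empty list means no such flag was passed at boot, which is worth ruling out before debugging anything on the GDS side.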

MLNX_OFED Requirements and Installation

reference: 14. Troubleshooting and FAQ for NVMe and NVMeOF Support

apt install nvidia-gds-12-1
apt install nvidia-fs=2.17.3-1 nvidia-fs-dkms=2.17.3-1 # to downgrade

modprobe nvidia-fs
./mlnxofedinstall --with-nvmf --with-nfsrdma --enable-gds --add-kernel-support --dkms
apt install --reinstall `dpkg -l | grep 545 | awk '{print $2}'`
modprobe nvidia-peermem
modprobe nvme-rdma
modprobe nvmet-rdma

Displaying GDS NVIDIA FS Driver Statistics

/proc/driver/nvidia-fs$ cat stats: display driver statistics

GDS Version: 1.7.1.12 
NVFS statistics(ver: 4.0)
NVFS Driver(version: 2.17.3)
Mellanox PeerDirect Supported: True
IO stats: Disabled, peer IO stats: Disabled
Logging level: info

Active Shadow-Buffer (MiB): 0
Active Process: 0
Reads                           : err=0 io_state_err=0
Sparse Reads                    : n=0 io=0 holes=0 pages=0 
Writes                          : err=0 io_state_err=0 pg-cache=0 pg-cache-fail=0 pg-cache-eio=0
Mmap                            : n=0 ok=0 err=0 munmap=0
Bar1-map                        : n=0 ok=0 err=0 free=0 callbacks=0 active=0 delay-frees=0
Error                           : cpu-gpu-pages=0 sg-ext=0 dma-map=0 dma-ref=0
Ops                             : Read=0 Write=0 BatchIO=0

Disk & Filesystem Information

lsblk: Overview of system disk and partitions

NAME              MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
nvme0n1           259:0    0  1.8T  0 disk 
├─nvme0n1p1       259:1    0    1M  0 part 
└─nvme0n1p2       259:2    0  1.8T  0 part 
  └─vgroot-lvroot 253:0    0  1.8T  0 lvm  /

sudo lshw -class disk: Disk details

 *-namespace:0             
       description: NVMe disk
       physical id: 0
       logical name: hwmon0
  *-namespace:1
       description: NVMe disk
       physical id: 2
       logical name: /dev/ng0n1
  *-namespace:2
       description: NVMe disk
       physical id: 1
       bus info: nvme@0:1
       logical name: /dev/nvme0n1
       size: 1863GiB (2TB)
       capabilities: gpt-1.00 partitioned partitioned:gpt
       configuration: guid=6546f52c-62a4-ed48-b04a-2551ce27034c logicalsectorsize=512 sectorsize=512 wwid=eui.e8238fa6bf530001001b448b4cacce2b

cat /sys/block/nvme0n1/device/model: disk model (NVMe devices)

WD Red SN700 2000GB

sudo virt-what: assert bare-metal machine

Verifying a Successful GDS Installation

To verify that GDS installation was successful, run gdscheck:

/usr/local/cuda-12/gds/tools$ ./gdscheck -p

warn: error opening log file: Permission denied, logging will be disabled
 GDS release version: 1.6.1.9
 nvidia_fs version:  2.17 libcufile version: 2.12
 Platform: x86_64
 ============
 ENVIRONMENT:
 ============
 =====================
 DRIVER CONFIGURATION:
 =====================
 NVMe               : Supported
 NVMeOF             : Supported
 SCSI               : Unsupported
 ScaleFlux CSD      : Unsupported
 NVMesh             : Unsupported
 DDN EXAScaler      : Unsupported
 IBM Spectrum Scale : Unsupported
 NFS                : Unsupported
 BeeGFS             : Unsupported
 WekaFS             : Unsupported
 Userspace RDMA     : Unsupported
 --Mellanox PeerDirect : Enabled
 --rdma library        : Not Loaded (libcufile_rdma.so)
 --rdma devices        : Not configured
 --rdma_device_status  : Up: 0 Down: 0
 =====================
 CUFILE CONFIGURATION:
 =====================
 properties.use_compat_mode : true
 properties.force_compat_mode : false
 properties.gds_rdma_write_support : true
 properties.use_poll_mode : false
 properties.poll_mode_max_size_kb : 4
 properties.max_batch_io_size : 128
 properties.max_batch_io_timeout_msecs : 5
 properties.max_direct_io_size_kb : 16384
 properties.max_device_cache_size_kb : 131072
 properties.max_device_pinned_mem_size_kb : 33554432
 properties.posix_pool_slab_size_kb : 4 1024 16384 
 properties.posix_pool_slab_count : 128 64 32 
 properties.rdma_peer_affinity_policy : RoundRobin
 properties.rdma_dynamic_routing : 0
 fs.generic.posix_unaligned_writes : false
 fs.lustre.posix_gds_min_kb: 0
 fs.beegfs.posix_gds_min_kb: 0
 fs.weka.rdma_write_support: false
 fs.gpfs.gds_write_support: false
 profile.nvtx : false
 profile.cufile_stats : 0
 miscellaneous.api_check_aggressive : false
 execution.max_io_threads : 0
 execution.max_io_queue_depth : 128
 execution.parallel_io : false
 execution.min_io_threshold_size_kb : 8192
 execution.max_request_parallelism : 0
 =========
 GPU INFO:
 =========
 GPU index 0 NVIDIA A100-SXM4-40GB bar:1 bar size (MiB):65536 supports GDS, IOMMU State: Disabled
 GPU index 1 NVIDIA A100-SXM4-40GB bar:1 bar size (MiB):65536 supports GDS, IOMMU State: Disabled
 ==============
 PLATFORM INFO:
 ==============
 IOMMU: disabled
 Platform verification succeeded
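Since this output is long and the two runs in this thread differ only in a few lines, it can help to parse it and compare the relevant fields programmatically. The sketch below is a plain text parser for `key : value` lines; `parse_gdscheck` and `direct_path_report` are hypothetical names, and `SAMPLE` contains a handful of lines copied from the output above.

```python
def parse_gdscheck(text: str) -> dict:
    """Collect 'key : value' lines from `gdscheck -p` output into a dict."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip().lstrip("-")  # sub-items are prefixed with "--"
        key, sep, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if sep and key and value:  # skips rulers and bare section titles
            settings[key] = value
    return settings


def direct_path_report(settings: dict, transport: str = "NVMe") -> dict:
    """Summarize the fields most relevant to running without compat mode."""
    return {
        "transport_supported": settings.get(transport) == "Supported",
        "use_compat_mode": settings.get("properties.use_compat_mode") == "true",
        "iommu": settings.get("IOMMU"),
    }


SAMPLE = """\
 =====================
 DRIVER CONFIGURATION:
 =====================
 NVMe               : Supported
 NVMeOF             : Supported
 SCSI               : Unsupported
 --Mellanox PeerDirect : Enabled
 properties.use_compat_mode : true
 IOMMU: disabled
"""

settings = parse_gdscheck(SAMPLE)
report = direct_path_report(settings)
```

Running the same parser over the earlier output (where NVMe was reported as Unsupported) and diffing the two dictionaries is a quick way to see what the reinstall changed.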

Experiment 0: Synthetic 10 GB Torch tensor cuFile save/load

Assert Initial Conditions: NVFS status and properties

import kvikio
from kvikio import CuFile
import kvikio.defaults
from kvikio.defaults import set_compat_mode, compat_mode, compat_mode_reset


def test_compat_mode() -> None:
    before = compat_mode()
    print(f"Driver compat mode: {before}")
    with set_compat_mode(True):
        assert compat_mode()
        compat_mode_reset(False)
        assert not compat_mode()
    assert before == compat_mode()

# test_compat_mode()
print(f"Compatibility mode: {compat_mode()}")
props = kvikio.DriverProperties()
print(f"GDS Driver availability: {props.is_gds_available}")
if props.is_gds_available:
    print(f"v{props.major_version}.{props.minor_version}")
Compatibility mode: False
GDS Driver availability: True
v2.17
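One caveat: KvikIO reports compatibility mode as off here, while the gdscheck output still lists `properties.use_compat_mode : true`, so cuFile itself may still be allowed to fall back to POSIX I/O. A possible way to rule that out is to point cuFile at a config that forbids the fallback. The sketch below assumes, from my reading of NVIDIA's GDS documentation, that the relevant `cufile.json` key is `allow_compat_mode` and that the `CUFILE_ENV_PATH_JSON` environment variable selects the config file; both should be verified against the installed version. Whether a file containing only this key is accepted is also an assumption; copying the stock `/etc/cufile.json` and changing just that key is the safer route. The function names are hypothetical.

```python
import json
import os
from pathlib import Path


def write_strict_cufile_config(directory) -> Path:
    """Write a cufile.json that disallows the POSIX compatibility fallback."""
    # Assumption: "allow_compat_mode" is the key behind gdscheck's use_compat_mode
    config = {"properties": {"allow_compat_mode": False}}
    path = Path(directory) / "cufile.json"
    path.write_text(json.dumps(config, indent=2))
    return path


def strict_env(config_path, base_env=None) -> dict:
    """Return an environment that points cuFile at the given config file."""
    env = dict(os.environ if base_env is None else base_env)
    # Assumption: cuFile reads this variable when the library is first loaded
    env["CUFILE_ENV_PATH_JSON"] = str(config_path)
    return env
```

The returned environment would be passed to the benchmark process (for example via `subprocess.run(..., env=strict_env(path))`), so that a missing direct path shows up as an error instead of a silently slower run.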

source code: cutorch.py

# %%
import kvikio
import kvikio.defaults
from kvikio.defaults import (
    get_num_threads,
    set_num_threads,
)
import cupy as cp
import torch

import logging
import time
import os
from pathlib import Path

logging.basicConfig(level=logging.INFO)
logger: logging.Logger = logging.getLogger(__name__)


TENSOR_DIMS = (50_000, 50_000)
TENSOR_FN = Path("consolidated.00.pth")
NUM_THREADS = 32
before = get_num_threads()

print(f"Tensor dimensions: {TENSOR_DIMS}")
print(f"Tensor fn: {TENSOR_FN}")
print(f"kvikio number of threads: {before}")
print(f"GPU number of threads: {NUM_THREADS}")
Tensor dimensions: (50000, 50000)
Tensor fn: consolidated.00.pth
kvikio number of threads: 1
GPU number of threads: 32
# %%
# cuFile serialization
st = time.perf_counter_ns()
x = torch.empty(*TENSOR_DIMS, device="cuda")
x_cu = cp.asarray(x)
# Write whole array to file
with kvikio.defaults.set_num_threads(NUM_THREADS):
    assert get_num_threads() == NUM_THREADS
    f = kvikio.CuFile(TENSOR_FN, "w")
    f.write(x_cu)
    f.close()
et = time.perf_counter_ns() - st
print(f"cuFile serialization elapsed time: {et*1e-9:.2f} s")
del x, x_cu
torch.cuda.empty_cache()
cuFile serialization elapsed time: 3.72 s
# %%
# cuFile torch tensor deserialization
# import cunumeric as num
tensor_size = os.path.getsize(TENSOR_FN)
print(f"Tensor size: {tensor_size / 1e09:.2f} GB")
# Preallocate the destination buffer on the GPU before starting the timer
x_cu = cp.asarray(torch.empty(*TENSOR_DIMS, device="cuda"))
# x_cu = cp.empty(shape=(50_000, 50_000))
with kvikio.defaults.set_num_threads(NUM_THREADS):
    assert get_num_threads() == NUM_THREADS
    # Time only the file open, the read, and the zero-copy wrap as a torch tensor
    st = time.perf_counter_ns()
    f = kvikio.CuFile(TENSOR_FN, "r")
    f.read(x_cu)
    x_cutorch = torch.as_tensor(x_cu, device="cuda")
print(f"Tensor loading time: {(time.perf_counter_ns() - st)*1e-9:.4f} s")
f.close()
print(f"Device: {x_cutorch.device}")
# %%
del x_cutorch, x_cu
torch.cuda.empty_cache()
Tensor size: 10.00 GB
Tensor loading time: 3.4625 s
Device: cuda:0
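Converting the two timings above into throughput makes them easier to compare with the drive's rated bandwidth. This is simple arithmetic on the reported numbers; the 4-byte element size assumes the default float32 dtype of `torch.empty`, which matches the reported 10.00 GB file, and the helper names are hypothetical.

```python
def tensor_bytes(dims, itemsize=4):
    """Size in bytes of a dense tensor with the given dimensions."""
    total = itemsize
    for d in dims:
        total *= d
    return total


def throughput_gb_s(nbytes, seconds):
    """Throughput in GB/s (1 GB = 1e9 bytes, as in the output above)."""
    return nbytes / 1e9 / seconds


size = tensor_bytes((50_000, 50_000))         # 10_000_000_000 bytes
write_gb_s = throughput_gb_s(size, 3.72)      # reported serialization time
read_gb_s = throughput_gb_s(size, 3.4625)     # reported loading time

print(f"write: {write_gb_s:.2f} GB/s, read: {read_gb_s:.2f} GB/s")
# → write: 2.69 GB/s, read: 2.89 GB/s
```

The write figure is a lower bound, because the serialization timer above starts before the tensor is allocated on the GPU.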

Verify that system caches are not impacting the experiment measurements:

vmtouch consolidated.00.pth 
           Files: 1
     Directories: 0
  Resident Pages: 191/2441407  764K/9G  0.00782%
         Elapsed: 0.060878 seconds
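The resident-page ratio can also be checked in a script, for example to fail a benchmark run when too much of the file sits in the page cache. The sketch below parses the `Resident Pages` line of the vmtouch output shown above; `resident_fraction` is a hypothetical helper and the 1% threshold is an arbitrary choice for illustration.

```python
import re


def resident_fraction(vmtouch_output: str) -> float:
    """Fraction of the file's pages that are resident in the page cache."""
    match = re.search(r"Resident Pages:\s*(\d+)/(\d+)", vmtouch_output)
    if match is None:
        raise ValueError("no 'Resident Pages' line found")
    resident, total = map(int, match.groups())
    return resident / total


# Output copied from the vmtouch run above
SAMPLE = """\
           Files: 1
     Directories: 0
  Resident Pages: 191/2441407  764K/9G  0.00782%
         Elapsed: 0.060878 seconds
"""

fraction = resident_fraction(SAMPLE)
cache_is_cold = fraction < 0.01  # arbitrary 1% threshold
```

With 191 of 2441407 pages resident, the file is effectively absent from the page cache, so the load timing is unlikely to have been served from host memory.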

antferdom avatar Mar 27 '24 09:03 antferdom

DeepSpeed has an upcoming feature (version >= 0.15) focused on NVMe technologies with GDS: DeepNVMe: Improving DL Applications through I/O Optimizations. @mikaylagawarecki

antferdom avatar Aug 05 '24 13:08 antferdom