kvikio
Why is compat mode faster than GPUDirect read with the given Python example?
Hi, I tried the gdsio tool and it works as expected: compat mode is slower than GPUDirect reads. But when I check this library with the Python example below, the behavior is the opposite (compat mode is faster than the GPUDirect read). Could you please let me know why?
Modified version of the example given in README.md:
#!/usr/bin/python
import time

import cupy

import kvikio
import kvikio.defaults


def main(nelem):
    print("Tensor size:", nelem)
    path = "/mnt/tmp/test-file"

    start_time = time.time()
    a = cupy.arange(nelem)
    f = kvikio.CuFile(path, "w")
    # Write whole array to file
    f.write(a)
    f.close()
    print("--- %s seconds write time---" % (time.time() - start_time))

    # Read whole array from file
    start_time = time.time()
    b = cupy.empty_like(a)
    print("--- %s seconds buffer creation---" % (time.time() - start_time))
    f = kvikio.CuFile(path, "r")
    f.read(b)
    print("--- %s seconds buffer creation and load time---" % (time.time() - start_time))
    assert all(a == b)

    # Use context manager
    start_time = time.time()
    c = cupy.empty_like(a)
    with kvikio.CuFile(path, "r") as f:
        f.read(c)
    print("--- %s seconds buffer creation and load time with context manager---" % (time.time() - start_time))
    assert all(a == c)

    # Non-blocking read
    start_time = time.time()
    d = cupy.empty_like(a)
    with kvikio.CuFile(path, "r") as f:
        future1 = f.pread(d[:nelem // 2])
        future2 = f.pread(d[nelem // 2:], file_offset=d[:nelem // 2].nbytes)
        future1.get()  # Wait for first read
        future2.get()  # Wait for second read
    print("--- %s seconds buffer creation and load time with non block read---" % (time.time() - start_time))

    start_time = time.time()
    assert all(a == d)
    print("--- %s seconds assertion time---" % (time.time() - start_time))


arr_sizes = [100, 1000000]

kvikio.defaults.compat_mode_reset(False)
assert not kvikio.defaults.compat_mode()
for elem in arr_sizes:
    main(elem)

kvikio.defaults.compat_mode_reset(True)
assert kvikio.defaults.compat_mode()
print("COMPAT MODE..")
for elem in arr_sizes:
    main(elem)
output:
Tensor size: 100
--- 0.36174535751342773 seconds write time---
--- 0.00011444091796875 seconds buffer creation---
--- 0.003509044647216797 seconds buffer creation and load time---
--- 0.000995635986328125 seconds buffer creation and load time with context manager---
--- 0.0020360946655273438 seconds buffer creation and load time with non block read---
--- 0.0019338130950927734 seconds assertion time---
Tensor size: 1000000
--- 0.3805253505706787 seconds write time---
--- 0.0003936290740966797 seconds buffer creation---
--- 0.02455925941467285 seconds buffer creation and load time---
--- 0.045484304428100586 seconds buffer creation and load time with context manager---
--- 0.07375788688659668 seconds buffer creation and load time with non block read---
--- 18.388749361038208 seconds assertion time---
COMPAT MODE..
Tensor size: 100
--- 0.04293179512023926 seconds write time---
--- 9.965896606445312e-05 seconds buffer creation---
--- 0.00044655799865722656 seconds buffer creation and load time---
--- 0.0001728534698486328 seconds buffer creation and load time with context manager---
--- 0.00022649765014648438 seconds buffer creation and load time with non block read---
--- 0.0018930435180664062 seconds assertion time---
Tensor size: 1000000
--- 0.05194258689880371 seconds write time---
--- 1.8596649169921875e-05 seconds buffer creation---
--- 0.002271890640258789 seconds buffer creation and load time---
--- 0.002416372299194336 seconds buffer creation and load time with context manager---
--- 0.0020689964294433594 seconds buffer creation and load time with non block read---
--- 18.475173473358154 seconds assertion time---
Could you try with some large buffers like:
KVIKIO_COMPAT_MODE=ON python python/benchmarks/single-node-io.py --nruns 5 --nbytes 100MB
KVIKIO_COMPAT_MODE=OFF python python/benchmarks/single-node-io.py --nruns 5 --nbytes 100MB
@madsbk The benchmarks/single-node-io.py file seems to produce reasonable output. But why doesn't it behave the same in the example I showed above? Also, I have two more questions:
- How to choose a GPU device id for a multi-GPU system?
- How to choose a random/sequential option for the read/write strategy?
I am asking about these because the gdsio tool has more options (like NUMA node, in addition to the above). Can I pass those options through this library? Please let me know.
@madsbk The benchmarks/single-node-io.py file seems to produce reasonable output. But why doesn't it behave the same in the example I showed above?
I guess it is because of initialization overhead. Try to time only the write itself:
a = cupy.arange(nelem)
f = kvikio.CuFile(path, "w")
# Write whole array to file
start_time = time.time()
f.write(a)
f.close()
print("--- %s seconds write time---" % (time.time() - start_time))
I am working on a PR that enables compat mode and disables the threadpool for small buffers (<1MB) by default: https://github.com/rapidsai/kvikio/pull/190
- How to choose a GPU device id for a multi-GPU system?
Set the CUDA_VISIBLE_DEVICES environment variable to the device index, e.g. CUDA_VISIBLE_DEVICES=3.
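If you prefer to set it from Python, a minimal sketch, assuming nothing in the process has initialized CUDA yet (the variable is only honored when CUDA is initialized):
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "3"  # must be set before CUDA is initialized

import cupy
import kvikio  # GPU 3 now appears as device 0 to this process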
- How to choose a random/sequential option for the read/write strategy?
I am not sure I follow. KvikIO schedules IO using a threadpool of size KVIKIO_NTHREADS, and each thread works on KVIKIO_TASK_SIZE bytes. It isn't possible to control the order of thread execution.
I am asking about these because the gdsio tool has more options (like NUMA node, in addition to the above). Can I pass those options through this library? Please let me know.
You can set cuFile-specific options in the cufile.json config file. Is there a particular option you are thinking about?
Hi @madsbk, thank you for your response.
- If I want to use multiple GPUs with CUDA_VISIBLE_DEVICES='3,4', can I control from the Python script which device a read operation goes to using this library?
- Regarding the other options: the KvikIO C++ example shows how to set the number of threads, but the Python one does not. Is it possible to set KVIKIO_NTHREADS and KVIKIO_TASK_SIZE from Python as well?
- When I mentioned the NUMA node and sequential/random read: the GDS benchmarking page https://docs.nvidia.com/gpudirect-storage/configuration-guide/index.html uses the gdsio tool from their toolkit to benchmark throughput and latency. It has these options, and they are not from cufile.json. That is why I was asking whether they can be set from this library as well.
>> ./gdsio --help
gdsio version : XXX
Usage [using config file]: gdsio rw-sample.gdsio
Usage [using cmd line options]: ./gdsio
-f <file name>
-D <directory name>
-d <gpu_index (refer nvidia-smi)>
-n <numa node>
-w <number of threads for a job>
-s <file size(K|M|G)>
-o <start offset(K|M|G)>
-i <io_size(K|M|G)> <min_size:max_size:step_size>
-p <enable nvlinks>
-b <skip bufregister>
-o <start file offset>
-V <verify IO>
-x <xfer_type>
-I <(read) 0|(write)1| (randread) 2| (randwrite) 3>
-T <duration in seconds>
-k <random_seed> (number e.g. 3456) to be used with random read/write>
-U <use unaligned(4K) random offsets>
-R <fill io buffer with random data>
-F <refill io buffer with random data during each write>
-B <batch size>
xfer_type:
0 - Storage->GPU (GDS)
1 - Storage->CPU
2 - Storage->CPU->GPU
3 - Storage->CPU->GPU_ASYNC
4 - Storage->PAGE_CACHE->CPU->GPU
5 - Storage->GPU_ASYNC
6 - Storage->GPU_BATCH
Thank you so much!
1. If I want to use multiple GPUs with CUDA_VISIBLE_DEVICES='3,4', can I control from the Python script which device a read operation goes to using this library?
Currently, python/benchmarks/single-node-io.py only reads and writes to a single GPU.
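That said, here is a speculative sketch of how one might target specific devices from a Python script, assuming the read lands on whichever device the destination buffer was allocated on (the device is made current while reading, so this should hold; the path and sizes are placeholders):
import cupy
import kvikio

path = "/mnt/tmp/test-file"  # placeholder path
nelem = 1_000_000

# With CUDA_VISIBLE_DEVICES='3,4', GPUs 3 and 4 appear as indices 0 and 1.
for dev in (0, 1):
    with cupy.cuda.Device(dev):
        buf = cupy.empty(nelem, dtype="int64")  # buffer lives on this device
        with kvikio.CuFile(path, "r") as f:
            f.read(buf)  # read fills the buffer on device `dev`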
2. Regarding the other options: the KvikIO C++ example shows how to set the number of threads, but the Python one does not. Is it possible to set KVIKIO_NTHREADS and KVIKIO_TASK_SIZE from Python as well?
Yes, you can do something like the following (see test_defaults.py for more examples):
with kvikio.defaults.set_num_threads(2):
    with kvikio.defaults.set_task_size(3):
        ...  # my code
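Put together with the earlier read example, a minimal sketch (the path is a placeholder, and 2 threads / a 3-byte task size are just illustrative values):
import cupy
import kvikio
import kvikio.defaults

path = "/mnt/tmp/test-file"  # placeholder path
b = cupy.empty(1_000_000, dtype="int64")

with kvikio.defaults.set_num_threads(2):    # threadpool size inside this block
    with kvikio.defaults.set_task_size(3):  # bytes handled per task
        with kvikio.CuFile(path, "r") as f:
            f.read(b)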
3. When I mentioned the NUMA node and sequential/random read: the GDS benchmarking page https://docs.nvidia.com/gpudirect-storage/configuration-guide/index.html uses the gdsio tool from their toolkit to benchmark throughput and latency. It has these options, and they are not from cufile.json. That is why I was asking whether they can be set from this library as well.
Again, python/benchmarks/single-node-io.py doesn't support this :/