
Add an option to run cuIO benchmarks with pinned buffers as input

Open vuule opened this issue 1 year ago • 1 comments

Description

Adds io_type::PINNED_BUFFER, which allows cuIO benchmarks to use a pinned buffer as input. In this case the output is still a std::vector, same as with io_type::HOST_BUFFER. Also stops the use of cudf::io::io_type in benchmarks, to allow benchmark-specific IO types such as this one.

TODO: Run multithreaded parquet benchmark with pinned and pageable input on a lab system.
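For illustration, a benchmark-local IO type enum decoupled from cudf::io::io_type might look like the sketch below. The enum and the parsing helper are assumptions for illustration, not the actual PR code; the point is that a benchmark-owned enum can grow entries (like PINNED_BUFFER) that the library enum has no reason to carry.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical benchmark-local IO type, decoupled from cudf::io::io_type so
// that benchmark-only inputs such as a pinned host buffer can be added.
enum class io_type { FILEPATH, HOST_BUFFER, PINNED_BUFFER, DEVICE_BUFFER };

// Parse an io_type name as it might appear on a benchmark command line.
io_type parse_io_type(std::string const& name)
{
  if (name == "FILEPATH") return io_type::FILEPATH;
  if (name == "HOST_BUFFER") return io_type::HOST_BUFFER;
  if (name == "PINNED_BUFFER") return io_type::PINNED_BUFFER;
  if (name == "DEVICE_BUFFER") return io_type::DEVICE_BUFFER;
  throw std::invalid_argument("unknown io_type: " + name);
}
```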

Checklist

  • [x] I am familiar with the Contributing Guidelines.
  • [ ] New or existing tests cover these changes.
  • [ ] The documentation is up to date with these changes.

vuule avatar May 22 '24 21:05 vuule

CC @GregoryKimball @nvdbaranec

vuule avatar May 22 '24 23:05 vuule

Parquet reader benchmarks (partial results) show a clear signal for pinned input compared to pageable input:

|    io_type    | compression_type | cardinality | run_length | Samples |  CPU Time  | Noise |  GPU Time  | Noise | bytes_per_second | peak_memory_usage | encoded_file_size |
|---------------|------------------|-------------|------------|---------|------------|-------|------------|-------|------------------|-------------------|-------------------|
| PINNED_BUFFER |           SNAPPY |           0 |          1 |      6x |  94.025 ms | 0.38% |  94.016 ms | 0.38% |       5710404236 |         1.365 GiB |       463.356 MiB |
|   HOST_BUFFER |           SNAPPY |           0 |          1 |      5x | 109.794 ms | 0.35% | 109.785 ms | 0.35% |       4890200847 |         1.365 GiB |       463.356 MiB |
| DEVICE_BUFFER |           SNAPPY |           0 |          1 |      7x |  74.802 ms | 0.31% |  74.794 ms | 0.31% |       7178027874 |         1.365 GiB |       463.356 MiB |
| PINNED_BUFFER |             NONE |           0 |          1 |    285x |  52.570 ms | 1.84% |  52.561 ms | 1.84% |      10214162353 |       976.374 MiB |       472.458 MiB |
|   HOST_BUFFER |             NONE |           0 |          1 |    235x |  63.742 ms | 8.74% |  63.733 ms | 8.74% |       8423736635 |       976.374 MiB |       472.458 MiB |
| DEVICE_BUFFER |             NONE |           0 |          1 |    486x |  30.752 ms | 1.12% |  30.743 ms | 1.12% |      17462916129 |       976.374 MiB |       472.458 MiB |
| PINNED_BUFFER |           SNAPPY |        1000 |          1 |    303x |  49.492 ms | 0.87% |  49.483 ms | 0.87% |      10849505601 |       799.405 MiB |       149.632 MiB |
|   HOST_BUFFER |           SNAPPY |        1000 |          1 |     80x |  54.009 ms | 1.21% |  54.000 ms | 1.21% |       9941990707 |       799.451 MiB |       149.632 MiB |
| DEVICE_BUFFER |           SNAPPY |        1000 |          1 |     21x |  43.121 ms | 0.50% |  43.113 ms | 0.50% |      12452696478 |       799.405 MiB |       149.632 MiB |
| PINNED_BUFFER |             NONE |        1000 |          1 |    330x |  45.322 ms | 1.40% |  45.313 ms | 1.40% |      11847938588 |       660.763 MiB |       157.620 MiB |
|   HOST_BUFFER |             NONE |        1000 |          1 |    307x |  48.737 ms | 0.99% |  48.728 ms | 0.99% |      11017642711 |       660.763 MiB |       157.620 MiB |
| DEVICE_BUFFER |             NONE |        1000 |          1 |     14x |  37.741 ms | 0.35% |  37.732 ms | 0.35% |      14228494999 |       660.763 MiB |       157.620 MiB |
| PINNED_BUFFER |           SNAPPY |           0 |         32 |    240x |  46.794 ms | 0.89% |  46.785 ms | 0.89% |      11475211343 |       980.738 MiB |        64.295 MiB |
|   HOST_BUFFER |           SNAPPY |           0 |         32 |    305x |  49.157 ms | 1.48% |  49.148 ms | 1.48% |      10923512449 |       980.742 MiB |        64.295 MiB |
| DEVICE_BUFFER |           SNAPPY |           0 |         32 |     12x |  43.601 ms | 0.48% |  43.592 ms | 0.48% |      12315840649 |       980.738 MiB |        64.295 MiB |
| PINNED_BUFFER |             NONE |           0 |         32 |    325x |  46.055 ms | 0.92% |  46.046 ms | 0.92% |      11659360859 |       918.591 MiB |       413.967 MiB |
|   HOST_BUFFER |             NONE |           0 |         32 |     80x |  56.248 ms | 1.16% |  56.238 ms | 1.16% |       9546324040 |       918.591 MiB |       413.967 MiB |
| DEVICE_BUFFER |             NONE |           0 |         32 |    208x |  27.393 ms | 0.77% |  27.385 ms | 0.77% |      19604901646 |       918.591 MiB |       413.967 MiB |
| PINNED_BUFFER |           SNAPPY |        1000 |         32 |    383x |  39.060 ms | 1.18% |  39.052 ms | 1.18% |      13747741214 |       557.858 MiB |        24.034 MiB |
|   HOST_BUFFER |           SNAPPY |        1000 |         32 |     13x |  39.797 ms | 0.48% |  39.787 ms | 0.47% |      13493467556 |       557.865 MiB |        24.034 MiB |
| DEVICE_BUFFER |           SNAPPY |        1000 |         32 |    394x |  37.948 ms | 1.85% |  37.939 ms | 1.85% |      14150749943 |       557.858 MiB |        24.034 MiB |
| PINNED_BUFFER |             NONE |        1000 |         32 |    112x |  35.622 ms | 1.25% |  35.613 ms | 1.25% |      15074930029 |       533.921 MiB |        30.799 MiB |
|   HOST_BUFFER |             NONE |        1000 |         32 |    409x |  36.558 ms | 1.97% |  36.549 ms | 1.97% |      14689106036 |       533.921 MiB |        30.799 MiB |
| DEVICE_BUFFER |             NONE |        1000 |         32 |    272x |  33.879 ms | 1.07% |  33.870 ms | 1.07% |      15850926773 |       533.921 MiB |        30.799 MiB |

On the other hand, we are not observing a clear signal with the multithreaded Parquet benchmark, even in the single-threaded cases. This is something we'll want to investigate as we look further into multithreaded scaling.

vuule avatar May 28 '24 22:05 vuule

/merge

vuule avatar Jun 03 '24 17:06 vuule