Add an option to run cuIO benchmarks with pinned buffers as input
## Description
Adds `io_type::PINNED_BUFFER`, which allows cuIO benchmarks to use a pinned buffer as input. The output is still a `std::vector` in this case, same as with `io_type::HOST_BUFFER`.
Also stops the use of `cudf::io::io_type` in benchmarks, to allow benchmark-specific IO types such as this one.
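For context, a minimal sketch of how a pinned host buffer can be fed to the Parquet reader. `cudf::io::source_info`, `parquet_reader_options`, and `read_parquet` are real cuIO APIs; the pinned allocation via `cudaMallocHost` and the surrounding wiring are illustrative assumptions, not the actual benchmark code (requires a GPU to run):

```cpp
#include <cudf/io/parquet.hpp>
#include <cuda_runtime.h>

#include <cstring>
#include <vector>

void read_from_pinned_buffer(std::vector<char> const& file_data)
{
  // Allocate page-locked (pinned) host memory and copy the encoded file into it.
  // Pinned memory enables faster, async-capable host-to-device transfers,
  // which is what the PINNED_BUFFER input path is measuring.
  char* pinned = nullptr;
  cudaMallocHost(&pinned, file_data.size());
  std::memcpy(pinned, file_data.data(), file_data.size());

  // Hand the pinned buffer to the reader as an ordinary host buffer source;
  // cuIO does not need to know the memory is pinned to benefit from it.
  auto const source  = cudf::io::source_info{pinned, file_data.size()};
  auto const options = cudf::io::parquet_reader_options::builder(source).build();
  auto const result  = cudf::io::read_parquet(options);

  cudaFreeHost(pinned);
}
```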
TODO: Run multithreaded parquet benchmark with pinned and pageable input on a lab system.
## Checklist
- [x] I am familiar with the Contributing Guidelines.
- [ ] New or existing tests cover these changes.
- [ ] The documentation is up to date with these changes.
CC @GregoryKimball @nvdbaranec
Parquet reader benchmarks (partial) show a clear signal compared to pageable input:
| io_type | compression_type | cardinality | run_length | Samples | CPU Time | Noise | GPU Time | Noise | bytes_per_second | peak_memory_usage | encoded_file_size |
|---------------|------------------|-------------|------------|---------|------------|-------|------------|-------|------------------|-------------------|-------------------|
| PINNED_BUFFER | SNAPPY | 0 | 1 | 6x | 94.025 ms | 0.38% | 94.016 ms | 0.38% | 5710404236 | 1.365 GiB | 463.356 MiB |
| HOST_BUFFER | SNAPPY | 0 | 1 | 5x | 109.794 ms | 0.35% | 109.785 ms | 0.35% | 4890200847 | 1.365 GiB | 463.356 MiB |
| DEVICE_BUFFER | SNAPPY | 0 | 1 | 7x | 74.802 ms | 0.31% | 74.794 ms | 0.31% | 7178027874 | 1.365 GiB | 463.356 MiB |
| PINNED_BUFFER | NONE | 0 | 1 | 285x | 52.570 ms | 1.84% | 52.561 ms | 1.84% | 10214162353 | 976.374 MiB | 472.458 MiB |
| HOST_BUFFER | NONE | 0 | 1 | 235x | 63.742 ms | 8.74% | 63.733 ms | 8.74% | 8423736635 | 976.374 MiB | 472.458 MiB |
| DEVICE_BUFFER | NONE | 0 | 1 | 486x | 30.752 ms | 1.12% | 30.743 ms | 1.12% | 17462916129 | 976.374 MiB | 472.458 MiB |
| PINNED_BUFFER | SNAPPY | 1000 | 1 | 303x | 49.492 ms | 0.87% | 49.483 ms | 0.87% | 10849505601 | 799.405 MiB | 149.632 MiB |
| HOST_BUFFER | SNAPPY | 1000 | 1 | 80x | 54.009 ms | 1.21% | 54.000 ms | 1.21% | 9941990707 | 799.451 MiB | 149.632 MiB |
| DEVICE_BUFFER | SNAPPY | 1000 | 1 | 21x | 43.121 ms | 0.50% | 43.113 ms | 0.50% | 12452696478 | 799.405 MiB | 149.632 MiB |
| PINNED_BUFFER | NONE | 1000 | 1 | 330x | 45.322 ms | 1.40% | 45.313 ms | 1.40% | 11847938588 | 660.763 MiB | 157.620 MiB |
| HOST_BUFFER | NONE | 1000 | 1 | 307x | 48.737 ms | 0.99% | 48.728 ms | 0.99% | 11017642711 | 660.763 MiB | 157.620 MiB |
| DEVICE_BUFFER | NONE | 1000 | 1 | 14x | 37.741 ms | 0.35% | 37.732 ms | 0.35% | 14228494999 | 660.763 MiB | 157.620 MiB |
| PINNED_BUFFER | SNAPPY | 0 | 32 | 240x | 46.794 ms | 0.89% | 46.785 ms | 0.89% | 11475211343 | 980.738 MiB | 64.295 MiB |
| HOST_BUFFER | SNAPPY | 0 | 32 | 305x | 49.157 ms | 1.48% | 49.148 ms | 1.48% | 10923512449 | 980.742 MiB | 64.295 MiB |
| DEVICE_BUFFER | SNAPPY | 0 | 32 | 12x | 43.601 ms | 0.48% | 43.592 ms | 0.48% | 12315840649 | 980.738 MiB | 64.295 MiB |
| PINNED_BUFFER | NONE | 0 | 32 | 325x | 46.055 ms | 0.92% | 46.046 ms | 0.92% | 11659360859 | 918.591 MiB | 413.967 MiB |
| HOST_BUFFER | NONE | 0 | 32 | 80x | 56.248 ms | 1.16% | 56.238 ms | 1.16% | 9546324040 | 918.591 MiB | 413.967 MiB |
| DEVICE_BUFFER | NONE | 0 | 32 | 208x | 27.393 ms | 0.77% | 27.385 ms | 0.77% | 19604901646 | 918.591 MiB | 413.967 MiB |
| PINNED_BUFFER | SNAPPY | 1000 | 32 | 383x | 39.060 ms | 1.18% | 39.052 ms | 1.18% | 13747741214 | 557.858 MiB | 24.034 MiB |
| HOST_BUFFER | SNAPPY | 1000 | 32 | 13x | 39.797 ms | 0.48% | 39.787 ms | 0.47% | 13493467556 | 557.865 MiB | 24.034 MiB |
| DEVICE_BUFFER | SNAPPY | 1000 | 32 | 394x | 37.948 ms | 1.85% | 37.939 ms | 1.85% | 14150749943 | 557.858 MiB | 24.034 MiB |
| PINNED_BUFFER | NONE | 1000 | 32 | 112x | 35.622 ms | 1.25% | 35.613 ms | 1.25% | 15074930029 | 533.921 MiB | 30.799 MiB |
| HOST_BUFFER | NONE | 1000 | 32 | 409x | 36.558 ms | 1.97% | 36.549 ms | 1.97% | 14689106036 | 533.921 MiB | 30.799 MiB |
| DEVICE_BUFFER | NONE | 1000 | 32 | 272x | 33.879 ms | 1.07% | 33.870 ms | 1.07% | 15850926773 | 533.921 MiB | 30.799 MiB |
On the other hand, the multithreaded Parquet benchmark does not show a clear signal, even in the single-threaded cases. This is something we'll want to investigate as we look further into multithreaded scaling.
/merge