`nrows` in `benchmark_dataframe.py` might skew results for test_Read_CSV
We currently use `nrows` to parameterize how many rows are read into dataframes, which might skew results for smaller files in `test_Read_CSV`:
https://github.com/pentschev/pybench/blob/89d65a6c418a1fee39d447bd11b8a999835b74a9/pybench/benchmarks/benchmark_dataframe.py#L16
In the current cudf implementation, I think that even with `nrows` set we parse the entire file in GPU memory to find line terminators (and quote characters). This might skew our results for reads that use `nrows`, so we might want to change this.
See comments https://github.com/rapidsai/cudf/issues/1643#issuecomment-492490919 and https://github.com/rapidsai/cudf/issues/1643#issuecomment-492848687 on issue https://github.com/rapidsai/cudf/issues/1643.
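One possible change, sketched below: slice the first `nrows` lines into a temporary file before timing, so the reader never has to scan past the data being benchmarked. The helper name `read_csv_head` is hypothetical and not part of pybench or cudf; this is just a minimal sketch of the idea.

```python
# Hypothetical helper (not part of pybench or cudf): pre-slice the CSV so
# the reader never has to scan past the rows being benchmarked.
import itertools
import tempfile

import cudf

def read_csv_head(path, nrows):
    """Copy the header plus `nrows` data lines to a temp file, then read it."""
    with open(path) as src, tempfile.NamedTemporaryFile(
        mode="w", suffix=".csv", delete=False
    ) as dst:
        dst.writelines(itertools.islice(src, nrows + 1))  # header + nrows data lines
        tmp_path = dst.name
    return cudf.read_csv(tmp_path)
```

With something like this, the timed call reads a file that is exactly the size of the requested data, so the result is not dominated by scanning the untouched tail of the source file.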
In the tests below I found quite a large performance delta (72.8 ms vs 1.6 s).
Take the head of the file for reading:
```python
import cudf
!head -n 100001 '/datasets/nyc_taxi/2015/yellow_tripdata_2015-01.csv' > 'yellow_tripdata_2015-01_head_100k.csv'
```
Timing on reading from the small file:
```python
%timeit df = cudf.read_csv('yellow_tripdata_2015-01_head_100k.csv', nrows=100_000)
72.8 ms ± 28.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
Timing on reading the whole file:
```python
%timeit df = cudf.read_csv('/datasets/nyc_taxi/2015/yellow_tripdata_2015-01.csv', nrows=100_000)
1.6 s ± 45.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
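As a sanity check (a sketch, assuming the two files from the timings above exist), both calls should return the same data, so the ~20x delta is scan/parse overhead rather than a difference in what ends up in the dataframe:

```python
import cudf

# Both reads should produce identical frames; only the input file size differs.
df_head = cudf.read_csv('yellow_tripdata_2015-01_head_100k.csv', nrows=100_000)
df_full = cudf.read_csv('/datasets/nyc_taxi/2015/yellow_tripdata_2015-01.csv', nrows=100_000)
assert df_head.to_pandas().equals(df_full.to_pandas())
```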