
`nrows` in `benchmark_dataframe.py` might skew results for test_Read_CSV

Open · VibhuJawa opened this issue on Aug 19, 2019 · 0 comments


We currently use `nrows` to parameterize how many rows are read in `test_Read_CSV`, which might skew results for the smaller row counts.

https://github.com/pentschev/pybench/blob/89d65a6c418a1fee39d447bd11b8a999835b74a9/pybench/benchmarks/benchmark_dataframe.py#L16
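For context, the pattern in question looks roughly like the hypothetical sketch below (pytest-benchmark style; the fixture usage, parameter values, and file path are assumptions, not copied from `benchmark_dataframe.py`). The point is that every parameterization reads the same large file:

```python
# Hypothetical sketch of an nrows-parameterized CSV read benchmark.
# Each case passes a different nrows against the SAME large file, so
# the cost of scanning that file is paid even for tiny row counts.
import cudf
import pytest

@pytest.mark.parametrize("nrows", [1_000, 100_000, 10_000_000])
def test_Read_CSV(benchmark, nrows):
    benchmark(cudf.read_csv, "large_input.csv", nrows=nrows)
```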

In the current cudf implementation, I think that even with `nrows` set we parse the entire file in GPU memory to find line terminators (and quote characters).

This might skew our results for reads that use `nrows`, so we might want to change the benchmark, e.g. as sketched below.
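One possible mitigation (a sketch of an alternative, not the actual pybench fix; the helper name is hypothetical) is to materialize a truncated copy of the input per parameter value, so the file on disk matches the requested row count and no oversized scan can happen:

```python
import itertools

def make_head_file(src, dst, n_rows):
    # Copy the header line plus the first n_rows data rows of src into dst,
    # mirroring the `head -n 100001` approach used in the timings below.
    with open(src) as fin, open(dst, "w") as fout:
        fout.writelines(itertools.islice(fin, n_rows + 1))

make_head_file(
    "/datasets/nyc_taxi/2015/yellow_tripdata_2015-01.csv",
    "yellow_tripdata_2015-01_head_100k.csv",
    100_000,
)
```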

See these comments on issue https://github.com/rapidsai/cudf/issues/1643 for details: https://github.com/rapidsai/cudf/issues/1643#issuecomment-492490919 and https://github.com/rapidsai/cudf/issues/1643#issuecomment-492848687

In the tests below I found quite a large performance delta (72.8 ms vs. 1.6 s).

Take the head of the file for reading:

```python
import cudf
!head -n 100001 '/datasets/nyc_taxi/2015/yellow_tripdata_2015-01.csv' > 'yellow_tripdata_2015-01_head_100k.csv'
```

Timing on reading the small file:

```python
%timeit df = cudf.read_csv('yellow_tripdata_2015-01_head_100k.csv', nrows=100_000)
72.8 ms ± 28.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

Timing on reading the whole file:

```python
%timeit df = cudf.read_csv('/datasets/nyc_taxi/2015/yellow_tripdata_2015-01.csv', nrows=100_000)
1.6 s ± 45.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
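For anyone reproducing this outside IPython, here is a minimal script-form sketch of the same comparison (it assumes the same NYC taxi paths as above):

```python
import time
import cudf

def time_read(path, n_rows, repeats=7):
    # Return the best-of-N wall-clock time for reading n_rows from path.
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        cudf.read_csv(path, nrows=n_rows)
        best = min(best, time.perf_counter() - t0)
    return best

print("head file :", time_read("yellow_tripdata_2015-01_head_100k.csv", 100_000))
print("whole file:", time_read("/datasets/nyc_taxi/2015/yellow_tripdata_2015-01.csv", 100_000))
```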
