[FEA] Performance improvements for csv-writer

Open davidwendt opened this issue 5 years ago • 5 comments

I've been working on improvements to the csv-writer. The changes may require multiple PRs and are as follows:

  1. The current implementation formats the CSV (in row chunks) into CPU memory before writing the file. Profiling showed that transposing the columns from device memory to host memory took more than half the total time to generate the file. Changing the logic to format the CSV in device memory first and then copy the result to host before writing the file improved performance by 20-30%. This item requires no change to the API.
  2. When chunking the rows, writing the chunks to individual files did not improve performance on its own, but generating multiple files may improve read speed. If this becomes an option, the code can launch a separate CPU thread to write each chunk from host memory, which provided a 2-3x speedup over creating a single output file. This item would require a new parameter telling the csv-writer to create an individual file for each chunk.
  3. Once the first two items are implemented, it would be possible to support gzip compression of the file chunks without too large a performance penalty. Adding gzip without these measures increased the write time 3-4x. This item would also require a new parameter indicating that compression is desired.
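
To illustrate items 2 and 3, here is a minimal host-side sketch in Python (standard library only, not cuDF code). It splits the rows into chunks, formats each chunk in memory, and writes one file per chunk from a thread pool, optionally with gzip. The names `format_chunk`, `write_chunk`, `write_csv_chunks`, and the `rows_per_chunk` and `compress` parameters are hypothetical and only stand in for whatever new csv-writer parameters would be chosen. Formatting happens on the host here, whereas item 1 proposes doing it in device memory.

```python
import csv
import gzip
import io
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def format_chunk(rows):
    """Format one chunk of rows as CSV bytes in memory.

    Host-side stand-in for the device-side formatting described in item 1.
    """
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue().encode("utf-8")


def write_chunk(path, payload, compress):
    """Write one already-formatted chunk to its own file, optionally gzipped."""
    opener = gzip.open if compress else open
    with opener(path, "wb") as f:
        f.write(payload)
    return path


def write_csv_chunks(rows, out_dir, rows_per_chunk, compress=False, max_workers=4):
    """Write `rows` as one file per chunk, one writer task per chunk.

    `rows_per_chunk` and `compress` are hypothetical parameters used for
    illustration; they are not part of any existing csv-writer API.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    suffix = ".csv.gz" if compress else ".csv"
    chunks = [rows[i:i + rows_per_chunk] for i in range(0, len(rows), rows_per_chunk)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(
                write_chunk,
                out_dir / f"part-{n:05d}{suffix}",
                format_chunk(chunk),
                compress,
            )
            for n, chunk in enumerate(chunks)
        ]
        # Collect results in chunk order; any write error is re-raised here.
        return [f.result() for f in futures]
```

Calling `write_csv_chunks(rows, "out", rows_per_chunk=1_000_000, compress=True)` would produce `out/part-00000.csv.gz`, `out/part-00001.csv.gz`, and so on. Whether the threads give a speedup similar to the one reported above depends on the storage device and chunk size, so the sketch should be measured rather than assumed to match.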

I recommend adding these improvements in order, since each subsequent item builds on the advantage gained from the previous one. The first item also makes GDS (GPUDirect Storage) an option for speeding up the actual file write, since copying the data to the host would no longer be required.

davidwendt, Aug 23 '19 21:08