cudf
cudf copied to clipboard
Use stod in cuio floating point parsing
Fixes #10599 Parsing string to float is inconsistent between CSV reader and to_numeric. This is because cuio parsing and cudf.to_numeric uses different algorithms.
- cuio parsing uses
parse_numeric
. This is written such that it works on both integrals and floating point numbers. - to_numeric in convert_floats.cu uses
stod
. This can return partial result if parsing encounters invalid float format.
stod
is very close to the C++ STL stod
function implementation.
This PR updated the parsing code to use modified version of stod
.
stod
can take optional parameters such as error_result, decimal_char, thousands_char (documented).
Benchmark comparisons. BENCH_JSONs.zip commit ff63c0a745382d79eaf0ae45931e7167d9763b88 (HEAD -> branch-22.08) vs This PR branch.
./_deps/benchmark-src/tools/compare.py benchmarks branch-22.08/CSV_READER_BENCH.json enh-test_float_parse/CSV_READER_BENCH.json
Comparing branch-22.08/CSV_READER_BENCH.json to enh-test_float_parse/CSV_READER_BENCH.json
Benchmark Time CPU Time Old Time New CPU Old CPU New
-----------------------------------------------------------------------------------------------------------------------------------------------
CsvRead/integral_file_input/29/0/manual_time +0.1702 +0.1698 224 262 224 262
CsvRead/integral_buffer_input/29/1/manual_time +0.1761 +0.1779 196 231 196 231
CsvRead/floats_file_input/31/0/manual_time +0.2110 +0.2109 252 305 252 305
CsvRead/floats_buffer_input/31/1/manual_time +0.2202 +0.2180 226 276 226 276
CsvRead/decimal_file_input/35/0/manual_time +0.1168 +0.1168 140 156 140 156
CsvRead/decimal_buffer_input/35/1/manual_time +0.0349 +0.0340 117 122 117 121
CsvRead/timestamps_file_input/33/0/manual_time +0.1542 +0.1544 486 561 486 561
CsvRead/timestamps_buffer_input/33/1/manual_time +0.0977 +0.0979 437 480 437 480
CsvRead/durations_file_input/34/0/manual_time +0.0814 +0.0811 627 678 627 678
CsvRead/durations_buffer_input/34/1/manual_time +0.0933 +0.0932 555 607 555 607
CsvRead/string_file_input/23/0/manual_time +0.0367 +0.0361 142 147 142 147
CsvRead/string_buffer_input/23/1/manual_time +0.0267 +0.0270 122 126 122 126
CsvRead/column_selection/0/0/1/manual_time +0.1039 +0.1038 285 314 284 314
CsvRead/column_selection/1/0/1/manual_time +0.1197 +0.1197 203 227 203 227
CsvRead/column_selection/2/0/1/manual_time +0.1219 +0.1207 206 231 206 231
CsvRead/column_selection/3/0/1/manual_time +0.1187 +0.1208 207 232 207 232
CsvRead/row_selection/0/1/1/manual_time +0.0994 +0.1007 283 311 283 311
CsvRead/row_selection/0/2/1/manual_time +0.0962 +0.0962 283 310 283 310
CsvRead/row_selection/0/3/1/manual_time +0.1002 +0.1002 282 311 282 311
CsvRead/row_selection/0/1/8/manual_time +0.1315 +0.1331 263 297 262 297
CsvRead/row_selection/0/2/8/manual_time +0.0831 +0.0826 603 653 602 652
CsvRead/row_selection/0/3/8/manual_time +0.0681 +0.0673 922 985 922 984
./_deps/benchmark-src/tools/compare.py benchmarks branch-22.08/STRINGS_BENCH.json enh-test_float_parse/STRINGS_BENCH.json
StringsToNumeric/strings_to_float32/1024/manual_time +0.0517 +0.0317 26 27 45 46
StringsToNumeric/strings_to_float32/4096/manual_time +0.0416 +0.0265 26 27 45 46
StringsToNumeric/strings_to_float32/16384/manual_time +0.0535 +0.0356 28 29 47 48
StringsToNumeric/strings_to_float32/65536/manual_time +0.0561 +0.0389 30 32 49 50
StringsToNumeric/strings_to_float32/131072/manual_time +0.1140 +0.0758 34 38 53 57
StringsToNumeric/strings_to_float64/1024/manual_time +0.0513 +0.0336 26 27 45 47
StringsToNumeric/strings_to_float64/4096/manual_time +0.0440 +0.0259 27 28 46 47
StringsToNumeric/strings_to_float64/16384/manual_time +0.0455 +0.0293 28 29 47 48
StringsToNumeric/strings_to_float64/65536/manual_time +0.0508 +0.0327 30 32 49 51
StringsToNumeric/strings_to_float64/131072/manual_time +0.1082 +0.0696 35 38 54 57
rerun tests
At this point I think our options are:
- pause for now and revisit later during future refactoring
- take a closer look at a performance comparison of
strings::convert::convert_floats::stod
versusio::utilities::parsing_utils::parse_numeric
, and focus on makingstrings::convert::convert_floats::stod
fast enough to ultimately be useful in cuIO@davidwendt would you please share your thoughts on 2.?
IMO. The difference in performance may because of the incompleteness of the original parse_numeric
which did not account for some boundary conditions in was less accurate in its calculation. Also, it appears that error-checking logic is included in the stod
version defined in parsing_utils.cuh
. This extra logic could also possibly slow down the execution. Maybe I'm interpreting the benchmark results incorrectly.
I guess I thought that the task here was to have only a single stod
function used by both the parse_numeric
and the convert-to-float
functions. It now appears we still have 2 unique code paths but they are just both named stod
now? Perhaps that solution caused unacceptable performance (or maintenance?) issues and was scrapped? If so, is the PR just for increasing the accuracy and correctness of parse_numeric
?
Sorry I have an off-topic comment: Should stod
is an important enough feature that should be implemented more carefully (i.e., fully coveraged for boundary conditions) at a lower level like in libcu++ or even as a native function in CUDA sdk?
Sorry I have an off-topic comment: Should
stod
is an important enough feature that should be implemented more carefully (i.e., fully coveraged for boundary conditions) at a lower level like in libcu++ or even as a native function in CUDA sdk?
Yes, once libcu++ starts adding support for strings I suspect they will be copying this function from us.
This PR has been labeled inactive-30d
due to no recent activity in the past 30 days. Please close this PR if it is no longer required. Otherwise, please respond with a comment indicating any updates. This PR will be labeled inactive-90d
if there is no activity in the next 60 days.
closing the PR. will open again if this is requested.