cudf icon indicating copy to clipboard operation
cudf copied to clipboard

Use stod in cuio floating point parsing

Open karthikeyann opened this issue 2 years ago • 6 comments

Fixes #10599 Parsing string to float is inconsistent between CSV reader and to_numeric. This is because cuio parsing and cudf.to_numeric uses different algorithms.

  • cuio parsing uses parse_numeric. This is written such that it works on both integrals and floating point numbers.
  • to_numeric in convert_floats.cu uses stod. This can return partial result if parsing encounters invalid float format.

stod is very close to the C++ STL stod function implementation.

This PR updated the parsing code to use modified version of stod.
stod can take optional parameters such as error_result, decimal_char, thousands_char (documented).

karthikeyann avatar Jul 01 '22 17:07 karthikeyann

Benchmark comparisons. BENCH_JSONs.zip commit ff63c0a745382d79eaf0ae45931e7167d9763b88 (HEAD -> branch-22.08) vs This PR branch.

./_deps/benchmark-src/tools/compare.py benchmarks branch-22.08/CSV_READER_BENCH.json enh-test_float_parse/CSV_READER_BENCH.json 
Comparing branch-22.08/CSV_READER_BENCH.json to enh-test_float_parse/CSV_READER_BENCH.json
Benchmark                                                          Time             CPU      Time Old      Time New       CPU Old       CPU New
-----------------------------------------------------------------------------------------------------------------------------------------------
CsvRead/integral_file_input/29/0/manual_time                    +0.1702         +0.1698           224           262           224           262
CsvRead/integral_buffer_input/29/1/manual_time                  +0.1761         +0.1779           196           231           196           231
CsvRead/floats_file_input/31/0/manual_time                      +0.2110         +0.2109           252           305           252           305
CsvRead/floats_buffer_input/31/1/manual_time                    +0.2202         +0.2180           226           276           226           276
CsvRead/decimal_file_input/35/0/manual_time                     +0.1168         +0.1168           140           156           140           156
CsvRead/decimal_buffer_input/35/1/manual_time                   +0.0349         +0.0340           117           122           117           121
CsvRead/timestamps_file_input/33/0/manual_time                  +0.1542         +0.1544           486           561           486           561
CsvRead/timestamps_buffer_input/33/1/manual_time                +0.0977         +0.0979           437           480           437           480
CsvRead/durations_file_input/34/0/manual_time                   +0.0814         +0.0811           627           678           627           678
CsvRead/durations_buffer_input/34/1/manual_time                 +0.0933         +0.0932           555           607           555           607
CsvRead/string_file_input/23/0/manual_time                      +0.0367         +0.0361           142           147           142           147
CsvRead/string_buffer_input/23/1/manual_time                    +0.0267         +0.0270           122           126           122           126
CsvRead/column_selection/0/0/1/manual_time                      +0.1039         +0.1038           285           314           284           314
CsvRead/column_selection/1/0/1/manual_time                      +0.1197         +0.1197           203           227           203           227
CsvRead/column_selection/2/0/1/manual_time                      +0.1219         +0.1207           206           231           206           231
CsvRead/column_selection/3/0/1/manual_time                      +0.1187         +0.1208           207           232           207           232
CsvRead/row_selection/0/1/1/manual_time                         +0.0994         +0.1007           283           311           283           311
CsvRead/row_selection/0/2/1/manual_time                         +0.0962         +0.0962           283           310           283           310
CsvRead/row_selection/0/3/1/manual_time                         +0.1002         +0.1002           282           311           282           311
CsvRead/row_selection/0/1/8/manual_time                         +0.1315         +0.1331           263           297           262           297
CsvRead/row_selection/0/2/8/manual_time                         +0.0831         +0.0826           603           653           602           652
CsvRead/row_selection/0/3/8/manual_time                         +0.0681         +0.0673           922           985           922           984
./_deps/benchmark-src/tools/compare.py benchmarks branch-22.08/STRINGS_BENCH.json enh-test_float_parse/STRINGS_BENCH.json
StringsToNumeric/strings_to_float32/1024/manual_time                             +0.0517         +0.0317            26            27            45            46
StringsToNumeric/strings_to_float32/4096/manual_time                             +0.0416         +0.0265            26            27            45            46
StringsToNumeric/strings_to_float32/16384/manual_time                            +0.0535         +0.0356            28            29            47            48
StringsToNumeric/strings_to_float32/65536/manual_time                            +0.0561         +0.0389            30            32            49            50
StringsToNumeric/strings_to_float32/131072/manual_time                           +0.1140         +0.0758            34            38            53            57
StringsToNumeric/strings_to_float64/1024/manual_time                             +0.0513         +0.0336            26            27            45            47
StringsToNumeric/strings_to_float64/4096/manual_time                             +0.0440         +0.0259            27            28            46            47
StringsToNumeric/strings_to_float64/16384/manual_time                            +0.0455         +0.0293            28            29            47            48
StringsToNumeric/strings_to_float64/65536/manual_time                            +0.0508         +0.0327            30            32            49            51
StringsToNumeric/strings_to_float64/131072/manual_time                           +0.1082         +0.0696            35            38            54            57

karthikeyann avatar Jul 01 '22 21:07 karthikeyann

rerun tests

karthikeyann avatar Jul 01 '22 21:07 karthikeyann

At this point I think our options are:

  1. pause for now and revisit later during future refactoring
  2. take a closer look at a performance comparison of strings::convert::convert_floats::stod versus io::utilities::parsing_utils::parse_numeric, and focus on making strings::convert::convert_floats::stod fast enough to ultimately be useful in cuIO

@davidwendt would you please share your thoughts on 2.?

IMO. The difference in performance may because of the incompleteness of the original parse_numeric which did not account for some boundary conditions in was less accurate in its calculation. Also, it appears that error-checking logic is included in the stod version defined in parsing_utils.cuh. This extra logic could also possibly slow down the execution. Maybe I'm interpreting the benchmark results incorrectly.

davidwendt avatar Jul 05 '22 12:07 davidwendt

I guess I thought that the task here was to have only a single stod function used by both the parse_numeric and the convert-to-float functions. It now appears we still have 2 unique code paths but they are just both named stod now? Perhaps that solution caused unacceptable performance (or maintenance?) issues and was scrapped? If so, is the PR just for increasing the accuracy and correctness of parse_numeric?

davidwendt avatar Jul 05 '22 12:07 davidwendt

Sorry I have an off-topic comment: Should stod is an important enough feature that should be implemented more carefully (i.e., fully coveraged for boundary conditions) at a lower level like in libcu++ or even as a native function in CUDA sdk?

ttnghia avatar Jul 14 '22 19:07 ttnghia

Sorry I have an off-topic comment: Should stod is an important enough feature that should be implemented more carefully (i.e., fully coveraged for boundary conditions) at a lower level like in libcu++ or even as a native function in CUDA sdk?

Yes, once libcu++ starts adding support for strings I suspect they will be copying this function from us.

davidwendt avatar Jul 14 '22 19:07 davidwendt

This PR has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this PR if it is no longer required. Otherwise, please respond with a comment indicating any updates. This PR will be labeled inactive-90d if there is no activity in the next 60 days.

github-actions[bot] avatar Sep 01 '22 22:09 github-actions[bot]

closing the PR. will open again if this is requested.

karthikeyann avatar Feb 14 '23 12:02 karthikeyann