Enable `get_token_stream` to include `LineEnd` tokens with optional parameter.

Open SurajAralihalli opened this issue 1 year ago • 1 comments
Description

This PR adds a `LineEndTokenOption` parameter to the `get_token_stream` and `process_token_stream` functions so that callers can include `LineEnd` tokens in the output. The original declaration of `get_token_stream` is retained to maintain backward compatibility.
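For readers unfamiliar with the pattern, here is a minimal host-side sketch of adding an option through a new overload while keeping the old declaration. It is not the cuDF implementation: the toy tokenizer, the `Token` enum, and the enumerator names `Discard` and `Keep` are assumptions made for illustration. Only the names `get_token_stream`, `LineEndTokenOption`, and `LineEnd` come from the description above.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical, heavily reduced token set; the real tokenizer emits many more kinds.
enum class Token : std::uint8_t { StructBegin, StructEnd, LineEnd };

// Option name taken from the PR description; the enumerators are assumptions.
enum class LineEndTokenOption : bool { Discard, Keep };

// New overload: the caller decides whether LineEnd tokens appear in the output.
std::vector<Token> get_token_stream(std::string const& json_lines,
                                    LineEndTokenOption line_end)
{
  std::vector<Token> tokens;
  for (char c : json_lines) {
    switch (c) {
      case '{': tokens.push_back(Token::StructBegin); break;
      case '}': tokens.push_back(Token::StructEnd); break;
      case '\n':
        // Emit a row delimiter only when the caller asked for it.
        if (line_end == LineEndTokenOption::Keep) { tokens.push_back(Token::LineEnd); }
        break;
      default: break;  // every other character is ignored by this toy tokenizer
    }
  }
  return tokens;
}

// Original single-argument declaration, kept so existing callers compile unchanged.
// It forwards to the new overload with the assumed previous behavior (no LineEnd tokens).
std::vector<Token> get_token_stream(std::string const& json_lines)
{
  return get_token_stream(json_lines, LineEndTokenOption::Discard);
}
```

Existing call sites such as `get_token_stream(input)` keep their old output, while new callers opt in with `get_token_stream(input, LineEndTokenOption::Keep)`.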

Checklist

  • [x] I am familiar with the Contributing Guidelines.
  • [x] New or existing tests cover these changes.
  • [x] The documentation is up to date with these changes.

SurajAralihalli, Apr 26 '24 23:04

@shrshi and I discussed a profile of @revans2's prototype https://github.com/revans2/spark-rapids-jni/pull/new/get_json_obj_experiment. [Image: cuDF profile]

A few outcomes of our meeting:

  • 5 simulateDFA calls in total.
  • A few smaller FSTs take longer than the get_stack_context FST. We will investigate further.
  • Consider fusing recover_from_error with PDA (json_to_tokens_fst FST).
  • It is possible to eliminate the process_token_stream function itself if the code that post-processes the tokens after get_token_stream can handle Error tokens. That new code is going to process tokens row-wise, so it will be easier to nullify the entire row.
    • @SurajAralihalli, can the new token post-processing code replace Error tokens with a null row?
    • @shrshi suggested that an FST may not be required to implement process_token_stream anyway, since it is always a stream compaction. (Related PR https://github.com/rapidsai/cudf/pull/13344)
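
To make the last point concrete, here is a host-side sketch of row-wise error handling written as a plain stream compaction, with no FST involved. It is an illustration under assumptions, not the cuDF code: the `Token` enum, the `CompactedTokens` struct, and the `drop_error_rows` helper are hypothetical, rows are assumed to be delimited by `LineEnd` tokens, and a row with an Error token is assumed to lose all of its tokens and be flagged as null.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical, reduced token set for illustration.
enum class Token { StructBegin, StructEnd, Value, Error, LineEnd };

// Hypothetical result type: the surviving tokens plus one null flag per input row.
struct CompactedTokens {
  std::vector<Token> tokens;
  std::vector<bool> row_is_null;
};

// Rows end at a LineEnd token; a trailing row without LineEnd also counts as a row.
CompactedTokens drop_error_rows(std::vector<Token> const& in)
{
  // Pass 1: label every token with its row index and flag rows that contain an Error.
  std::vector<std::size_t> row_of(in.size());
  std::vector<bool> row_is_null;
  std::size_t row = 0;
  for (std::size_t i = 0; i < in.size(); ++i) {
    if (row == row_is_null.size()) { row_is_null.push_back(false); }  // first token of a new row
    row_of[i] = row;
    if (in[i] == Token::Error) { row_is_null[row] = true; }
    if (in[i] == Token::LineEnd) { ++row; }
  }

  // Pass 2: stream compaction. Keep a token only if its row was not flagged.
  CompactedTokens out;
  for (std::size_t i = 0; i < in.size(); ++i) {
    assert(row_of[i] < row_is_null.size());
    if (!row_is_null[row_of[i]]) { out.tokens.push_back(in[i]); }
  }
  out.row_is_null = std::move(row_is_null);
  return out;
}
```

On the GPU, the second pass would presumably map onto a compaction primitive such as `thrust::copy_if` with the per-token row flag as the stencil, and the first pass onto a scan over LineEnd positions. That mapping is a guess at the design, not something confirmed in this thread.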

karthikeyann, May 08 '24 02:05