cudf
Enable `get_token_stream` to include `LineEnd` tokens with optional parameter.
Description
This PR adds a `LineEndTokenOption` parameter to the `get_token_stream` and `process_token_stream` functions, which enables `LineEnd` tokens in the output. The original declaration of `get_token_stream` is retained to maintain backward compatibility.
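To illustrate the backward-compatibility pattern described above, here is a minimal, self-contained sketch. It is a toy model, not the cudf implementation: the `sketch` namespace, the `Token` enum, the string-based tokenizer, and the `Discard`/`Keep` enumerators are all assumptions made for illustration. Only the names `get_token_stream` and `LineEndTokenOption` come from this PR.

```cpp
#include <string>
#include <vector>

namespace sketch {

// Hypothetical toy token kinds; not cudf's actual token enumeration.
enum class Token { StructBegin, StructEnd, LineEnd };

// Option name taken from the PR description; the enumerators are assumed.
enum class LineEndTokenOption { Discard, Keep };

// New overload: the caller decides whether LineEnd tokens appear in the output.
inline std::vector<Token> get_token_stream(const std::string& json_lines,
                                           LineEndTokenOption line_end_option)
{
  std::vector<Token> tokens;
  for (char c : json_lines) {
    switch (c) {
      case '{': tokens.push_back(Token::StructBegin); break;
      case '}': tokens.push_back(Token::StructEnd); break;
      case '\n':
        // Emit a LineEnd token only when the caller asked for it.
        if (line_end_option == LineEndTokenOption::Keep) { tokens.push_back(Token::LineEnd); }
        break;
      default: break;  // toy tokenizer: every other character is ignored
    }
  }
  return tokens;
}

// Original declaration retained: existing callers keep the old behavior
// (no LineEnd tokens) without changing their call sites.
inline std::vector<Token> get_token_stream(const std::string& json_lines)
{
  return get_token_stream(json_lines, LineEndTokenOption::Discard);
}

}  // namespace sketch
```

In this sketch, a call such as `sketch::get_token_stream(input)` behaves as before, while `sketch::get_token_stream(input, sketch::LineEndTokenOption::Keep)` additionally yields one `LineEnd` token per newline.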
Checklist
- [x] I am familiar with the Contributing Guidelines.
- [x] New or existing tests cover these changes.
- [x] The documentation is up to date with these changes.
@shrshi and I discussed a profile of @revans2's prototype https://github.com/revans2/spark-rapids-jni/pull/new/get_json_obj_experiment.CUDF
A few outcomes of our meeting:
- 5 simulateDFA calls in total.
- A few smaller FSTs take longer than the `get_stack_context` FST. We will investigate further.
- Consider fusing `recover_from_error` with the PDA (`json_to_tokens_fst` FST).
- It is possible to eliminate the `process_token_stream` function itself if the code that post-processes tokens after `get_token_stream` can handle Error tokens. That new code is going to process tokens row-wise, so it will be easier to nullify the entire row.
  - @SurajAralihalli can the new post-processing code replace Error tokens with a null row?
- @shrshi suggested that an FST may not be required to implement `process_token_stream` anyway, since it is always a stream compaction. (Related PR https://github.com/rapidsai/cudf/pull/13344)
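As a rough illustration of the row-wise idea discussed above, the following host-side sketch treats `LineEnd` tokens as row delimiters, marks every row that contains an Error token as null, and then compacts the stream so that only tokens from valid rows remain. Everything here is hypothetical: the `sketch` namespace, the `Token` enum, `RowwiseResult`, and `nullify_error_rows` are illustrative names, and the sequential loops stand in for what would be parallel primitives on the GPU. It is not the prototype's code or cudf's `process_token_stream`.

```cpp
#include <cstddef>
#include <vector>

namespace sketch {

// Hypothetical toy token kinds; not cudf's actual token enumeration.
enum class Token { StructBegin, StructEnd, Value, Error, LineEnd };

struct RowwiseResult {
  std::vector<Token> tokens;    // tokens of valid rows only, LineEnd removed
  std::vector<bool> row_valid;  // false where the row contained an Error token
};

inline RowwiseResult nullify_error_rows(const std::vector<Token>& input)
{
  RowwiseResult out;

  // Pass 1: assign a row index to every token and flag rows that saw an Error.
  std::vector<std::size_t> row_of(input.size());
  std::size_t row = 0;
  bool row_open   = false;
  for (std::size_t i = 0; i < input.size(); ++i) {
    if (!row_open) {
      out.row_valid.push_back(true);  // a new row starts out valid
      row_open = true;
    }
    row_of[i] = row;
    if (input[i] == Token::Error) { out.row_valid[row] = false; }
    if (input[i] == Token::LineEnd) {
      ++row;  // LineEnd closes the current row
      row_open = false;
    }
  }

  // Pass 2: stream compaction. Keep a token only if it is not a row
  // delimiter and its row was not nullified.
  for (std::size_t i = 0; i < input.size(); ++i) {
    if (input[i] != Token::LineEnd && out.row_valid[row_of[i]]) {
      out.tokens.push_back(input[i]);
    }
  }
  return out;
}

}  // namespace sketch
```

The second pass is a plain filter with no state machine, which is the sense in which the step is "a stream compaction always" in this toy model. The validity flags play the role of a null mask for the rows.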