Reading multi-line JSON in string columns using runtime configurable delimiter

Open shrshi opened this issue 1 year ago • 2 comments

Description

Addresses #15277. Given a JSON Lines buffer whose records are separated by a delimiter passed at runtime, the idea is to modify the JSON tokenization FST so that it emits the EOL token on the passed delimiter instead of on the currently hard-coded newline character. This PR does not modify the whitespace normalization FST to strip out unquoted \n and \r; whitespace normalization will be handled in follow-up work. Note that this is not a multi-object JSON reader: we do not use the offsets data in the string column, so the start state is not reset at every row offset.
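To make the idea concrete, here is a minimal sketch in plain Python of a scanner that ends a record on a runtime-supplied delimiter rather than on a hard-coded newline. It is a toy model, not cuDF's tokenization FST: the function name `split_records` and its three-state string tracking are assumptions made for illustration only.

```python
def split_records(buffer: str, delimiter: str = "\n") -> list[str]:
    """Toy scanner (hypothetical, not cuDF code): close a record whenever
    `delimiter` is seen outside a quoted JSON string."""
    if len(delimiter) != 1:
        raise ValueError("delimiter must be a single character")

    records = []
    current = []
    in_string = False  # True while inside a double-quoted string
    escaped = False    # True right after a backslash inside a string

    for ch in buffer:
        if in_string:
            current.append(ch)
            if escaped:
                escaped = False      # this character was escaped; consume it
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False    # closing quote
        elif ch == delimiter:
            # Record boundary: this is where an EOL token would be emitted.
            records.append("".join(current))
            current = []
        else:
            current.append(ch)
            if ch == '"':
                in_string = True     # opening quote

    if current:
        records.append("".join(current))
    return records
```

With the ASCII record separator `"\x1e"` as the delimiter, a record that spans several lines stays in one piece, because the embedded newline is no longer treated as a record boundary:

```python
records = split_records('{"a": 1,\n "b": 2}\x1e{"a": 3}\x1e', "\x1e")
# records == ['{"a": 1,\n "b": 2}', '{"a": 3}']
```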

Current status:

  • [X] Semantic bracket/brace DFA
  • [X] DFA removing excess characters after record in line
  • [X] Pushdown automata generating tokens
  • [X] Test passing arbitrary delimiter that does not occur in input to the reader

Checklist

  • [X] I am familiar with the Contributing Guidelines.
  • [X] New or existing tests cover these changes.
  • [ ] The documentation is up to date with these changes.

shrshi, Apr 17 '24 19:04

If we translate delim to \n in the to_stack_op automata, I agree that we would not have to make changes to the logical stack. However, if there are \n characters in the input JSON, then I think the PDA FST will need to recognize the context in which the newline occurs (from the logical stack?) and decide whether or not it has been translated from delim. I'm not sure how involved the context detection logic would be in this case. What do you think?

I'd be curious to hear your thoughts on leaving \n as the delimiter on the stack context array. To my understanding, that would make changes in the logical stack and - hopefully - the PDA-FST superfluous(?). Working on this you probably have better insight if that's the case.

Did you get a chance to think through whether we could keep the original translation table, just translate delim to \n, and keep feeding that to the logical stack?

shrshi, Apr 29 '24 23:04

Sorry, missed your reply. To be clear, what I mean is to only translate the stack context (i.e., stack_symbols array in the get_token_stream function) from delim to \n and not modify the JSON input itself.

elstehle, May 06 '24 06:05
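One possible reading of the exchange above can be sketched in plain Python. This is a toy model only: `translate_context` is a hypothetical helper, and the string `context` merely stands in for a per-character stack context; it does not reflect the actual layout of stack_symbols in get_token_stream. The sketch shows the concern (after translation, the context alone cannot tell a real \n from a translated delim) and why leaving the JSON input unmodified may address it (the original character is still available at the same position).

```python
def translate_context(symbols: str, delim: str) -> str:
    """Hypothetical helper: return a copy of the context in which every
    `delim` is replaced by a newline. The argument itself is not modified."""
    return symbols.replace(delim, "\n")


delim = "\x1e"
json_input = '{"a": 1,\n "b": 2}\x1e{"a": 3}'

# Only the copy fed to the stack stage is translated.
context = translate_context(json_input, delim)

# Looking at the translated context alone, both positions look like newlines.
newline_positions = [i for i, ch in enumerate(context) if ch == "\n"]

# The untouched input still distinguishes the real record boundary.
boundary_positions = [i for i, ch in enumerate(json_input) if ch == delim]
```

Here `newline_positions` holds two indices while `boundary_positions` holds one, so a consumer that sees both the context and the original input can still recover which newline came from the delimiter.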

/merge

shrshi, May 20 '24 16:05