cudf
cudf copied to clipboard
Adds JSON-token-stream to JSON-tree
This PR adds the stage of converting a JSON input into a tree representation, where each node represents either a struct, a list, a field name, a string value, a value, or an error node.
The PR is part of a multi-part PR-chain. Specifically, this PR builds on the JSON tokenizer PR.
This PR depends on: ⛓️ https://github.com/rapidsai/cudf/pull/11264 ⛓️ https://github.com/rapidsai/cudf/pull/11242 ⛓️ https://github.com/rapidsai/cudf/pull/11078
Each node has one of the following category:
/// A node representing a struct
NC_STRUCT,
/// A node representing a list
NC_LIST,
/// A node representing a field name
NC_FN,
/// A node representing a string value
NC_STR,
/// A node representing a numeric or literal value (e.g., true, false, null)
NC_VAL,
/// A node representing a parser error
NC_ERR
For each node, the tree representation stores the following information:
- node category
- node level
- node range begin (index of the first character from the original JSON input that this node demarcates)
- node range end (index of one-past-the-last-character of the first character from the original JSON input that this node demarcates)
An example tree: The following is just an example print of the information represented in the tree generated by the algorithm.
- Each line is printing the full path to the next node in the tree.
- For each node along the path we have the following format:
<[NODE_ID]:[NODE_CATEGORY]:[[RANGE_BEGIN],[RANGE_END]) '[STRING_FROM_RANGE]'>
The original JSON for this tree:
[{"category": "reference","index:": [4,12,42],"author": "Nigel Rees","title": "[Sayings of the Century]","price": 8.95}, {"category": "reference","index": [4,{},null,{"a":[{ }, {}] } ],"author": "Nigel Rees","title": "{}[], <=semantic-symbols-string","price": 8.95}]
The tree:
<0:LIST:[2, 3) '['>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <2:FN:[5, 13) 'category'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <2:FN:[5, 13) 'category'> -> <3:STR:[17, 26) 'reference'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <4:FN:[29, 35) 'index:'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <4:FN:[29, 35) 'index:'> -> <5:LIST:[38, 39) '['>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <4:FN:[29, 35) 'index:'> -> <5:LIST:[38, 39) '['> -> <6:VAL:[39, 40) '4'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <4:FN:[29, 35) 'index:'> -> <5:LIST:[38, 39) '['> -> <7:VAL:[41, 43) '12'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <4:FN:[29, 35) 'index:'> -> <5:LIST:[38, 39) '['> -> <8:VAL:[44, 46) '42'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <9:FN:[49, 55) 'author'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <9:FN:[49, 55) 'author'> -> <10:STR:[59, 69) 'Nigel Rees'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <11:FN:[72, 77) 'title'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <11:FN:[72, 77) 'title'> -> <12:STR:[81, 105) '[Sayings of the Century]'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <13:FN:[108, 113) 'price'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <13:FN:[108, 113) 'price'> -> <14:VAL:[116, 120) '8.95'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <16:FN:[126, 134) 'category'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <16:FN:[126, 134) 'category'> -> <17:STR:[138, 147) 'reference'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'> -> <19:LIST:[158, 159) '['>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'> -> <19:LIST:[158, 159) '['> -> <20:VAL:[159, 160) '4'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'> -> <19:LIST:[158, 159) '['> -> <21:STRUCT:[161, 162) '{'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'> -> <19:LIST:[158, 159) '['> -> <22:VAL:[164, 168) 'null'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'> -> <19:LIST:[158, 159) '['> -> <23:STRUCT:[169, 170) '{'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'> -> <19:LIST:[158, 159) '['> -> <23:STRUCT:[169, 170) '{'> -> <24:FN:[171, 172) 'a'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'> -> <19:LIST:[158, 159) '['> -> <23:STRUCT:[169, 170) '{'> -> <24:FN:[171, 172) 'a'> -> <25:LIST:[174, 175) '['>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'> -> <19:LIST:[158, 159) '['> -> <23:STRUCT:[169, 170) '{'> -> <24:FN:[171, 172) 'a'> -> <25:LIST:[174, 175) '['> -> <26:STRUCT:[175, 176) '{'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'> -> <19:LIST:[158, 159) '['> -> <23:STRUCT:[169, 170) '{'> -> <24:FN:[171, 172) 'a'> -> <25:LIST:[174, 175) '['> -> <27:STRUCT:[180, 181) '{'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <28:FN:[189, 195) 'author'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <28:FN:[189, 195) 'author'> -> <29:STR:[199, 209) 'Nigel Rees'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <30:FN:[212, 217) 'title'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <30:FN:[212, 217) 'title'> -> <31:STR:[221, 252) '{}[], <=semantic-symbols-string'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <32:FN:[255, 260) 'price'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <32:FN:[255, 260) 'price'> -> <33:VAL:[263, 267) '8.95'>
The original JSON pretty-printed for this tree:
[
{
"category": "reference",
"index:": [
4,
12,
42
],
"author": "Nigel Rees",
"title": "[Sayings of the Century]",
"price": 8.95
},
{
"category": "reference",
"index": [
4,
{},
null,
{
"a": [
{},
{}
]
}
],
"author": "Nigel Rees",
"title": "{}[], <=semantic-symbols-string",
"price": 8.95
}
]
Codecov Report
:exclamation: No coverage uploaded for pull request base (
branch-22.10@a67b718
). Click here to learn what that means. Patch has no changes to coverable lines.
Additional details and impacted files
@@ Coverage Diff @@
## branch-22.10 #11291 +/- ##
===============================================
Coverage ? 86.48%
===============================================
Files ? 145
Lines ? 22840
Branches ? 0
===============================================
Hits ? 19753
Misses ? 3087
Partials ? 0
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.
:umbrella: View full report at Codecov.
:loudspeaker: Do you have feedback about the report comment? Let us know in this issue.
This PR has been labeled inactive-30d
due to no recent activity in the past 30 days. Please close this PR if it is no longer required. Otherwise, please respond with a comment indicating any updates. This PR will be labeled inactive-90d
if there is no activity in the next 60 days.