cudf icon indicating copy to clipboard operation
cudf copied to clipboard

Adds JSON-token-stream to JSON-tree

Open elstehle opened this issue 2 years ago • 1 comments

This PR adds the stage of converting a JSON input into a tree representation, where each node represents either a struct, a list, a field name, a string value, a value, or an error node.

The PR is part of a multi-part PR-chain. Specifically, this PR builds on the JSON tokenizer PR.

This PR depends on: ⛓️ https://github.com/rapidsai/cudf/pull/11264 ⛓️ https://github.com/rapidsai/cudf/pull/11242 ⛓️ https://github.com/rapidsai/cudf/pull/11078

Each node has one of the following category:

/// A node representing a struct
NC_STRUCT,
/// A node representing a list
NC_LIST,
/// A node representing a field name
NC_FN,
/// A node representing a string value
NC_STR,
/// A node representing a numeric or literal value (e.g., true, false, null)
NC_VAL,
/// A node representing a parser error
NC_ERR

For each node, the tree representation stores the following information:

  • node category
  • node level
  • node range begin (index of the first character from the original JSON input that this node demarcates)
  • node range end (index of one-past-the-last-character of the first character from the original JSON input that this node demarcates)

An example tree: The following is just an example print of the information represented in the tree generated by the algorithm.

  • Each line is printing the full path to the next node in the tree.
  • For each node along the path we have the following format: <[NODE_ID]:[NODE_CATEGORY]:[[RANGE_BEGIN],[RANGE_END]) '[STRING_FROM_RANGE]'>

The original JSON for this tree:

  [{"category": "reference","index:": [4,12,42],"author": "Nigel Rees","title": "[Sayings of the Century]","price": 8.95},  {"category": "reference","index": [4,{},null,{"a":[{ }, {}] } ],"author": "Nigel Rees","title": "{}[], <=semantic-symbols-string","price": 8.95}] 

The tree:

<0:LIST:[2, 3) '['>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <2:FN:[5, 13) 'category'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <2:FN:[5, 13) 'category'> -> <3:STR:[17, 26) 'reference'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <4:FN:[29, 35) 'index:'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <4:FN:[29, 35) 'index:'> -> <5:LIST:[38, 39) '['>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <4:FN:[29, 35) 'index:'> -> <5:LIST:[38, 39) '['> -> <6:VAL:[39, 40) '4'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <4:FN:[29, 35) 'index:'> -> <5:LIST:[38, 39) '['> -> <7:VAL:[41, 43) '12'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <4:FN:[29, 35) 'index:'> -> <5:LIST:[38, 39) '['> -> <8:VAL:[44, 46) '42'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <9:FN:[49, 55) 'author'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <9:FN:[49, 55) 'author'> -> <10:STR:[59, 69) 'Nigel Rees'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <11:FN:[72, 77) 'title'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <11:FN:[72, 77) 'title'> -> <12:STR:[81, 105) '[Sayings of the Century]'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <13:FN:[108, 113) 'price'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <13:FN:[108, 113) 'price'> -> <14:VAL:[116, 120) '8.95'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <16:FN:[126, 134) 'category'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <16:FN:[126, 134) 'category'> -> <17:STR:[138, 147) 'reference'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'> -> <19:LIST:[158, 159) '['>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'> -> <19:LIST:[158, 159) '['> -> <20:VAL:[159, 160) '4'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'> -> <19:LIST:[158, 159) '['> -> <21:STRUCT:[161, 162) '{'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'> -> <19:LIST:[158, 159) '['> -> <22:VAL:[164, 168) 'null'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'> -> <19:LIST:[158, 159) '['> -> <23:STRUCT:[169, 170) '{'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'> -> <19:LIST:[158, 159) '['> -> <23:STRUCT:[169, 170) '{'> -> <24:FN:[171, 172) 'a'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'> -> <19:LIST:[158, 159) '['> -> <23:STRUCT:[169, 170) '{'> -> <24:FN:[171, 172) 'a'> -> <25:LIST:[174, 175) '['>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'> -> <19:LIST:[158, 159) '['> -> <23:STRUCT:[169, 170) '{'> -> <24:FN:[171, 172) 'a'> -> <25:LIST:[174, 175) '['> -> <26:STRUCT:[175, 176) '{'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'> -> <19:LIST:[158, 159) '['> -> <23:STRUCT:[169, 170) '{'> -> <24:FN:[171, 172) 'a'> -> <25:LIST:[174, 175) '['> -> <27:STRUCT:[180, 181) '{'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <28:FN:[189, 195) 'author'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <28:FN:[189, 195) 'author'> -> <29:STR:[199, 209) 'Nigel Rees'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <30:FN:[212, 217) 'title'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <30:FN:[212, 217) 'title'> -> <31:STR:[221, 252) '{}[], <=semantic-symbols-string'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <32:FN:[255, 260) 'price'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <32:FN:[255, 260) 'price'> -> <33:VAL:[263, 267) '8.95'>

The original JSON pretty-printed for this tree:

[
    {
        "category": "reference",
        "index:": [
            4,
            12,
            42
        ],
        "author": "Nigel Rees",
        "title": "[Sayings of the Century]",
        "price": 8.95
    },
    {
        "category": "reference",
        "index": [
            4,
            {},
            null,
            {
                "a": [
                    {},
                    {}
                ]
            }
        ],
        "author": "Nigel Rees",
        "title": "{}[], <=semantic-symbols-string",
        "price": 8.95
    }
]

elstehle avatar Jul 18 '22 17:07 elstehle

Codecov Report

:exclamation: No coverage uploaded for pull request base (branch-22.10@a67b718). Click here to learn what that means. Patch has no changes to coverable lines.

Additional details and impacted files
@@               Coverage Diff               @@
##             branch-22.10   #11291   +/-   ##
===============================================
  Coverage                ?   86.48%           
===============================================
  Files                   ?      145           
  Lines                   ?    22840           
  Branches                ?        0           
===============================================
  Hits                    ?    19753           
  Misses                  ?     3087           
  Partials                ?        0           

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

:umbrella: View full report at Codecov.
:loudspeaker: Do you have feedback about the report comment? Let us know in this issue.

codecov[bot] avatar Jul 26 '22 10:07 codecov[bot]

This PR has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this PR if it is no longer required. Otherwise, please respond with a comment indicating any updates. This PR will be labeled inactive-90d if there is no activity in the next 60 days.

github-actions[bot] avatar Sep 11 '22 00:09 github-actions[bot]