cudf icon indicating copy to clipboard operation
cudf copied to clipboard

JSON tree traversal

Open karthikeyann opened this issue 3 years ago • 1 comments
trafficstars

Adds JSON tree traversal algorithm in host and device.

It generates column indices for record orient json format. List of structs at root, where each struct is a row.

  • [x] column indices generation
  • [x] row offset

Depends on PR https://github.com/rapidsai/cudf/pull/11518

Tree Traversal

This algorithm assigns a unique column id to each node in the tree. The row offset is the row index of the node in that column id. Algorithm:

  1. Convert node_category+fieldname to node_type. a. Create a hashmap to hash field name and assign unique node id as values. b. Convert the node categories to node types. Node type is defined as node category enum value if it is not a field node, otherwise it is the unique node id assigned by the hashmap (value shifted by #NUM_CATEGORY).
  2. Preprocessing: Translate parent node ids after sorting by level. a. sort by level b. get gather map of sorted indices c. translate parent_node_ids to new sorted indices
  3. Find level boundaries. copy_if index of first unique values of sorted levels.
  4. Per-Level Processing: Propagate parent node ids for each level. For each level, a. gather col_id from previous level results. input=col_id, gather_map is parent_indices. b. stable sort by {parent_col_id, node_type} c. scan sum of unique {parent_col_id, node_type} d. scatter the col_id back to stable node_level order (using scatter_indices) Restore original node_id order
  5. Generate row_offset. a. stable_sort by parent_col_id. b. scan_by_key {parent_col_id} (required only on nodes who's parent is list) c. propagate to non-list leaves from parent list node by recursion

Checklist

  • [x] I am familiar with the Contributing Guidelines.
  • [x] New or existing tests cover these changes.
  • [x] The documentation is up to date with these changes.

karthikeyann avatar Aug 26 '22 19:08 karthikeyann

Codecov Report

:exclamation: No coverage uploaded for pull request base (branch-22.10@9a5f39a). Click here to learn what that means. Patch has no changes to coverable lines.

Additional details and impacted files
@@               Coverage Diff               @@
##             branch-22.10   #11610   +/-   ##
===============================================
  Coverage                ?   87.52%           
===============================================
  Files                   ?      133           
  Lines                   ?    21774           
  Branches                ?        0           
===============================================
  Hits                    ?    19058           
  Misses                  ?     2716           
  Partials                ?        0           

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

:umbrella: View full report at Codecov.
:loudspeaker: Do you have feedback about the report comment? Let us know in this issue.

codecov[bot] avatar Aug 26 '22 21:08 codecov[bot]

Thank you @elstehle @davidwendt @bdice @upsj @PointKernel ! 🎉 The PR got shaped into wonderful code. Always new perspective from each of the reviewers. Learned a lot. Great work! 👍

karthikeyann avatar Sep 23 '22 19:09 karthikeyann

@gpucibot merge

karthikeyann avatar Sep 24 '22 12:09 karthikeyann