cudf
cudf copied to clipboard
JSON tree traversal
Adds JSON tree traversal algorithm in host and device.
It generates column indices for record orient json format. List of structs at root, where each struct is a row.
- [x] column indices generation
- [x] row offset
Depends on PR https://github.com/rapidsai/cudf/pull/11518
Tree Traversal
This algorithm assigns a unique column id to each node in the tree. The row offset is the row index of the node in that column id. Algorithm:
- Convert node_category+fieldname to node_type. a. Create a hashmap to hash field name and assign unique node id as values. b. Convert the node categories to node types. Node type is defined as node category enum value if it is not a field node, otherwise it is the unique node id assigned by the hashmap (value shifted by #NUM_CATEGORY).
- Preprocessing: Translate parent node ids after sorting by level. a. sort by level b. get gather map of sorted indices c. translate parent_node_ids to new sorted indices
- Find level boundaries. copy_if index of first unique values of sorted levels.
- Per-Level Processing: Propagate parent node ids for each level. For each level, a. gather col_id from previous level results. input=col_id, gather_map is parent_indices. b. stable sort by {parent_col_id, node_type} c. scan sum of unique {parent_col_id, node_type} d. scatter the col_id back to stable node_level order (using scatter_indices) Restore original node_id order
- Generate row_offset. a. stable_sort by parent_col_id. b. scan_by_key {parent_col_id} (required only on nodes who's parent is list) c. propagate to non-list leaves from parent list node by recursion
Checklist
- [x] I am familiar with the Contributing Guidelines.
- [x] New or existing tests cover these changes.
- [x] The documentation is up to date with these changes.
Codecov Report
:exclamation: No coverage uploaded for pull request base (
branch-22.10@9a5f39a). Click here to learn what that means. Patch has no changes to coverable lines.
Additional details and impacted files
@@ Coverage Diff @@
## branch-22.10 #11610 +/- ##
===============================================
Coverage ? 87.52%
===============================================
Files ? 133
Lines ? 21774
Branches ? 0
===============================================
Hits ? 19058
Misses ? 2716
Partials ? 0
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.
:umbrella: View full report at Codecov.
:loudspeaker: Do you have feedback about the report comment? Let us know in this issue.
Thank you @elstehle @davidwendt @bdice @upsj @PointKernel ! 🎉 The PR got shaped into wonderful code. Always new perspective from each of the reviewers. Learned a lot. Great work! 👍
@gpucibot merge