cudf
Add JSON option to use dtypes as Filter
Description
Resolves https://github.com/rapidsai/cudf/issues/14951
This adds an option use_dtypes_as_filter to json_reader_options (default: False).
When set to True, dtypes is used as a filter rather than as a type inference suggestion. If dtypes (a vector of dtypes, a map of dtypes, or a nested schema) is not specified, the output is an empty dataframe.
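The filtering semantics described above can be modelled in plain Python. This is only a sketch of the intended behaviour for the map-of-dtypes case, not the libcudf implementation: the helper name filter_json_lines is hypothetical, and representing a key missing from a row as None is an assumption made for illustration.

```python
import json


def filter_json_lines(lines, dtypes=None):
    """Pure-Python model of use_dtypes_as_filter=True (hypothetical helper)."""
    # No dtypes given: nothing is selected, so the result has no columns.
    if not dtypes:
        return {}
    columns = {name: [] for name in dtypes}
    for line in lines:
        record = json.loads(line)
        for name, cast in dtypes.items():
            value = record.get(name)
            # Assumption: a key absent from this row becomes a null (None).
            columns[name].append(None if value is None else cast(value))
    return columns


rows = [
    '{"a": 1, "b": "x", "c": 2.5}',
    '{"a": 2, "c": 3.5}',
]
# Only "a" and "b" are requested, so "c" is never materialized.
filtered = filter_json_lines(rows, dtypes={"a": int, "b": str})
# With no dtypes, the filter selects nothing.
empty = filter_json_lines(rows)
```

Here filtered holds only the columns "a" and "b", and empty is an empty mapping, mirroring the "output is empty dataframe" case.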
Checklist
- [x] I am familiar with the Contributing Guidelines.
- [x] New or existing tests cover these changes.
- [ ] The documentation is up to date with these changes.
Profiled on a GV100 machine.
Reading JSON with 512 columns, 10k rows without filter
Reading 1 column out of JSON with 512 columns, 10k rows. ~(with filter 1 row)~ (with filter 1 column)
Unnecessary parse_data() calls are eliminated.
It's possible to eliminate the initialize_json_columns() calls as well (the runtime impact is smaller, but memory usage will drop; this depends on the map type PR #14936).
Thank you @karthikeyann, this is a great demonstration! When you mention:
Reading 1 columns out of JSON with 512 columns, 10k rows. (with filter 1 row)
What do you mean by "filter 1 row"?
Sorry. I meant to type "filter 1 column".
keys.json content in each line:
{"key_109": "value0", "key_200": "value0", "key_342": "value0", ... } (500 keys out of 512 columns in each row)
import cudf
import nvtx

# read all 512 columns
with nvtx.annotate("read_json", color="purple"):
    df = cudf.read_json(open("keys.json"), engine="cudf", lines=True)

# read only 1 column
with nvtx.annotate("read_json", color="purple"):
    df = cudf.read_json(open("keys.json"), engine="cudf", lines=True, dtype={"key_10": str}, use_dtypes_as_filter=True)
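For anyone wanting to reproduce the comparison, an input file with the shape described above (500 of 512 keys per line, 10k lines) could be generated roughly as follows. This is a sketch under assumptions: the helper write_keys_file is hypothetical, the key names key_0 through key_511 and the random choice of keys per row are guesses based on the sample line, and the original keys.json may have been built differently.

```python
import json
import os
import random
import tempfile


def write_keys_file(path, n_rows=10_000, n_columns=512, keys_per_row=500, seed=0):
    """Write JSON Lines where each row holds a random subset of the columns."""
    rng = random.Random(seed)
    with open(path, "w") as f:
        for _ in range(n_rows):
            # Pick which columns appear in this row; the rest are absent.
            chosen = sorted(rng.sample(range(n_columns), keys_per_row))
            row = {f"key_{i}": "value0" for i in chosen}
            f.write(json.dumps(row) + "\n")


# Small demo in a temporary directory.
demo_path = os.path.join(tempfile.mkdtemp(), "keys_demo.json")
write_keys_file(demo_path, n_rows=3)
with open(demo_path) as f:
    demo_rows = [json.loads(line) for line in f]
```

Calling write_keys_file("keys.json") with the defaults would produce a full-size file for the two read_json calls above.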
/ok to test
Eliminated the initialize_json_columns() calls for filtered columns.
The first read_json is without the filter; the second read_json is with the filter enabled.
The numbers are for relative comparison only.
Do we need to include this option in the Java code as well?
Yes. @revans2 Should I include the Java code changes in this PR as well?
/merge