cudf Add JSON option to use dtypes as Filter


Description

Resolves https://github.com/rapidsai/cudf/issues/14951. This adds an option, use_dtypes_as_filter, to json_reader_options (default False). When set to True, dtypes is used as a filter instead of as a type inference suggestion. If dtypes (vector of dtypes, map of dtypes, or nested schema) is not specified, the output is an empty dataframe.
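To make the intended semantics concrete, here is a minimal pure-Python sketch of "dtypes as filter" for JSON Lines input. It is not the cudf implementation and does not touch the GPU reader; the function name read_json_lines_filtered is hypothetical, and filling missing keys with None is an assumption of this sketch, not a statement about what cudf does.

```python
import io
import json


def read_json_lines_filtered(text, dtypes):
    """Illustrative only: keep just the keys named in `dtypes`.

    `dtypes` maps column name -> Python callable used to cast values.
    An empty `dtypes` yields no columns, mirroring the "empty dataframe"
    behaviour described for the option.
    """
    columns = {name: [] for name in dtypes}
    for line in io.StringIO(text):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        for name, cast in dtypes.items():
            value = record.get(name)
            # Assumption of this sketch: a key missing from a row becomes None.
            columns[name].append(None if value is None else cast(value))
    return columns


sample = '{"a": 1, "b": "x", "c": 2.5}\n{"a": 2, "c": 3.5}\n'
filtered = read_json_lines_filtered(sample, {"a": int, "c": float})
empty = read_json_lines_filtered(sample, {})
```

Column "b" is never materialized in `filtered`, and `empty` has no columns at all, which is the behaviour the description outlines when no dtypes are given.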

Checklist

[x] I am familiar with the Contributing Guidelines.
[x] New or existing tests cover these changes.
[ ] The documentation is up to date with these changes.

Feb 07 '24 17:02 karthikeyann

Profiled on a GV100 machine. Reading JSON with 512 columns, 10k rows, without filter:

Reading 1 column out of JSON with 512 columns, 10k rows (with filter, 1 column):

Unnecessary parse_data() calls are eliminated. It's possible to eliminate the initialize_json_columns() calls as well, but the runtime impact is smaller; memory usage will decrease, and the change depends on the map type PR #14936.

Feb 07 '24 17:02 karthikeyann

Thank you @karthikeyann, this is a great demonstration! When you mention:

Reading 1 columns out of JSON with 512 columns, 10k rows. (with filter 1 row)

What do you mean by "filter 1 row"?

Feb 07 '24 21:02 GregoryKimball

What do you mean by "filter 1 row"?

Sorry. I meant to type "filter 1 column".

keys.json content in each line: {"key_109": "value0", "key_200": "value0", "key_342": "value0", ... } (500 keys out of 512 columns in each row)

import cudf
import nvtx
# read all 512 columns
with nvtx.annotate("read_json", color="purple"):
    df = cudf.read_json(open("keys.json"), engine="cudf", lines=True)
# read only 1 column
with nvtx.annotate("read_json", color="purple"):
    df = cudf.read_json(open("keys.json"), engine="cudf", lines=True, dtype={"key_10": str}, use_dtypes_as_filter=True)
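For anyone who wants to reproduce a similar input, the following is a rough sketch of a generator for a file shaped like keys.json as described above (500 of 512 keys per row, every value "value0"). The helper names, the 0-based key numbering, the random choice of which keys are present, and the fixed seed are all assumptions; the actual file used for profiling may have been produced differently.

```python
import json
import random


def make_rows(num_rows=10_000, num_columns=512, keys_per_row=500, seed=0):
    """Yield dicts with `keys_per_row` of the `num_columns` possible keys."""
    rng = random.Random(seed)  # fixed seed so the file is reproducible
    for _ in range(num_rows):
        # Assumption: keys are numbered key_0 .. key_{num_columns - 1}.
        chosen = sorted(rng.sample(range(num_columns), keys_per_row))
        yield {f"key_{i}": "value0" for i in chosen}


def write_json_lines(path, rows):
    """Write one JSON object per line (JSON Lines)."""
    with open(path, "w") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")


# Example usage (writes a large file, so it is left commented out):
# write_json_lines("keys.json", make_rows())
```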

Feb 08 '24 06:02 karthikeyann

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Apr 08 '24 20:04 copy-pr-bot[bot]

/ok to test

Apr 08 '24 20:04 karthikeyann

/ok to test

Apr 10 '24 18:04 karthikeyann

/ok to test

Apr 11 '24 03:04 karthikeyann

Eliminated the initialize_json_columns calls for filtered columns. The first read_json is without the filter; the second read_json is with the filter enabled. The numbers are for relative comparison only.

Apr 24 '24 04:04 karthikeyann

/ok to test

Apr 24 '24 04:04 karthikeyann

/ok to test

Apr 30 '24 16:04 karthikeyann

/ok to test

Apr 30 '24 16:04 karthikeyann

/ok to test

Apr 30 '24 17:04 karthikeyann

Do we need to include this option in the Java code as well?

Yes. @revans2, should I include the Java code changes in this PR as well?

May 01 '24 02:05 karthikeyann

/merge

May 01 '24 22:05 karthikeyann

/ok to test

May 02 '24 17:05 karthikeyann
