Add JSON option to use dtypes as Filter

Open karthikeyann opened this issue 1 year ago • 9 comments
Description

Resolves https://github.com/rapidsai/cudf/issues/14951. This adds an option, use_dtypes_as_filter, to json_reader_options (default False). When set to True, dtypes is used as a column filter instead of as a type-inference suggestion. If dtypes (a vector of dtypes, a map of dtypes, or a nested schema) is not specified, the output is an empty dataframe.
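
To make the described semantics concrete, here is a minimal pure-Python model of the option. It is only a sketch and not the cudf implementation: the function name read_json_lines_model is hypothetical, dtype is restricted to a flat name-to-type map, and skipping dtype entries that never appear in the data is an assumption of this sketch.

```python
import json


def read_json_lines_model(lines, dtype=None, use_dtypes_as_filter=False):
    """Toy model of the option described above (hypothetical helper, not cudf).

    `dtype` is a flat {column_name: python_type} map. Returns {column: [values]}.
    """
    records = [json.loads(line) for line in lines if line.strip()]

    # Collect column names in first-seen order across all rows.
    all_columns = []
    for record in records:
        for key in record:
            if key not in all_columns:
                all_columns.append(key)

    if use_dtypes_as_filter:
        # dtype acts as a filter: only the listed columns are kept.
        # With no dtype, nothing is selected and the result is empty.
        wanted = set(dtype) if dtype else set()
        # Assumption of this sketch: dtype entries absent from the data are skipped.
        columns = [c for c in all_columns if c in wanted]
    else:
        # dtype is only a type suggestion: every column is kept.
        columns = all_columns

    def convert(column, value):
        # Apply the requested type where one was given; leave missing values as None.
        if value is None or not dtype or column not in dtype:
            return value
        return dtype[column](value)

    return {c: [convert(c, r.get(c)) for r in records] for c in columns}
```

With dtype={"a": str} and the filter off, every column comes back and only "a" is converted; with the filter on, only "a" comes back; with the filter on and no dtype, the result is empty.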

Checklist

  • [x] I am familiar with the Contributing Guidelines.
  • [x] New or existing tests cover these changes.
  • [ ] The documentation is up to date with these changes.

karthikeyann avatar Feb 07 '24 17:02 karthikeyann

Profiled on a GV100 machine. Reading JSON with 512 columns and 10k rows, without the filter: [image]

Reading 1 column out of JSON with 512 columns and 10k rows ~(with filter 1 row)~ (with filter 1 column): [image]

Unnecessary parse_data() calls are eliminated. It is possible to eliminate the initialize_json_columns() calls as well (the runtime impact is smaller, but memory usage would drop; this depends on the map type PR #14936).

karthikeyann avatar Feb 07 '24 17:02 karthikeyann

Thank you @karthikeyann, this is a great demonstration! When you mention:

Reading 1 columns out of JSON with 512 columns, 10k rows. (with filter 1 row)

What do you mean by "filter 1 row"?

GregoryKimball avatar Feb 07 '24 21:02 GregoryKimball

What do you mean by "filter 1 row"?

Sorry. I meant to type "filter 1 column".

Each line of keys.json looks like this: {"key_109": "value0", "key_200": "value0", "key_342": "value0", ... } (each row has 500 keys out of the 512 columns)

import cudf
import nvtx

# Read all 512 columns (dtype not given, so nothing is filtered).
with nvtx.annotate("read_json", color="purple"):
    df = cudf.read_json("keys.json", engine="cudf", lines=True)

# Read only 1 column: dtype names the column to keep.
with nvtx.annotate("read_json", color="purple"):
    df = cudf.read_json(
        "keys.json",
        engine="cudf",
        lines=True,
        dtype={"key_10": str},
        use_dtypes_as_filter=True,
    )
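
For anyone who wants to reproduce a similar input, the sketch below writes a JSON Lines file shaped like the keys.json described above. The helper name write_keys_json, the key numbering key_0 through key_511, the random choice of which keys appear in each row, and the seed are all assumptions; the actual file used for the profile was not posted.

```python
import json
import random


def write_keys_json(path, n_rows=10_000, n_columns=512, keys_per_row=500, seed=0):
    """Write a JSON Lines file where each row has `keys_per_row` of `n_columns` keys.

    Hypothetical helper: key names and the per-row random selection are assumed.
    """
    rng = random.Random(seed)
    names = [f"key_{i}" for i in range(n_columns)]
    with open(path, "w") as f:
        for _ in range(n_rows):
            # Pick a random subset of columns for this row, kept in index order.
            chosen = sorted(rng.sample(range(n_columns), keys_per_row))
            row = {names[i]: "value0" for i in chosen}
            f.write(json.dumps(row) + "\n")
```

Calling write_keys_json("keys.json") produces a file with 10k rows that the snippet above can read.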

karthikeyann avatar Feb 08 '24 06:02 karthikeyann

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

copy-pr-bot[bot] avatar Apr 08 '24 20:04 copy-pr-bot[bot]

/ok to test

karthikeyann avatar Apr 08 '24 20:04 karthikeyann

/ok to test

karthikeyann avatar Apr 10 '24 18:04 karthikeyann

/ok to test

karthikeyann avatar Apr 11 '24 03:04 karthikeyann

Eliminated the initialize_json_columns() calls for filtered-out columns. The first read_json is without the filter; the second read_json is with the filter enabled. [image] The numbers are for relative comparison only.

karthikeyann avatar Apr 24 '24 04:04 karthikeyann

/ok to test

karthikeyann avatar Apr 24 '24 04:04 karthikeyann

/ok to test

karthikeyann avatar Apr 30 '24 16:04 karthikeyann

/ok to test

karthikeyann avatar Apr 30 '24 16:04 karthikeyann

/ok to test

karthikeyann avatar Apr 30 '24 17:04 karthikeyann

Do we need to include this option in the Java code as well?

Yes. @revans2 Should I include the Java code changes in this PR as well?

karthikeyann avatar May 01 '24 02:05 karthikeyann

/merge

karthikeyann avatar May 01 '24 22:05 karthikeyann

/ok to test

karthikeyann avatar May 02 '24 17:05 karthikeyann