[FEA] [JSON reader] Support column pruning
This is part of the feature request tracked at https://github.com/NVIDIA/spark-rapids/issues/9.
We have a JSON file with the lines below:
{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}
When specifying only one column to read (name, as a string type), the JSON reader throws an exception:
ai.rapids.cudf.CudfException: cuDF failure at: /home/bobwang/work.d/nvspark/cudf/cpp/src/io/json/reader_impl.cu:421: Must specify types for all columns
It looks like the JSON reader requires types for every column, or full schema inference with no columns specified at all.
We would like the JSON reader to read just the columns that the user specifies, rather than requiring all column names.
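For reference, the same limitation can be reproduced from Python, since the Java API goes through the same libcudf reader. This is only a minimal sketch, assuming a cudf build from around the time this was filed; the dtype subset and the exact error text are illustrative:

```python
import cudf

data = '{"name":"Michael"}\n{"name":"Andy", "age":30}\n{"name":"Justin", "age":19}'

# Ask for only one of the two columns present in the data. With the reader
# as it behaved at the time, this raises instead of pruning to "name".
try:
    df = cudf.read_json(data, lines=True, dtype={"name": "str"})
except Exception as err:
    print(err)  # e.g. "Must specify types for all columns"

# Desired behavior: a single "name" column (Michael, Andy, Justin), with the
# unrequested "age" column either dropped or left to type inference.
```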
This is a blocker for Spark being able to use the JSON reader, because we do not know all of the columns; the user only gives us the ones that they want to read.
Would it be viable to read all columns and then select the ones of interest?
Not entirely. In general we rely on Spark to tell us the schema of the data we want to read; we pass that on to cuDF so it can select the correct columns and return them to us in the format we want. The Java API does not even have a way to tell us which columns were returned, which we need to fix anyway. But even if it did tell us the names of all of the columns, we would have to ask cuDF to resolve the schema for us on each read, then throw away the columns we didn't want and cast all of the columns we did find into the schema that Spark requested.

This could work, but it is not an ideal long-term solution, especially because Spark parses a lot of values very differently from how cuDF does. Part of the plan was to ask cuDF to return everything as strings so that our own code could parse the values in a way that is much closer to how Spark does it. I don't see how we can ask for everything to be strings without knowing up front which columns we want.

This does not have to be done for 22.02. It would be great if it is done in time, but we have already decided that JSON parsing will be off by default in Spark for 22.02, as an experimental feature that someone could try.
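For completeness, the interim "read everything, then select and cast" flow described above could look roughly like this on the Python side. This is only a sketch: the requested_schema mapping is a made-up stand-in for the schema Spark hands us, and the Spark-vs-cuDF parsing differences mentioned above are not handled here.

```python
import cudf

data = '{"name":"Michael"}\n{"name":"Andy", "age":30}\n{"name":"Justin", "age":19}'

# Hypothetical schema requested by Spark (column name -> type).
requested_schema = {"name": "str", "age": "int64"}

# 1. Let cuDF infer the full schema and materialize every column.
full = cudf.read_json(data, lines=True)

# 2. Drop the columns Spark did not ask for and cast the rest to the
#    requested types. Columns missing from the file are not handled here.
pruned = full[list(requested_schema)].astype(requested_schema)
print(pruned)
```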
I'm asking about this as a short-term solution, because column pruning would need to be reworked when we add nested type support.
Does this feature request include pruning of nested columns (same as https://github.com/rapidsai/cudf/issues/8848)?
Long term, yes, we would want to be able to prune child columns as well. Unless the change is simple in the short term, I would rather have us concentrate on getting a long-term solution.
I'm currently trying to figure out what the interface for this would look like for the new nested JSON reader.
Would it be sufficient to take a nested schema of the columns that are to be selected, where any [child] column that is not explicitly selected in that schema would not appear in the nested data being returned?
JSON lines input:
{"a":0.0,"b":{"x":0.10, "y":0.11}}
{"a":1.0,"b":{"x":1.10, "y":1.11}}
Schema:
├─ a/
├─ b/
│ ├─ b.x
│ ├─ b.y
-- EX 1 --
Select schema:
[a, b:[x,y]]
Schema returned:
├─ a/
├─ b/
│ ├─ b.x
│ ├─ b.y
-- EX 2 --
Select schema:
[a, b:[y]]
Schema returned:
├─ a/
├─ b/
│ ├─ b.y
-- EX 3 --
Select schema:
[b]
Schema returned:
├─ b/
...which would just be a struct column with validity and no child columns (as no child columns were _selected_).
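To make the intent of EX 2 concrete, here is a hand-built illustration of the pruned result in Python. It is constructed directly rather than produced by the reader, and it is not a proposal for the reader API, just the shape of the output under that select schema:

```python
import cudf

# Expected shape of the EX 2 result (select schema [a, b:[y]]): "b" is still
# a struct column with its own validity, but it only carries the selected
# child "y"; the unselected child "x" is gone.
expected = cudf.DataFrame(
    {
        "a": [0.0, 1.0],
        "b": cudf.Series([{"y": 0.11}, {"y": 1.11}]),
    }
)
print(expected)
print(expected.dtypes)
```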
> Would it be sufficient to take a nested schema of the columns that are to be selected, where any [child] column that is not explicitly selected in that schema would not appear in the nested data being returned?
Yes that would work for us. We have the full list of what we want to read.
After doing some testing on the 23.02 branch, the nested JSON reader no longer throws when dtype is specified for a subset of columns:
import cudf

df = cudf.read_json('{"a": 1}\n{"b":1}', lines=True, dtype={'a':'int'}, engine='cudf_experimental')
df
      a     b
0     1  <NA>
1  <NA>   1.0
Reading and inferring types for the unspecified columns seems like the desired behavior. If we wanted to drop unspecified columns as a performance improvement, I expect the results would be underwhelming, given all of the parsing work we would still have to do.
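If pruned output is still wanted despite that, the unspecified columns can simply be dropped after the read, keyed off the dtype mapping. A small sketch using the same call as above; this is post-hoc selection only, so it saves none of the parsing work:

```python
import cudf

dtype = {"a": "int"}
df = cudf.read_json('{"a": 1}\n{"b":1}', lines=True, dtype=dtype,
                    engine='cudf_experimental')

# Keep only the explicitly requested columns.
df = df[list(dtype)]
print(df)
```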
Please let me know if this issue is still needed.
I believe we can close this in favor of https://github.com/rapidsai/cudf/issues/13473