[QST] Does the read_json() method support GPU acceleration?
I first read this article: GPU-Accelerated JSON Data Processing with RAPIDS
I followed it and used cudf.read_json(), but I got this warning:
UserWarning: Using CPU via Pandas to read JSON dataset, this may be GPU accelerated in the future
When I profile the cell with %%cudf.pandas.line_profile, it shows no GPU TIME.
However, when I first load cudf.pandas by running %load_ext cudf.pandas
and change import cudf as pd to import pandas as pd,
the warning still appears, but the profiler shows GPU TIME.
So I want to know: does the read_json() method support GPU acceleration?
Thank you @tx2002 for raising this issue. I believe the root cause is that cudf only supports GPU-accelerated JSON reading with orient="records" when lines=True is also set.
If you share a bit more about the contents of your json_data string, I would be happy to help troubleshoot.
Thank you for your reply. Actually, I ran the same code on the same json_data in both cases. The only difference is how cudf is imported.
My json_data is like this:
[{
"id":"1",
"Col_01":"test",
"Col_02":"77"
},
{
"id":"2",
"Col_01":"test",
"Col_02":"13552652142"
},
{
"id":"3",
"Col_01":"test",
"Col_02":""
},
{
"id":"4",
"Col_01":"test",
"Col_02":""
},
{
"id":"5",
"Col_01":"test",
"Col_02":"test"
}]
This is readable with orient="records", lines=False when engine="cudf" is passed explicitly. The following code works:
In [4]: df = cudf.read_json(StringIO(json_data), orient="records", lines=False, engine="cudf")
In [5]: df
Out[5]:
   id Col_01       Col_02
0   1   test           77
1   2   test  13552652142
2   3   test
3   4   test
4   5   test         test
In [6]: df.info()
<class 'cudf.core.dataframe.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   id      5 non-null      object
 1   Col_01  5 non-null      object
 2   Col_02  5 non-null      object
dtypes: object(3)
memory usage: 114.0+ bytes
By default, cudf.read_json uses the cudf engine only for JSON Lines input. It does not select the cudf engine automatically in other cases.
https://github.com/rapidsai/cudf/blob/20ed009003944be776e28c26301354be287726f9/python/cudf/cudf/io/json.py#L60-L61
Right now, the libcudf nested JSON reader supports orient="records" and orient="values" with lines=False or lines=True (all 4 combinations should work).
Could we enable it automatically for other supported formats as well?
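Until that changes, one possible workaround is to convert an array-of-records payload into JSON Lines before reading it, since JSON Lines is the case where the cudf engine is picked automatically. This is only a rough sketch using the standard library: records_to_json_lines is a hypothetical helper name, not part of cudf, and the commented-out cudf.read_json call assumes cudf and a GPU are available.

```python
import json


def records_to_json_lines(json_text: str) -> str:
    """Convert a JSON array of records into JSON Lines (one object per line)."""
    records = json.loads(json_text)
    if not isinstance(records, list):
        raise ValueError("expected a top-level JSON array of records")
    # json.dumps never emits raw newlines, so each record stays on one line.
    return "\n".join(json.dumps(rec) for rec in records) + "\n"


# Small sample shaped like the json_data posted in this thread.
json_data = (
    '[{"id": "1", "Col_01": "test", "Col_02": "77"},'
    ' {"id": "2", "Col_01": "test", "Col_02": ""}]'
)
json_lines = records_to_json_lines(json_data)
print(json_lines)

# Assumption: with cudf installed, the converted text could then be read as
# JSON Lines, for example:
#   from io import StringIO
#   import cudf
#   df = cudf.read_json(StringIO(json_lines), orient="records", lines=True)
```

The alternative already shown in this thread, passing engine="cudf" explicitly with lines=False, avoids the conversion step entirely, so this is mainly useful when the data will be read more than once.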
I'll add that the profiler output is probably confusing in this case. The profiler's GPU and CPU time columns really answer the question "did we run a cudf or a pandas function?" Here the cudf function dispatches to the CPU under the hood, so even if you call a cudf function directly and see 100% GPU time, what you are really seeing is that the call was handled by cudf while the work still ran on the CPU.
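Since the profiler columns cannot tell these cases apart, checking for the fallback warning may be a more reliable signal. Below is a minimal sketch using only the standard warnings module. fell_back_to_cpu and fake_reader are hypothetical names for illustration; fake_reader stands in for a read_json call and simply emits the warning text quoted at the top of this issue.

```python
import warnings

# Substring of the warning quoted in the original report.
FALLBACK_TEXT = "Using CPU via Pandas"


def fell_back_to_cpu(func, *args, **kwargs):
    """Call func and report whether it emitted the CPU-fallback UserWarning."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # make sure no warning is suppressed
        result = func(*args, **kwargs)
    fallback = any(
        issubclass(w.category, UserWarning) and FALLBACK_TEXT in str(w.message)
        for w in caught
    )
    return result, fallback


def fake_reader(text):
    """Stand-in for a reader that silently falls back to the CPU."""
    warnings.warn(
        "Using CPU via Pandas to read JSON dataset, "
        "this may be GPU accelerated in the future",
        UserWarning,
    )
    return text


_, used_cpu = fell_back_to_cpu(fake_reader, "[]")
print(used_cpu)
```

In real use, the callable passed in would be something like a lambda wrapping cudf.read_json with the arguments under test. That part is an assumption here, since it needs cudf installed.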