
[QST] Does the read_json() method support GPU acceleration?

Open tx2002 opened this issue 1 year ago • 4 comments

I first read the article "GPU-Accelerated JSON Data Processing with RAPIDS" and followed it to use cudf.read_json(), but I get this warning: UserWarning: Using CPU via Pandas to read JSON dataset, this may be GPU accelerated in the future. When I profile the cell with %%cudf.pandas.line_profile, it shows no GPU time.

However, when I load cudf.pandas first by running %load_ext cudf.pandas and change import cudf as pd to import pandas as pd, the warning still appears, but the profiler now shows GPU time. So I want to know: does the read_json() method support GPU acceleration?

tx2002 avatar Dec 25 '23 10:12 tx2002

Thank you @tx2002 for raising this issue. I believe the root cause is that cudf only supports GPU-accelerated JSON reading with orient="records" when lines=True is also set.

If you share a bit more about the contents of your string json_data I would be happy to help troubleshoot.
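To make the distinction concrete, here is a minimal sketch, using only the Python standard library, that converts a JSON array of records into JSON Lines (one object per line), the layout that orient="records", lines=True expects. The helper name to_json_lines and the sample data are illustrative assumptions, and the commented-out cudf.read_json call needs a GPU environment to run.

```python
import json

def to_json_lines(json_array_text: str) -> str:
    """Convert a JSON array of objects into JSON Lines text (hypothetical helper)."""
    records = json.loads(json_array_text)
    if not isinstance(records, list):
        raise ValueError("expected a top-level JSON array")
    # One compact JSON object per line, newline-terminated.
    return "\n".join(json.dumps(rec) for rec in records) + "\n"

# Illustrative sample data, not taken from the issue.
sample = '[{"id": "1", "Col_01": "test"}, {"id": "2", "Col_01": "test"}]'
json_lines = to_json_lines(sample)
print(json_lines)

# With cudf installed on a GPU machine, this text could then be read with:
#   import cudf
#   from io import StringIO
#   df = cudf.read_json(StringIO(json_lines), orient="records", lines=True)
```

Converting up front is only one option; it is mainly useful for checking whether the input layout, rather than the import style, decides which engine is used.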

GregoryKimball avatar Jan 25 '24 05:01 GregoryKimball

Thank you for your reply. I ran the same code on the same json_data in both cases; the only difference is how cudf is imported. My json_data looks like this:

[{
"id":"1",
"Col_01":"test",
"Col_02":"77"
},
{
"id":"2",
"Col_01":"test",
"Col_02":"13552652142"
},
{
"id":"3",
"Col_01":"test",
"Col_02":""
},
{
"id":"4",
"Col_01":"test",
"Col_02":""
},
{
"id":"5",
"Col_01":"test",
"Col_02":"test"
}]

tx2002 avatar Jan 25 '24 05:01 tx2002

This data is readable with orient="records", lines=False when engine="cudf" is passed explicitly. The following code works:

In [4]: df = cudf.read_json(StringIO(json_data), orient="records",  lines=False, engine="cudf")
In [5]: df
Out[5]: 
  id Col_01       Col_02
0  1   test           77
1  2   test  13552652142
2  3   test             
3  4   test             
4  5   test         test

In [6]: df.info()
<class 'cudf.core.dataframe.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   id      5 non-null      object
 1   Col_01  5 non-null      object
 2   Col_02  5 non-null      object
dtypes: object(3)
memory usage: 114.0+ bytes

karthikeyann avatar Feb 06 '24 16:02 karthikeyann

cudf.read_json uses the cudf engine automatically for JSON Lines only; it does not select the cudf engine automatically in other cases. https://github.com/rapidsai/cudf/blob/20ed009003944be776e28c26301354be287726f9/python/cudf/cudf/io/json.py#L60-L61

Right now, the libcudf nested JSON reader supports orient="records" and orient="values" with lines=False or lines=True (all 4 combinations should work). Could we enable it automatically for these other supported formats as well?
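The difference between the current default and the proposal can be sketched as two small selection functions. This is a simplified model of the behavior described in this comment, not cudf's actual implementation; the function names and the GPU_SUPPORTED_ORIENTS set are hypothetical.

```python
# Simplified model of the engine selection discussed here; NOT cudf's real code.
GPU_SUPPORTED_ORIENTS = {"records", "values"}  # orients named in the comment above

def current_default_engine(lines: bool) -> str:
    """Current default as described: cudf is chosen only for JSON Lines input."""
    return "cudf" if lines else "pandas"

def proposed_default_engine(orient: str, lines: bool) -> str:
    """Proposal: choose cudf for every layout the nested JSON reader handles.

    `lines` is accepted for symmetry; both values are treated as supported.
    """
    return "cudf" if orient in GPU_SUPPORTED_ORIENTS else "pandas"

# Compare the two policies for the four combinations mentioned above.
for orient in ("records", "values"):
    for lines in (False, True):
        print(orient, lines,
              current_default_engine(lines),
              proposed_default_engine(orient, lines))
```

Under this model, the json_data from this issue (orient="records", lines=False) falls back to pandas today unless engine="cudf" is passed explicitly, and would use cudf under the proposal.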

karthikeyann avatar Feb 06 '24 16:02 karthikeyann

I'll add that the profiler output is probably confusing in this case. The profiler's GPU vs. CPU time columns really answer the question "did we run a cudf function or a pandas function?" Here, the cudf function itself dispatches to the CPU under the hood. So even if you call a cudf function directly and see 100% GPU time, what you're really seeing is that the call was handled by cudf but ultimately still ran on the CPU.
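Since the profiler columns cannot reveal this fallback, one alternative is to watch for the warning itself. The following is a minimal sketch, assuming the fallback always emits the warning text quoted earlier in this thread; ran_with_cpu_fallback and fake_reader are hypothetical names, and fake_reader is only a stand-in so the example runs without a GPU.

```python
import warnings

FALLBACK_HINT = "Using CPU via Pandas"  # fragment of the warning quoted in this thread

def ran_with_cpu_fallback(reader, *args, **kwargs):
    """Call `reader` and report whether it emitted the CPU-fallback warning."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # make sure no warning is suppressed
        result = reader(*args, **kwargs)
    fell_back = any(FALLBACK_HINT in str(w.message) for w in caught)
    return result, fell_back

def fake_reader(text):
    """Stand-in for a reader that falls back to the CPU (illustration only)."""
    warnings.warn(
        "Using CPU via Pandas to read JSON dataset, "
        "this may be GPU accelerated in the future",
        UserWarning,
    )
    return text

_, fell_back = ran_with_cpu_fallback(fake_reader, "[]")
print(fell_back)
```

With cudf available, the same helper could wrap the real call, for example ran_with_cpu_fallback(cudf.read_json, StringIO(json_data)).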

vyasr avatar Nov 07 '24 20:11 vyasr