
[BUG] Add support for `force_ascii=False` when writing to JSON with cuDF engine

Open sarahyurick opened this issue 1 year ago • 3 comments

Describe the bug

Ideally, we should eventually support engine="cudf" and force_ascii=False together with to_json. For now, we should update the documentation and/or provide a warning for users.

Steps/Code to reproduce bug

import cudf

df = cudf.DataFrame({"a": [1,2,3], "b": ["4","5","🌱"]})
df.to_json("test.jsonl", orient="records", lines=True, engine="cudf", force_ascii=False)

produces a TypeError: write_json() got an unexpected keyword argument 'force_ascii'.

Without engine="cudf", df.to_json("test.jsonl", orient="records", lines=True, force_ascii=False) works and the emoji appears as-is in the .jsonl file. With engine="cudf" and no force_ascii argument, df.to_json("test.jsonl", orient="records", lines=True, engine="cudf") also works, but the emoji is written as "\ud83c\udf31". There is currently no way to write with the cuDF engine and keep the emoji unescaped in the file.
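To make the two representations concrete, here is a minimal sketch using only Python's standard json module (not cuDF). Its ensure_ascii flag is used here as a stand-in for force_ascii; that mapping is an assumption made for illustration. Both outputs decode to the same data, so the difference is only in how the file looks on disk.

```python
import json

record = {"a": 3, "b": "🌱"}

# ASCII-only output: the emoji becomes a UTF-16 surrogate pair escape,
# the same form reported above for engine="cudf".
escaped = json.dumps(record, ensure_ascii=True)

# Literal output: the emoji is kept as-is,
# the form reported above for force_ascii=False.
literal = json.dumps(record, ensure_ascii=False)

print(escaped)
print(literal)

# Any conforming JSON parser reads both back to the same record.
assert json.loads(escaped) == json.loads(literal) == record
```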

Environment details

I tested this with the latest cuDF version.

sarahyurick avatar Mar 01 '24 22:03 sarahyurick

Adding to this, in my testing, functions like to_csv preserve Unicode characters as-is, while to_json converts non-ASCII UTF-8 characters to their ASCII escape sequences before writing. So I'm guessing libcudf does support writing UTF-8 directly (maybe just not within to_json).
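Until the cuDF engine accepts force_ascii=False, one possible workaround is to post-process the escaped file on the CPU. The sketch below is not from this thread and uses only the standard library; the helper name unescape_jsonl is hypothetical. It assumes each line of the input is one complete JSON record, as produced by orient="records", lines=True.

```python
import json


def unescape_jsonl(src, dst):
    """Rewrite a JSON Lines file so non-ASCII characters appear literally.

    Hypothetical helper: parses each line and re-serializes it with
    ensure_ascii=False. Compact separators keep the output free of the
    extra spaces json.dumps adds by default.
    """
    with open(src, encoding="utf-8") as fin, \
            open(dst, "w", encoding="utf-8") as fout:
        for line in fin:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            fout.write(
                json.dumps(record, ensure_ascii=False, separators=(",", ":"))
                + "\n"
            )
```

For example, unescape_jsonl("test.jsonl", "test_utf8.jsonl") would be run after the to_json call with engine="cudf". Because every record is parsed and re-serialized, number formatting may differ slightly from the original file, and the extra CPU pass gives up part of the benefit of writing on the GPU.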

ayushdg avatar Mar 01 '24 23:03 ayushdg

Is there any progress? Currently, there is also an exception when saving Chinese text.

simplew2011 avatar Jun 13 '24 03:06 simplew2011

No, this isn't something that we have prioritized yet unfortunately.

vyasr avatar Jun 24 '24 23:06 vyasr

This feature is easy to implement: it just skips the UTF-8/UTF-16 encoding step. We need to add the option and skip the escape_strings_fn call at cudf/cpp/src/io/json/write_json.cu:548. It's a good first issue.

karthikeyann avatar Sep 27 '24 02:09 karthikeyann