cudf
[BUG] Add support for `force_ascii=False` when writing to JSON with cuDF engine
Describe the bug
Ideally, we should eventually support engine="cudf" and force_ascii=False together with to_json. For now, we should update the documentation and/or provide a warning for users.
Steps/Code to reproduce bug
import cudf
df = cudf.DataFrame({"a": [1,2,3], "b": ["4","5","🌱"]})
df.to_json("test.jsonl", orient="records", lines=True, engine="cudf", force_ascii=False)
produces a TypeError: write_json() got an unexpected keyword argument 'force_ascii'.
Calling df.to_json("test.jsonl", orient="records", lines=True, force_ascii=False) without the cuDF engine writes the emoji as-is to the .jsonl file, and calling df.to_json("test.jsonl", orient="records", lines=True, engine="cudf") writes the emoji as "\ud83c\udf31". What I cannot do is write the emoji as-is while also using the cuDF engine.
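For reference, the two representations are the same JSON value, just encoded differently. A minimal sketch using only Python's standard json module (not cuDF), where ensure_ascii plays the role that force_ascii plays in to_json:

```python
import json

record = {"a": 3, "b": "🌱"}

# ensure_ascii=True (the default) escapes non-ASCII characters as
# UTF-16 code units, so the emoji becomes a surrogate pair.
escaped = json.dumps(record, ensure_ascii=True)

# ensure_ascii=False writes the character itself.
raw = json.dumps(record, ensure_ascii=False)

print(escaped)  # {"a": 3, "b": "\ud83c\udf31"}
print(raw)      # {"a": 3, "b": "🌱"}

# Both strings parse back to the same record.
assert json.loads(escaped) == json.loads(raw) == record
```

So the file written with engine="cudf" is still valid JSON that any parser reads back correctly; the request here is only about how the file looks on disk.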
Environment details
I tested this with the latest cuDF version.
Adding to this: in my testing, functions like to_csv seem to preserve the Unicode text as-is, while to_json converts non-ASCII UTF-8 characters to their ASCII escape sequences before writing. So I'm guessing libcudf does support writing UTF-8 directly (maybe just not within to_json).
Is there any progress? Currently, there is also an exception when saving Chinese text.
No, this isn't something that we have prioritized yet unfortunately.
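Until this is supported, one possible stopgap is to post-process the file written by the cuDF engine on the host. The sketch below is an assumption-laden workaround, not a cuDF feature: unescape_jsonl is a hypothetical helper name, it uses only the standard library, and it re-serializes every line on the CPU, so it gives up the GPU speedup for that step and may change number or whitespace formatting.

```python
import json


def unescape_jsonl(src_path, dst_path):
    """Rewrite a JSON Lines file so non-ASCII text is stored as-is.

    Hypothetical helper: parses each line and dumps it again with
    ensure_ascii=False. Blank lines are skipped.
    """
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            line = line.strip()
            if not line:
                continue
            obj = json.loads(line)
            # Compact separators are a choice made here, not a claim
            # about how the cuDF writer formats its output.
            dst.write(json.dumps(obj, ensure_ascii=False,
                                 separators=(",", ":")) + "\n")
```

For example, after df.to_json("test.jsonl", orient="records", lines=True, engine="cudf"), calling unescape_jsonl("test.jsonl", "test_utf8.jsonl") would produce a copy with the emoji written directly.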
This feature is easy to implement: it only has to skip the step that escapes UTF-8 characters as UTF-16 code units. We need to add the option and skip the escape_strings_fn call at cudf/cpp/src/io/json/write_json.cu:548
It's a good first issue.
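For anyone picking this up, here is a rough Python sketch of the behavior such an option would need. It is an illustration only, not the libcudf code: escape_json_string and its force_ascii parameter are hypothetical names. One thing it tries to make explicit is that, whatever the writer does for non-ASCII text, quotes, backslashes, and control characters still have to be escaped for the output to remain valid JSON, so only the non-ASCII branch should be bypassed.

```python
def escape_json_string(s: str, force_ascii: bool = True) -> str:
    """Return s as a quoted JSON string (hypothetical illustration)."""
    short = {'"': '\\"', '\\': '\\\\', '\b': '\\b', '\f': '\\f',
             '\n': '\\n', '\r': '\\r', '\t': '\\t'}
    out = ['"']
    for ch in s:
        cp = ord(ch)
        if ch in short:
            # Always required, regardless of force_ascii.
            out.append(short[ch])
        elif cp < 0x20:
            # Remaining control characters must also be escaped.
            out.append(f"\\u{cp:04x}")
        elif cp < 0x80 or not force_ascii:
            # ASCII, or non-ASCII when the caller wants raw UTF-8.
            out.append(ch)
        elif cp < 0x10000:
            out.append(f"\\u{cp:04x}")
        else:
            # Outside the BMP: encode as a UTF-16 surrogate pair.
            cp -= 0x10000
            hi = 0xD800 + (cp >> 10)
            lo = 0xDC00 + (cp & 0x3FF)
            out.append(f"\\u{hi:04x}\\u{lo:04x}")
    out.append('"')
    return "".join(out)


print(escape_json_string("🌱"))                     # "\ud83c\udf31"
print(escape_json_string("🌱", force_ascii=False))  # "🌱"
```

With force_ascii=False the quote in a value like 5"🌱 is still escaped while the emoji passes through unchanged, which is the behavior the pandas engine already gives.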