RFC: provide explicit functions for JSON operations
Describe your feature request
There are three distinct types of JSON operations we use within the Python APIs:
- Serde
- JSON
- NDJSON
I propose we explicitly separate the logic for these.
This is the inverse of the discussion in #3900 around describe_optimized_plan.
The JSON operations are too ambiguous to easily tell what a given function is doing, and different parameters can greatly change its behavior. I think we should split these out into more explicit function calls to make the code more readable.
The Serde JSON operations should be moved under a .serialize() method, NDJSON moved to *_json_lines, and the most common row-oriented JSON remains under *_json().
Refactor:
- LazyFrame.write_json -> LazyFrame.serialize(format='json')
- LazyFrame.read_json -> LazyFrame.deserialize(format='json')
- LazyFrame.scan_json -> LazyFrame.scan_json_lines
- DataFrame.write_json() -> DataFrame.serialize(format='json')
- DataFrame.write_json(row_oriented=True) -> DataFrame.write_json()
- DataFrame.write_json(json_lines=True) -> DataFrame.write_json_lines() (alternatively write_ndjson)
A .serialize method could also be used to support many other common Serde formats (bincode, msgpack, etc.):
- .serialize(format='json')
- .serialize(format='msgpack')
- .serialize(format='bincode')
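As a rough illustration, the proposed API could look like this in use (a sketch of the proposal above, not an existing API; the deserialize signature is an assumption):

import polars as pl

df = pl.DataFrame({'foo': [1, 2, 3], 'bar': [3, 2, 1]})

# the most common row-oriented JSON stays under *_json()
df.write_json('rows.json')

# NDJSON gets its own explicit function
df.write_json_lines('rows.ndjson')

# Serde (de)serialization moves under serialize/deserialize
payload = df.lazy().serialize(format='json')
lf = pl.LazyFrame.deserialize(payload, format='json')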
I think that makes sense. I do want to add to the docstrings that serialize(format="json") should be preferred if performance is important. We serialize column-oriented, so that should be much, much faster.

Shall we rename json_lines to ndjson?
> Shall we rename json_lines to ndjson?

I think both terms are interchangeable; just a matter of preference, I guess.
Regarding handling JSON and NDJSON in separate functions: I don't have a real sense of how 'similar' JSON and NDJSON are. I know pandas handles this as read_json(lines=True). We do the same for reading eagerly, but in LazyFrame we have separate functions for JSON/NDJSON. We should at least be internally consistent. Your proposal to handle them separately seems sensible to me. I am working on implementing this in my PR here: #4721.
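For reference, the pandas pattern mentioned above looks like this (file names are placeholders):

import pandas as pd

# pandas uses a single function; a flag switches between JSON and NDJSON
df_json = pd.read_json("data.json")
df_ndjson = pd.read_json("data.ndjson", lines=True)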
Regarding serialize/deserialize: I don't think I understand memory IO mechanics well enough to have a strong opinion here. In the code base currently, it seems we call this from_x/to_x instead of deserialize/serialize. I think that convention is generally OK. Using your proposal would create an umbrella function for all these, which could be nice. There are pros and cons. I don't feel very strongly about this.
I do think it's interesting that for JSON/NDJSON, you are proposing to 'break up' an umbrella function into separate functions, where for the other part you're proposing to create an umbrella function to incorporate various Serde actions. That seems inconsistent.
We should separate them. There will be a scan_ndjson, but no scan_json. The NDJSON format is suited for our optimizations, where JSON is not. We need to scan the whole JSON document to resolve the contexts.
Another reason I think we should break them up is that they are two different formats. I'd like the explicitness of that.
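A small illustration of why the two formats scan so differently (plain Python, file names are placeholders):

import json

# NDJSON: every line is a standalone JSON document, so rows can be read
# one at a time without parsing the rest of the file
with open("data.ndjson") as f:
    for line in f:
        record = json.loads(line)

# JSON: the whole document (e.g. one big array of objects) must be parsed
# before any single row is available
with open("data.json") as f:
    records = json.load(f)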
> I do think it's interesting that for JSON/NDJSON, you are proposing to 'break up' an umbrella function into separate functions, where for the other part you're proposing to create an umbrella function to incorporate various Serde actions. That seems inconsistent.
Yes, I agree we can make these explicit functions, even though it boils down to using the Serde architecture internally. For Python users this should be an implementation detail.
If I understand correctly, LazyFrame.from_json and LazyFrame.to_json are not really lazy operations. Should these methods even exist? The following API would be consistent with the rest of the code base:
File input:
- pl.read_json -> Read JSON eagerly, returns DataFrame (user must call .lazy() to proceed in lazy mode)
- pl.read_ndjson -> Read NDJSON eagerly, returns DataFrame
- pl.scan_ndjson -> Read NDJSON lazily, returns LazyFrame
File output (user must first call .collect() if in lazy mode):
- DataFrame.write_json -> Write to JSON file.
- DataFrame.write_ndjson -> Write to NDJSON file.
  - If I understand correctly, we have no way to write NDJSON lazily without collecting first. If we do, we would need a LazyFrame.write_ndjson too.
Object input (user must call .lazy() to proceed in lazy mode):
- DataFrame(data=str) -> New DataFrame constructor logic. If input data is a string, we will try to parse it as JSON.
Object output (user must first call .collect() if in lazy mode):
- DataFrame.to_json -> Convert to JSON string.
  - Currently, this is implemented as DataFrame.write_json(file=None). That is consistent with DataFrame.to_csv. I think that's quite nice. DataFrame.to_json could simply be an alias for this.
- DataFrame.to_ndjson -> Convert to NDJSON string.
Thoughts?
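To make that concrete, the flow would look roughly like this (a sketch of the proposed API; file and column names are placeholders):

import polars as pl

# eager read, then opt in to lazy mode explicitly
lf = pl.read_ndjson("data.ndjson").lazy()

# lazy query, collected before writing out eagerly
out = lf.filter(pl.col("foo") > 1).collect()
out.write_ndjson("out.ndjson")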
I think those two operations are definitely valuable. I agree that they are not 'technically' lazy operations. That was my initial reasoning behind the de/serialize naming convention. It made it clear that it was constructing the LazyFrame from a serialized JSON representation.
I'm not super familiar with the Python dunder methods, but if there is something we could leverage there that works with json.dumps / json.loads, maybe we could go that route instead?
> I think those two operations are definitely valuable. I agree that they are not 'technically' lazy operations.

Then wouldn't you agree it makes sense to construct a DataFrame instead, and call .lazy() on it? Or to call .collect() and then write?
The LazyFrame isn't set up to be constructed directly. Currently, LazyFrame.from_json is the ONLY reason to import LazyFrame directly (that I can see). This is very inconsistent with the rest of the code.
The LazyFrame JSON methods provide functionality that doesn't exist in eager mode: they allow us to instantiate an in-memory LazyFrame from a plan, instead of from an eager dataset.

With the to/from JSON methods, we can pass the logical plan off to any destination. For example, you could use this functionality to execute LazyFrames on a remote server: construct the LazyFrame on your client, then pass the serialized frame to the server for execution. There are many potential use cases that this functionality enables.
The lazy representation of an eager frame (via df.lazy()) contains the entirety of the serialized DataFrame, which obviously does not scale well if you are trying to send that via HTTP or other protocols. The lazy representation of something using scan_* is much more lightweight, though.
import polars as pl
eager_lf = pl.DataFrame({
    'foo': [1, 2, 3, 4, 5],
    'bar': [5, 4, 3, 2, 1]
}).lazy()
>>> eager_lf.write_json()
{
"DataFrameScan": {
"df": {
"columns": [
{ "name": "foo", "datatype": "Int64", "values": [1, 2, 3, 4, 5] },
{ "name": "bar", "datatype": "Int64", "values": [5, 4, 3, 2, 1] }
]
},
"schema": { "inner": { "foo": "Int64", "bar": "Int64" } },
"projection": null,
"selection": null
}
}
Serializing a LazyFrame that is not constructed via .lazy() gives us a JSON payload that only contains the instructions, and would be easily shareable across servers/platforms/languages.
>>> pl.scan_csv('./data.csv').write_json()
{
"CsvScan": {
"path": "./data.csv",
"schema": { "inner": { "foo": "Int64", "bar": "Utf8" } },
"options": {
"delimiter": 44,
"comment_char": null,
"quote_char": 34,
"eol_char": 10,
"has_header": true,
"skip_rows": 0,
"n_rows": null,
"with_columns": null,
"low_memory": false,
"ignore_errors": false,
"cache": true,
"null_values": null,
"rechunk": true,
"encoding": "Utf8",
"row_count": null,
"parse_dates": false,
"file_counter": 0
},
"predicate": null,
"aggregate": []
}
}
The potential applications that you could build on top of this functionality could be quite powerful, and I don't think it is something that should be removed. I do agree that it is inconsistent with the rest of the code. Perhaps there is an alternative solution that provides the same functionality and is consistent with the codebase?
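A rough sketch of the remote-execution idea (the transport, URL, and server-side helper are hypothetical; write_json/read_json are the existing plan (de)serialization methods, and wrapping the payload in StringIO is an assumption about read_json's signature):

import io
import requests
import polars as pl
from polars import LazyFrame

# client: build a lightweight plan against data that lives on the server
plan = pl.scan_csv("./data.csv").filter(pl.col("foo") > 1).write_json()
requests.post("http://example.com/execute", data=plan)

# server: rebuild the plan and execute it where the data actually is
def execute(plan_json: str) -> pl.DataFrame:
    lf = LazyFrame.read_json(io.StringIO(plan_json))
    return lf.collect()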
I agree with @universalmind303 here on the utility and huge potential of serializing the logical plan. This would allow us to scale up by running polars on Ray or Dask.
But I think we should utilize the Python architecture for serialization for this. What I mean is that read_*, scan_* and write_* are for file formats that hold tabular data.
For serializing the logical plan, I'd like pickle.dump/load and json.dump/load to work as expected, where pickle chooses a fast serialization format like bincode.
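One possible shape for the pickle side (purely a sketch: the helper and the __reduce__ wiring are hypothetical, it assumes write_json() with no file argument returns the serialized plan as a string as in the examples above, and a fast binary format like bincode would replace the JSON round-trip shown here):

import io
import pickle
import polars as pl
from polars import LazyFrame

def _from_plan(plan: str) -> LazyFrame:
    # hypothetical helper: rebuild a LazyFrame from its serialized plan
    return LazyFrame.read_json(io.StringIO(plan))

def _reduce(self):
    # __reduce__ tells pickle how to reconstruct the object: via the
    # serialized logical plan rather than the underlying data
    return (_from_plan, (self.write_json(),))

LazyFrame.__reduce__ = _reduce

# pickle.dump/load now round-trip the plan instead of the data
restored = pickle.loads(pickle.dumps(pl.scan_csv("./data.csv")))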
Thanks for the explanations! I am starting to understand this whole thing a lot better now. I misunderstood what the LazyFrame methods actually did; they operate on logical plans, not on actual data.
That means the current docstrings and deprecation message are incorrect/confusing; we should fix those.
I'm thinking about the best way to handle this in the Python API, haven't figured it out yet...
Functionality for reading/writing NDJSON has been split into separate functions as read_ndjson and DataFrame.write_ndjson. There is no scan_ndjson (yet).
> But I think we should utilize the Python architecture for serialization for this. What I mean is that read_*, scan_* and write_* are for file formats that hold tabular data.
What I understand from this is that renaming LazyFrame.read/write_json to LazyFrame.serialize/deserialize as proposed would be appropriate, as the names read_* / scan_* / write_* are reserved for tabular data. A format argument could be added later if we support other formats.
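Roughly (a sketch of the renamed methods; the exact deserialize signature is not settled here):

import io
import polars as pl

lf = pl.scan_csv("./data.csv")

plan = lf.serialize()                              # was: lf.write_json()
lf2 = pl.LazyFrame.deserialize(io.StringIO(plan))  # was: LazyFrame.read_json(...)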
I'll open a PR for this change.
> For serializing the logical plan, I'd like pickle.dump/load and json.dump/load to work as expected, where pickle chooses a fast serialization format like bincode.
We can pick this up as a separate issue.